Rauch, J. - Šimůnek, M.: Alternative Approach to Mining Association Rules. in FDM 2002, The Foundation of Data Mining and
Knowledge Discovery, The Proceedings of the Workshop of ICDM02, pp 157-162.
Alternative Approach to Mining Association Rules
Jan Rauch (1), Milan Šimůnek (1, 2)
(1) Faculty of Informatics and Statistics, University of Economics Prague, Czech Republic
(2) Institute of Computer Sciences, Czech Academy of Sciences, Czech Republic
[email protected], simunek@{vse.cz, cs.cas.cz}
Abstract
An alternative approach to mining association rules is
described. Some special techniques and algorithms are
used that lead to a much richer syntax of association
rules with only linear complexity of computation. A free
and open system, LISp-Miner, implements these algorithms
and can serve as a demonstration of the techniques used. The
same techniques can be used in other kinds of mining, e.g.
multi-relation mining and conditional frequency analysis.
1. Introduction
An association rule is commonly understood as an
expression of the form X→Y, where X and Y are sets
of items. The intuitive meaning is that transactions (e.g.
supermarket baskets) containing the set X of items tend to
contain the set Y of items. Two measures of the intensity of
an association rule are used: confidence and support.
An association rule discovery task is a task to find all
association rules of the form X→Y such that the support
and confidence of X→Y are above the user-defined
thresholds minsup and minconf.
The conventional algorithm for association rule
discovery proceeds in two steps. All frequent itemsets are
found in the first step. A frequent itemset is an itemset
that is included in at least minsup transactions. The
association rules with confidence of at least minconf are
generated in the second step [1].
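The two measures can be illustrated by a minimal sketch. The transactions below are hypothetical toy data, and support is counted as an absolute number of transactions, following the definition of a frequent itemset given above.

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

x, y = {"bread"}, {"milk"}
rule_support = support(x | y, transactions)       # 2 transactions contain bread and milk
rule_confidence = confidence(x, y, transactions)  # 2 of the 3 bread transactions
```

The rule X→Y is reported only if both values reach the thresholds minsup and minconf.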
Particular items can be represented by Boolean
attributes and a Boolean data matrix can represent the
whole set of transactions. The algorithm can be modified
to deal with attributes with more than two values. Thus,
the association rules of the form e.g. A(a1)∧B(b3)→C(c7)
can be mined. We suppose that the attribute A has k
particular values a1, …, ak. The expression A(a1) denotes
the Boolean attribute that is true if the value of attribute A
is a1 etc.
The goal of this paper is to draw attention to an
alternative approach for mining association rules based on
representation of each possible value of each attribute by
a single string of bits. It is possible to mine for association
rules of the form e.g. A(α) ∧ B(β) → C(δ) where α is a
coefficient (a subset of all the possible values) of the
attribute A. The expression A(α) denotes the Boolean
attribute that is true for particular row of data matrix if the
value of A in this row belongs to α, similarly for B(β) and
C(δ).
The bit string approach also makes it possible to easily
compute all necessary frequencies. Then we can mine not
only for association rules based on confidence and
support but also for rules corresponding to various further
relations of Boolean attributes, including relations
described by statistical hypothesis tests. It is also possible
to mine for conditional association rules and to deal with
missing information. The presented form of association
rules can be understood as a contribution to the discussion
about the notion of interesting patterns.
Several data structures consisting of disjunctions and
conjunctions of bit strings representing particular values
of attributes are maintained to optimise generation and
verification of association rules. The final algorithm is very
fast and its running time depends linearly on the number of
rows of the analysed data matrix. Time and memory complexity
are discussed in section 3.
As a demonstration of the capabilities of the bit string
approach we present the procedure 4ft-Miner (see section
2). The 4ft-Miner procedure is a part of the academic data
mining system LISp-Miner (see http://lispminer.vse.cz).
The bit string approach proved to be very efficient.
Experience with it has led to the development of new mining
procedures; an example can be found in section 4.
The presented approach was first applied in
connection with the development of the GUHA method of
mechanized hypothesis formation [2], [3].
2. Procedure 4ft-Miner
Procedure 4ft-Miner mines for association rules of the
form ϕ ≈ ψ and for conditional association rules of the
form ϕ ≈ ψ / χ. Here ϕ, ψ and χ are conjunctions of
Boolean attributes automatically derived from many-valued
attributes in various ways.
The symbol ≈ is called 4ft-quantifier. The association
rule ϕ ≈ ψ means that Boolean attributes ϕ and ψ are
somehow associated in the sense of the 4ft-quantifier ≈. A
conditional association rule ϕ ≈ ψ / χ means that ϕ and ψ
are associated (in the sense of ≈) if the condition χ is
satisfied.
The left part of the association rule (ϕ) is called the
antecedent, the part denoted as ψ is called the succedent and χ
is the condition. All parts together are referred to as cedents.
This section describes features of the procedure 4ft-Miner
to show advantages of the bit string approach. The
first one is the richness of the ways in which the set of
interesting association rules to be automatically generated
and verified can be simply defined, see section 2.1. The
second one is the possibility to deal with many types of
association rules, see section 2.2. The important features
of the output of 4ft-Miner are outlined in section 2.3.
2.1. Sets of Interesting Association Rules
Analysed data for the procedure are stored in data matrix.
Rows of the data matrix correspond to observed objects
and columns correspond to attributes – properties of
observed object. An example is the data matrix Loans, see
Figure 1.
Client   Age   Sex   Salary      District   Quality
1        45    M     very high   Prague     good
2        22    F     very low    Plzen      bad
3        37    F     average     Brno       good
4        53    F     high        Benesov    good
...      ...   ...   ...         ...        ...
6180     54    M     low         Kolin      bad
6181     30    F     high        Brod       good
Figure 1. – Data matrix Loans
Each row of the data matrix Loans describes one loan
given to a client of a bank. There are 6 181 loans. The first
row describes a loan received by a 45-year-old man.
This man has a very high salary and he lives in the district
of Prague. The quality of his loan is good.
Each cedent is a conjunction of Boolean attributes
called literals. A literal is an expression of the form A(α),
where A is an attribute and α is a subset of all possible
values (i.e. categories) of the attribute A. The subset α is
called the coefficient of the literal A(α). Examples of
cedents ϕ, ψ and χ are:
• ϕ = Age<20;30) – it is true if value of the attribute
Age is in the interval <20;30),
• ψ = Quality(good) – it is true if value of the attribute
Quality is good,
• χ = District(Prague, Plzen) ∧ Salary(very high) – it is
true if both the value of the attribute District is
Prague or Plzen and the value of the attribute Salary
is “very high”.
The set of interesting association rules to be generated
and tested on the given data matrix is defined by:
• Simple definition of all antecedents.
• Analogous simple definition of all succedents.
• Analogous simple definition of all conditions (if
desired).
• Definition of a 4ft-quantifier – there are 17 types of
4ft-quantifiers.
The antecedents are conjunctions of literals automatically
generated from the given set of antecedent attributes. It is
also possible to divide this set into several subsets called
partial antecedents. A partial antecedent is also
conjunction of literals, and the antecedent as whole is
conjunction of partial antecedents. The partial antecedent
is given by:
• a list of attributes – some of these attributes are
marked as basic (partial antecedent must contain at
least one basic attribute),
• a minimal and maximal number of attributes to be
used in partial cedent,
• a simple definition of the set of all literals to be
generated from each attribute.
Any literal can be positive or negative. The positive literal is
the literal A(α) itself. The negative literal is the
expression ¬A(α) – the Boolean negation of A(α).
The set of all literals to be generated for the particular
attribute is given by:
• A type of coefficient. Six types of coefficients are
available: subsets, intervals, left cuts, right cuts,
cuts, one particular value.
• A minimal and maximal number of values in the
coefficient.
• Positive/negative literal option:
a) only positive literals to be generated,
b) only negative literals to be generated,
c) both positive and negative literals to be
generated.
We use the attribute A with categories {1, 2, 3, 4, 5} to
give examples of particular types of coefficients:
• Subsets: definition of subsets with 2-3 categories
defines literals A(1,2), A(1,3), A(1,4), A(1,5), A(2,3),
…, A(3,4), ..., A(4,5), A(1,2,3), A(1,2,4), A(1,2,5),
A(2,3,4), …, A(3,4,5).
• Intervals: definition of intervals with 2-3 categories
defines literals A(1,2), A(2,3), A(3,4), A(4,5),
A(1,2,3), A(2,3,4) and A(3,4,5).
• Left cuts: definition of left cuts with maximally 3
categories defines literals A(1), A(1,2) and
A(1,2,3).
• Right cuts: definition of right cuts with maximally 4
categories defines literals A(5), A(5,4), A(5,4,3) and
A(5,4,3,2).
• Cuts: means both left cuts and right cuts.
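The coefficient types listed above can be sketched as simple generators over an ordered list of categories. This is only an illustration, not the LISp-Miner implementation; the function names and the (lo, hi) length parameters are assumptions made for the example, with lo ≥ 1.

```python
from itertools import combinations

def subsets(cats, lo, hi):
    """All subsets with lo..hi categories."""
    return [c for k in range(lo, hi + 1) for c in combinations(cats, k)]

def intervals(cats, lo, hi):
    """All runs of lo..hi consecutive categories."""
    return [tuple(cats[i:i + k])
            for k in range(lo, hi + 1)
            for i in range(len(cats) - k + 1)]

def left_cuts(cats, lo, hi):
    """Intervals that start at the first category."""
    return [tuple(cats[:k]) for k in range(lo, min(hi, len(cats)) + 1)]

def right_cuts(cats, lo, hi):
    """Intervals that end at the last category."""
    return [tuple(cats[-k:]) for k in range(lo, min(hi, len(cats)) + 1)]

def cuts(cats, lo, hi):
    """Both left cuts and right cuts (duplicates are not removed)."""
    return left_cuts(cats, lo, hi) + right_cuts(cats, lo, hi)

A = [1, 2, 3, 4, 5]
two_to_three_subsets = subsets(A, 2, 3)   # 10 pairs and 10 triples
two_to_three_intervals = intervals(A, 2, 3)
```

Each returned tuple plays the role of one coefficient α, so that A(α) is one generated literal.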
An example of the antecedent definition is in Figure 2.
Figure 2. – Example of the antecedent definition
There are two partial antecedents in Figure 2. The
partial antecedent Client_Basic contains attributes Sex,
Salary and District. Each line defines the types of
coefficients to be generated for the corresponding attribute.
Line “Sex(*), 1 – 1” means that subsets of categories of
the length from 1 to 1 are to be generated for attribute Sex.
It means literals Sex(F) and Sex(M).
Cuts are to be generated for attribute Salary. This
attribute has categories very low, low, average, high and
very high. All the possible cuts of the length from 1 to 2
are literals Salary(very low), Salary(very low, low),
Salary(very high) and Salary(high, very high).
Subsets of the length from 1 to 2 are to be generated
for the attribute District, see “District(*), 1 – 2”. It means
that all single districts, e.g. District(Prague), and all pairs of
districts, e.g. District(Plzen, Prague), will be generated.
There are 77 particular districts, thus 3 003 literals are
defined this way.
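The count of 3 003 literals follows from adding the single districts to the unordered pairs of districts:

```latex
\binom{77}{1} + \binom{77}{2} = 77 + \frac{77 \cdot 76}{2} = 77 + 2926 = 3003
```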
The partial cedent Client_Basic has length from 1 to
3. So at least one of the attributes Sex, Salary, District will
always be used in the antecedent.
The partial cedent Client_Age is defined such that
none or one of two types of literals concerning Age will
be used. By defining “Age(int) 5 – 5” we want all the
intervals of the length 5 to be generated. In other words,
there will be a sliding window of the length
5. The definition “Age(lcut) 1 – 10” means that left cuts
will be generated, thus we will investigate young clients.
An example of the coefficient given by one value is in
Figure 3. In such a case we concentrate on the loans with
bad quality.
Figure 3. – Example of the coefficient of one value
Let us emphasize that each cedent, and even each partial
cedent, is treated as an object and can be copied or moved to
another task or cedent.
2.2. Verification of Association Rules
The association rule ϕ ≈ ψ means that Boolean attributes
ϕ and ψ are associated in the sense of the 4ft-quantifier ≈.
The rule ϕ ≈ ψ can be true or false in the analysed data
matrix M. The conditional association rule ϕ ≈ ψ / χ is
true in the analysed data matrix M if the rule ϕ ≈ ψ is true
in the data matrix M / χ. The data matrix M / χ consists of
all rows of matrix M satisfying the condition χ. There
must exist at least one such row for ϕ ≈ ψ / χ to be true.
The association rule ϕ ≈ ψ is verified on the basis of the
four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M, see Figure 4.
M      ψ     ¬ψ
ϕ      a     b
¬ϕ     c     d
Figure 4. – Four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M
Here a is the number of objects satisfying both ϕ and ψ, b
is the number of objects satisfying ϕ and not satisfying ψ,
c is the number of objects not satisfying ϕ and satisfying
ψ, and d is the number of objects satisfying neither ϕ nor
ψ.
A true/false function based on frequencies from the
four-fold table <a,b,c,d> is defined by each 4ft-quantifier ≈.
The association rule ϕ ≈ ψ is true in the data
matrix M if the function defined by the 4ft-quantifier is
true in the four-fold table 4ft(ϕ, ψ, M) of ϕ, ψ in M.
Various 4ft-quantifiers are defined in [2] and [4]. Here
follow some examples:
• Founded implication ⇒p;Base
Parameters: 0 < p ≤ 1 and Base > 0
True iff $\frac{a}{a+b} \geq p \;\wedge\; a \geq \mathit{Base}$
The association rule ϕ ⇒p;Base ψ can be interpreted
as “100p per cent of objects satisfying ϕ satisfy also
ψ” or “ϕ implies ψ on the level 100p per cent”.
• Lower critical implication ⇒!p;α;Base
Parameters: 0 < p ≤ 1, Base > 0 and 0 < α ≤ 0.5
True iff $\sum_{i=a}^{a+b} \frac{(a+b)!}{i!\,(a+b-i)!}\, p^{i} (1-p)^{a+b-i} \leq \alpha \;\wedge\; a \geq \mathit{Base}$
Association rule ϕ ⇒!p;α;Base ψ corresponds to a
test (on the level α) of a null hypothesis H0: P(ψ|ϕ) ≤
p against the alternative one H1: P(ψ|ϕ) > p. If the
association rule ϕ ⇒!p;α;Base ψ is true in the data
matrix M then the alternative hypothesis is accepted.
• Double founded implication ⇔p;Base
Parameters: 0 < p ≤ 1 and Base > 0
True iff $\frac{a}{a+b+c} \geq p \;\wedge\; a \geq \mathit{Base}$
Association rule ϕ ⇔p;Base ψ can be interpreted as
“100p per cent of objects satisfying ϕ or ψ satisfy both
ϕ and ψ” or “ϕ ∨ ψ implies ϕ ∧ ψ on the level 100p
per cent”.
All the implemented 4ft-quantifiers are described at
http://lispminer.vse.cz\overview\4ft_quantifier.html.
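The three example quantifiers can be sketched as true/false functions of the four-fold table <a,b,c,d>. This is a direct evaluation of the conditions a/(a+b) ≥ p, a/(a+b+c) ≥ p and the binomial tail sum, not the LISp-Miner code; the function names and the sample table are assumptions made for the illustration. Evaluating the binomial sum term by term is practical only for small frequencies.

```python
from math import comb

def founded_implication(a, b, c, d, p, base):
    """True iff a / (a + b) >= p and a >= Base."""
    # Base > 0, so a >= base guarantees a non-zero denominator.
    return a >= base and a / (a + b) >= p

def lower_critical_implication(a, b, c, d, p, alpha, base):
    """True iff the binomial tail sum over i = a..a+b is <= alpha and a >= Base."""
    r = a + b
    tail = sum(comb(r, i) * p**i * (1 - p)**(r - i) for i in range(a, r + 1))
    return tail <= alpha and a >= base

def double_founded_implication(a, b, c, d, p, base):
    """True iff a / (a + b + c) >= p and a >= Base."""
    return a >= base and a / (a + b + c) >= p

# Hypothetical four-fold table, chosen only for the example.
table = (80, 20, 30, 70)
fi = founded_implication(*table, p=0.7, base=20)                       # 80/100 >= 0.7
lci = lower_critical_implication(*table, p=0.7, alpha=0.05, base=20)   # tail is below 0.05
dfi = double_founded_implication(*table, p=0.7, base=20)               # 80/130 < 0.7
```

All three functions take the whole table so that they share one signature, even though these particular quantifiers do not use d.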
The four-fold table can be computed very quickly,
see section 3. Let us remark that pre-computed
tables of critical frequencies can be used for the verification of
4ft-quantifiers based on statistical hypothesis tests [4].
This way we need only one test of inequality instead of
the computation of a complex formula.
When we deal with missing information we have to
compute nine-fold tables or even eighteen-fold tables. The
bit string approach again is used for very fast computation
of these tables. There are also several possibilities how to
reduce these tables back to four-fold table. For details see
e.g. [5].
2.3. Output of 4ft-Miner
Output of the procedure consists of all prime association
rules. The association rule is prime if both it is true in the
analysed data matrix and it does not follow immediately
from other, simpler association rules already in the
output.
The question is what it means that the association
rule ϕ ≈ ψ immediately follows from a simpler
association rule ϕ1 ≈ ψ1. The answer depends on properties of
the used 4ft-quantifier. The definition of prime
association rule for the 4ft-quantifier of founded
implication ⇒p;Base must take into account that if the
association rule e.g.
Sex(M) ⇒p;Base District(Prague)
is true then the association rule
Sex(M) ⇒p;Base District(Prague, Plzen)
is also always true. Thus the second association rule
immediately follows from the first, simpler one. All
the followers are automatically omitted from the output.
There is a theoretical background of logical properties
of association rules. For details see section 4 or e.g. [4].
An example of the output of 4ft-Miner is in Figure 5.
This output represents the task with the set of interesting
antecedents and succedents defined in Figure 2 and
Figure 3 respectively and with the quantifier ⇒0.7;20 of
founded implication. The whole solution contains 46
prime association rules.
Figure 5. – Example of the 4ft-Miner output
3. Bit String Approach
The basic principle of the bit-string approach is the
representation of analysed data by suitable strings of bits
(see section 3.1). This makes it possible to use simple
algorithms and data structures to efficiently compute the
necessary frequencies (see 3.2).
3.1. Bit-string Representation of Attributes
Each category of each attribute (i.e. each of its possible
values) is represented by one string of bits. This string is
called card of category [3]. We can use the attribute
District as an example. The attribute District has 77
categories: Benesov, Brno, … , Prague, Plzen, … ,
Znojmo. Its representation is shown in Figure 6.
Client                 1        2       3      4         ...   6180    6181    …
District               Prague   Plzen   Brno   Benesov   ...   Kolin   Brod    …
Cards of Categories
  Brno                 0        0       1      0         ...   0       1       …
  Kolin                0        0       0      0         ...   1       0       …
  Plzen                0        1       0      0         ...   0       0       …
  Prague               1        0       0      0         ...   0       0       …
  …                    …        …       …      …         …     …       …       …
Figure 6. – Cards of categories
The first row of this table corresponds to column Client
(row number) of the data matrix Loans, see Figure 1. The
second row of the table corresponds to column District.
Each of the further rows of Figure 6 is the card of one
category.
Each bit of the card of category corresponds to one
row of the data matrix Loans. The first bit corresponds to
the first row, the second bit corresponds to the second row,
etc. A bit is 1 if the value (i.e. category) in the column
District of the corresponding row is the category of this
card. Otherwise the bit is 0.
The first bit of the card of the category Benesov is 0
because the value in the first row of the data matrix is not
Benesov (but Prague). The third bit of the card of the
category Brno is 1 because the value in the third row is
Brno, etc.
There are 6181 rows in the data matrix Loans,
therefore 6 181 bits or 773 bytes are necessary to
represent one category by its card. Attribute District has
77 categories. It means that 59 521 bytes (i.e. 773 × 77)
are necessary to represent this attribute.
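A minimal sketch of this representation follows, using Python integers as bit strings (bit i stands for row i, counted from zero). The six rows are a condensed sample of the District column of Figure 1, and the helper names are assumptions made for the illustration. The last lines reproduce the memory estimate given above.

```python
def build_cards(column):
    """Map each category to its card: bit i is 1 iff row i has that category."""
    cards = {}
    for i, value in enumerate(column):
        cards[value] = cards.get(value, 0) | (1 << i)
    return cards

def bits(card, n_rows):
    """Render a card as text, first row first."""
    return "".join("1" if (card >> i) & 1 else "0" for i in range(n_rows))

# Condensed sample of the column District (rows 1-4, 6180 and 6181 of Figure 1).
district = ["Prague", "Plzen", "Brno", "Benesov", "Kolin", "Brod"]
cards = build_cards(district)
brno_card = bits(cards["Brno"], len(district))        # "001000"
benesov_card = bits(cards["Benesov"], len(district))  # "000100"

# Memory needed for the full attribute District of the data matrix Loans.
n_rows, n_categories = 6181, 77
bytes_per_card = (n_rows + 7) // 8                    # 773
bytes_per_attribute = bytes_per_card * n_categories   # 59521
```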
3.2. Algorithm and Data Structures
Each antecedent is represented by a structure named card of
antecedent. We denote it by Card_[antecedent]. It is a
string of bits of the same length as the number of rows in the
analysed data matrix. Each bit of the card corresponds again
to one row of the analysed data matrix. There is 1 in a
particular bit if the row corresponding to this bit satisfies
the antecedent. The card of antecedent is thus the bit-wise
representation of the Boolean attribute antecedent. It is
created as the conjunction of the cards of all its literals.
The card of a literal is created beforehand as the disjunction of
the cards of the categories in the literal's coefficient. A detailed
description is beyond the scope of this article and can be found in e.g. [3].
The number of 1’s in the card of antecedent is the
number of rows satisfying the antecedent. We use a low-level
bit-string function Count(α) returning the number of
values 1 in the string α.
The number of rows satisfying the antecedent must be
equal to or greater than the value of parameter Base, see
section 2.2. For every generated antecedent we test
whether Count(Card_[antecedent]) ≥ Base to decide if
this antecedent can be at all a part of a true association
rule. This test can be understood as a test of whether the
corresponding itemset is frequent [1].
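The construction can be sketched as follows, again with Python integers standing in for bit strings. The sample columns, the antecedent Sex(F) ∧ District(Plzen, Brno) and the value Base = 2 are chosen only for the illustration, and the helper names are hypothetical.

```python
from functools import reduce

def make_cards(column):
    """Cards of categories: bit i is 1 iff row i has the category."""
    cards = {}
    for i, value in enumerate(column):
        cards[value] = cards.get(value, 0) | (1 << i)
    return cards

def count(card):
    """Count: number of 1 bits in the bit string."""
    return bin(card).count("1")

def card_of_literal(cards, coefficient):
    """Disjunction (OR) of the cards of the categories in the coefficient."""
    return reduce(lambda acc, cat: acc | cards[cat], coefficient, 0)

def card_of_cedent(literal_cards, n_rows):
    """Conjunction (AND) of the cards of all literals of the cedent."""
    all_rows = (1 << n_rows) - 1
    return reduce(lambda acc, card: acc & card, literal_cards, all_rows)

sex = ["M", "F", "F", "F", "M", "F"]
district = ["Prague", "Plzen", "Brno", "Benesov", "Kolin", "Brod"]
n_rows = len(sex)
sex_cards, district_cards = make_cards(sex), make_cards(district)

# Antecedent: Sex(F) AND District(Plzen, Brno)
literals = [
    card_of_literal(sex_cards, ["F"]),
    card_of_literal(district_cards, ["Plzen", "Brno"]),
]
card_antecedent = card_of_cedent(literals, n_rows)

base = 2
is_candidate = count(card_antecedent) >= base   # two rows satisfy the antecedent
```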
Both Card_[antecedent] and Card_[succedent]
(analogous to card of antecedent) are used to compute
frequencies of four-fold table of antecedent and
succedent, see Figure 7.
M               Succedent   ¬ Succedent
Antecedent      a           b
¬ Antecedent    c           d
Figure 7. – Four-fold table from cards
The particular frequencies are computed in the following
way:
• a = Count(Card_[Antecedent] ∧ Card_[Succedent])
• b = Count(Card_[Antecedent]) – a
• c = Count(Card_[Succedent]) – a
• d = n – a – b – c
Here n is the total number of rows in the data matrix M.
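The four formulas translate directly into code. In this sketch the cards are Python integers over eight hypothetical rows, and the function names are assumptions made for the example.

```python
def count(card):
    """Count: number of 1 bits in the bit string."""
    return bin(card).count("1")

def four_fold_table(card_antecedent, card_succedent, n_rows):
    """Frequencies <a, b, c, d> computed from the two cards."""
    a = count(card_antecedent & card_succedent)
    b = count(card_antecedent) - a
    c = count(card_succedent) - a
    d = n_rows - a - b - c
    return a, b, c, d

# Hypothetical cards over 8 rows; the lowest bit is the first row.
card_antecedent = 0b00101101   # rows 0, 2, 3, 5 satisfy the antecedent
card_succedent = 0b01100110    # rows 1, 2, 5, 6 satisfy the succedent
table = four_fold_table(card_antecedent, card_succedent, 8)   # (2, 2, 2, 2)
```

Only one AND and three Count operations are needed, since b, c and d follow by subtraction.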
Memory used by strings of bits while running a data-mining
task is not a significant problem, especially when
compared to the significant time savings during
generation and verification.
Let us remark that e.g. a lot of medical data concerns
thousands of patients and tens or hundreds of attributes.
The corresponding data mining tasks can be solved
without problems on common PCs. Moreover, in many
cases we get the solution in several minutes or even in
several seconds. Therefore 4ft-Miner is also suitable for
teaching purposes.
Here we provide results of an experiment at a Pentium
400 MHz computer with 98 MB RAM. We solved tasks
to find true and prime association rules in the data
matrices Loans, Loans_10 and Loans_20. The data matrix
Loans_10 has 10 times more rows than original data
matrix Loans. Analogously data matrix Loans_20 has 20
times more rows.
There are about 7 000 000 relevant association rules
that have to be generated and verified according to the task
definition. Only about 70 000 association rules were
actually verified, thanks to the optimisations, some of which
are described above. The time of solution for particular data
matrices is given in Figure 8.
Data matrix          Loans    Loans_10   Loans_20
Rows                 6 181    61 810     123 620
Time of sol. [sec]   26       232        481
Figure 8. – Time of solution of various tasks
Let us emphasize that the time of the bit string operations
AND, NOT, OR and Count is linearly dependent on the
length of particular cards. The length of each card is equal
to the number of rows of the analysed data matrix. Thus
the time the procedure 4ft-Miner needs to solve a given
task is linearly dependent on the number of rows of the
analysed data matrix.
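The measurements in Figure 8 can be checked against this claim with a short calculation (values rounded):

```latex
\frac{232}{26} \approx 8.9 \ \ (10\times \text{ rows}), \qquad
\frac{481}{26} = 18.5 \ \ (20\times \text{ rows})

\frac{26\ \text{s}}{6181} \approx 4.2\ \text{ms per row}, \qquad
\frac{232\ \text{s}}{61810} \approx 3.8\ \text{ms per row}, \qquad
\frac{481\ \text{s}}{123620} \approx 3.9\ \text{ms per row}
```

The time per row stays roughly constant, which is consistent with a linear dependence on the number of rows.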
4. New Data Mining Procedures
Advantages of the bit-strings approach can be further used
in new data mining procedures. An example is the
procedure Pareto-Miner. Figures 9 and 10 express the
motivation for this procedure.
Both figures concern distribution of clients (see the
data matrix Loans, Figure 1) among particular regions.
The first one concerns all clients and the second one
concerns the clients with high salary only.
The distribution of clients with high salary differs remarkably
from the distribution of all clients. The difference
concerns namely the pair Prague – south Moravia. It can
be useful to find all segments of clients that differ in a
given way from the segment of all clients in the
distribution of clients among particular regions. The
Pareto-Miner procedure is intended to solve such tasks.
Its input consists of:
• a data matrix with columns linked to attributes and
rows corresponding to observed objects,
• an analysed attribute A (usually with several values),
• parameters defining a large set of conditions in the
same way as a set of conditions in the 4ft-Miner
procedure is defined,
• a criterion of interestingness of a particular condition.
Figure 9. – Distribution of all clients among regions
Figure 10. – Distribution of clients with high salary
among regions
The criterion of interestingness describes a distribution of
rows of the data matrix among the particular values of the
attribute A. Examples of the criteria are:
• a remarkable difference of the distribution when the
particular condition is satisfied and the distribution
for the whole analysed data matrix. The difference
can be measured e.g. by number of values with
different order.
• a remarkable difference of the distribution when the
particular condition is satisfied and the distribution
under another given condition.
The evaluation of these criteria requires knowledge of
frequencies of particular values of the attribute A under
the condition in question. These frequencies can be
computed using cards of cedents for conditions and using
cards of particular categories. Thus tools already
developed can be used.
We can use the already developed tools for generating
particular conditions C and for computing the card
Card_[C]. The particular frequencies can be computed such
that fi,j = Count(Card_[ai] ∧ Card_[sj] ∧ Card_[C]).
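A simplified sketch of this computation is given below. It covers only the frequencies of the categories ai of the attribute A under a condition C, i.e. Count(Card_[ai] ∧ Card_[C]); the card Card_[sj] is left out because it is not defined further in the text. The sample data and the way the "number of values with different order" is counted are assumptions made for the illustration.

```python
def make_cards(column):
    """Cards of categories: bit i is 1 iff row i has the category."""
    cards = {}
    for i, value in enumerate(column):
        cards[value] = cards.get(value, 0) | (1 << i)
    return cards

def count(card):
    """Count: number of 1 bits in the bit string."""
    return bin(card).count("1")

def frequencies(category_cards, card_condition):
    """Frequency of each category among the rows satisfying the condition."""
    return {cat: count(card & card_condition)
            for cat, card in category_cards.items()}

def ranking(freqs):
    """Categories ordered by decreasing frequency (ties broken by name)."""
    return sorted(freqs, key=lambda cat: (-freqs[cat], cat))

def values_with_different_order(freq_all, freq_cond):
    """Number of rank positions occupied by different categories."""
    return sum(1 for x, y in zip(ranking(freq_all), ranking(freq_cond)) if x != y)

# Hypothetical data: region of 8 clients and a card for "high salary".
region = ["Prague", "Prague", "Moravia", "Moravia",
          "Moravia", "Bohemia", "Prague", "Moravia"]
region_cards = make_cards(region)
card_all = (1 << len(region)) - 1
card_high_salary = 0b01000111        # rows 0, 1, 2 and 6

freq_all = frequencies(region_cards, card_all)
freq_cond = frequencies(region_cards, card_high_salary)
difference = values_with_different_order(freq_all, freq_cond)
```

In this toy example Prague and Moravia swap their positions under the condition, so the two distributions differ in two rank positions.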
Literature
[1] Agrawal, R. et al.: Fast Discovery of Association
Rules, Advances in Knowledge Discovery and Data
Mining (Fayyad, U. M. et al. eds.), AAAI Press / The
MIT Press, 1996, pp. 307-328.
[2] Hájek, P. – Havránek, T.: Mechanising Hypothesis
Formation – Mathematical Foundations for a
General Theory, Springer-Verlag, 1978, pp. 396.
[3] Rauch, J.: Some Remarks on Computer Realisations
of GUHA Procedures, International Journal of Man-Machine
Studies 10, 1978, pp. 23-28.
[4] Rauch, J.: Classes of Four-Fold Table Quantifiers,
Principles of Data Mining and Knowledge Discovery,
(J. Zytkow, M. Quafafou, eds.), Springer-Verlag,
1998, pp. 203-211.
[5] Rauch, J.: Four-fold Table Calculi and Missing
Information, JCI’S98 Association for Intelligent
Machinery, Vol. II., (Wang Paul eds.), Durham,
Duke University, 1998.
[6] Rauch, J. – Šimůnek, M.: Mining for 4ft Association
Rules by 4ft-Miner, INAP 2001, The Proceeding of
the International Conference On Applications of
Prolog, Prolog Association of Japan, Tokyo, October
2001, pp. 285-294.
This paper has been supported by the grant
COST ACTION 274 – TARSKI (Theory and
Applications of Relational Structures as Knowledge
Instruments).