Preprocessing data sets for association rules using community detection and clustering: a comparative study

Renan de Padua¹, Exupério Lédo Silva Junior¹, Laís Pessine do Carmo¹, Veronica Oliveira de Carvalho² and Solange Oliveira Rezende¹

¹ Instituto de Ciências Matemáticas e de Computação, USP - Universidade de São Paulo, São Carlos, Brasil
  {padua,solange}@icmc.usp.br, {exuperio.silva, lais.carmo}@usp.br

² Instituto de Geociências e Ciências Exatas, UNESP - Univ Estadual Paulista, Rio Claro, Brasil
  [email protected]
Abstract. Association rules are widely used to analyze correlations among items in databases. One of the main drawbacks of association rule mining is that the algorithms usually generate a large number of rules that are not interesting or are already known by the user. Finding new knowledge among the generated rules therefore becomes a challenge in itself. One possible solution is to raise the support and confidence thresholds, so that fewer rules are generated. The problem with this approach is that, as the support and confidence values approach 100%, the generated rules tend to be formed by dominating items and to be more obvious. Some research has been carried out on the use of clustering algorithms to prepare the database before extracting the association rules, so that the grouped data can account for items that appear in only a part of the database. However, even with the good results that clustering methods have shown, they rely only on similarity (or distance) measures, which limits the clustering. In this paper, we evaluate the use of community detection algorithms to preprocess databases for association rule mining. We compare the community detection algorithms with two clustering methods, aiming to analyze the novelty of the generated rules. The results show that community detection algorithms perform well regarding the novelty of the generated patterns.
1. Introduction
Association rules are widely used to extract correlations among items in a given database due to their simplicity. The rules have the following pattern: LHS ⇒ RHS, where LHS is the left-hand side (rule antecedent) and RHS is the right-hand side (rule consequent). LHS ⇒ RHS holds with a probability of c%, the confidence of the rule, which is used together with the support to generate the association rules [1].
The association rule extraction is generally done in two steps: (i) frequent itemset mining and (ii) pattern extraction. Step (i) is based on the support, which measures the number of transactions in which an itemset occurs (an itemset is a set of items that occur in the database). For example, given an itemset {a, b, c}, its support is the number of transactions that contain all three items divided by the total number of transactions. Step (ii) consists of combining all the frequent itemsets and calculating the confidence of each possible rule. The confidence measures the quality of the rule, i.e., the probability of c% that the RHS will occur provided that the LHS occurred; it is calculated by dividing the support of the rule by the support of the LHS. This process has two main drawbacks: (a) items that are not dominant in the database (have low support) but have interesting correlations are usually not found. This is the “local knowledge”, the interesting knowledge that represents a portion of the database but is not very frequent; (b) the number of generated rules that are not interesting normally surpasses the user's ability to explore them.
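To make the two steps concrete, the short sketch below computes support and confidence over a toy transaction list. The data, thresholds and helper names are illustrative only and are not part of the original experiments.

```python
# Toy transaction database (each transaction is a set of items).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of LHS => RHS: support(LHS union RHS) / support(LHS)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Step (i): the itemset {a, b} occurs in 3 of the 5 transactions.
print(support({"a", "b"}, transactions))        # 0.6
# Step (ii): the rule {a} => {b} holds with confidence sup({a,b}) / sup({a}).
print(confidence({"a"}, {"b"}, transactions))   # 0.75
```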
Aiming to solve the first drawback (a), studies have investigated preprocessing the database before submitting it to association rule extraction. One way of doing this is to cluster the data used to generate the association rules, aiming to extract, inside each group, the local knowledge that would not be extracted from the original database [8, 9]. These clustering approaches obtained interesting results, since they can (i) extract the local knowledge from databases and (ii) organize the domain. However, traditional clustering methods rely only on similarity (or distance) measures, which can disregard important characteristics of the database.
One area that is growing in the literature is community detection, which groups data modeled as a complex network. Community detection algorithms are capable of finding structural communities in the database using networks [12]. These algorithms have shown strong performance, but so far they have been used only to group the data, not to support association rule mining. This characteristic may therefore improve the association rule mining process [17, 13].
Based on the above, this paper analyzes some community detection algorithms in the association rule mining context, comparing the results with traditional clustering methods. The results show that community detection algorithms provide a different analysis of the data, exploring a different part of the knowledge and generating new and possibly interesting associations. The paper is organized as follows. Section 2 describes the related research. Section 3 presents the assessment methodology used. Section 4 presents the experiment configurations. Section 5 discusses the results obtained in the experiments. Finally, conclusions and future work are given in Section 6.
2. Background
The preprocessing area aims to reduce the number of items to be analyzed by the association rule extraction algorithm, so that the items that compose the local knowledge of the database are taken into account. In the literature, preprocessing approaches normally use clustering algorithms to find the local knowledge, putting similar knowledge together. This way, items that have low support will be considered by the association rule extraction algorithms, and the dominating items will be split among the groups and become less dominant.
In [1], a clustering algorithm called CLASD (CLustering for ASsociation Discovery) was proposed to cluster the transactions in a database. The authors proposed a measure to calculate the similarity among transactions considering the items they have. The measure is shown in Equation 1, where CountT() returns the number of transactions that contain the items inside the parentheses, Tx ∩ Ty denotes the items that both transactions have, and Tx ∪ Ty the items that appear in at least one of the two transactions. The algorithm is hierarchical, using a bottom-up strategy, which means that each transaction starts as a unitary group and groups are merged until a parameter k, representing the desired number of groups, is met. The merge between two groups is done using complete linkage. The results showed that the proposed approach was capable of finding more frequent itemsets compared to a random partition and to the original data, creating new rules to be explored that would not be extracted in the other cases.

Sim(Tx, Ty) = CountT(Tx ∩ Ty) / CountT(Tx ∪ Ty)    (1)
In [14], a different approach is proposed. It still clusters the data but, instead of clustering the transactions, it clusters the items in the database. The study calculates the similarity among items based on the transactions they appear in, as shown in Equation 2, where Trans(Ix) returns all the transactions that contain the item Ix, and ∩ and ∪ have the same meaning as in Equation 1. Besides this measure, the authors used three other measures to calculate the items' similarity and applied some clustering algorithms over the database. The results demonstrated that the proposed approach performs well on sparse data, where the general support is extremely low, extracting rules that would not be generated if the association rule extraction algorithm were applied over the non-clustered data set.

Sim(Ix, Iy) = CountT(Trans(Ix) ∩ Trans(Iy)) / CountT(Trans(Ix) ∪ Trans(Iy))    (2)
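As an illustration, the fragment below computes the item similarity of Equation 2 over a toy transaction list. The data and names are made up for the example and are not taken from [14].

```python
def trans(item, transactions):
    """Trans(Ix): indices of the transactions that contain the item."""
    return {i for i, t in enumerate(transactions) if item in t}

def sim_items(ix, iy, transactions):
    """Equation 2: transactions containing both items divided by the
    transactions containing at least one of them."""
    tx, ty = trans(ix, transactions), trans(iy, transactions)
    return len(tx & ty) / len(tx | ty)

# Items "a" and "b" co-occur in 2 of the 4 transactions that contain either of them.
transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(sim_items("a", "b", transactions))   # 2 / 4 = 0.5
```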
Both approaches aimed to find rules that would not be generated in the original database. However, the analysis performed by the authors considers only the number of new rules (or new itemsets) that were mined. They do not analyze the results in terms of the ratio of new rules, the number of maintained rules, and other aspects that verify the quality of the preprocessing algorithms (see Section 3). More works using clustering can be seen in [8] and [9].
Besides these approaches, some works use community detection algorithms to preprocess the data set. Community detection algorithms search for the natural groups that occur in a network, regardless of their number and size, and are a primary tool to discover and understand the large-scale structure of networks [12]. The main goal of community detection algorithms is to split the database into groups according to the network structure. The main difference from clustering methods is that community detection algorithms do not rely only on the similarity among the items to do the split, but also use the network structure to find the groups [12].
In [17], the authors used a structure called a Product Network: each vertex of the network represents an item and the edge (link) between two vertices is weighted by the number of transactions in which the items occur together. First, the authors created the Product Network and applied a filter, reducing the number of connections among different products. Then, a community detection algorithm was applied to group the products. The authors obtained a large number of communities, around 28 or more on each data set they used, each one containing a mean of 7 items per group. The entire discussion is about the amount of knowledge to be explored in each group, making the exploration and understanding of each group easier for the user. More works using community detection algorithms can be seen in [13] and [3].
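A rough sketch of how such a Product Network could be assembled from transaction data, counting item co-occurrences as edge weights, is shown below. This is only an illustration of the idea in [17], not the authors' code, and the filtering threshold is an arbitrary placeholder.

```python
from collections import Counter
from itertools import combinations

def build_product_network(transactions, min_cooccurrence=2):
    """Vertices are items; an edge (i, j) is weighted by the number of
    transactions in which i and j occur together. Weak edges are filtered out."""
    weights = Counter()
    for t in transactions:
        for i, j in combinations(sorted(t), 2):
            weights[(i, j)] += 1
    # Keep only edges at or above the (illustrative) co-occurrence threshold.
    return {edge: w for edge, w in weights.items() if w >= min_cooccurrence}

transactions = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}]
print(build_product_network(transactions))   # {('a', 'b'): 2, ('a', 'c'): 2, ('b', 'c'): 2}
```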
3. Assessment Methodology
The approaches available in the literature have presented good results, as described in Section 2, but the analyses they performed do not consider some important points. Moreover, traditional clustering methods rely only on similarity (or distance) measures, which can disregard important characteristics of the database. The use of complex networks to find groups can bring important features into the grouping process, such as the position of the elements and the data density.
Figure 1. The assessment methodology.
Based on this, this paper analyzes community detection algorithms, which use the network structure to find groups, as an aid to the association rule mining process. The analysis is illustrated in Figure 1 and consists of comparing the results obtained by traditional clustering methods with the results obtained by community detection algorithms. For that, some metrics proposed in [6] are used (described below). To compute the metrics, two processes are executed: the traditional one (A) and the one that uses community detection (B). First, the original association rule set (AR) is mined from each database. This set is used to compute the evaluation metrics (see below). Then, for each database, the similarity among all the transactions is calculated. After that, in (A), a traditional clustering algorithm is applied. Inside each obtained group, an AR extraction algorithm is executed, generating groups of rules. All of these rules form the clustered AR set (ARcl). In (B), based on the computed similarities, the data is modeled as a network: vertices are transactions and edges carry the computed similarity between them. Then, a community detection algorithm is applied. Inside each obtained group, an AR extraction algorithm is executed, generating groups of rules. All of these rules form the community detection AR set (ARcd). Considering the obtained AR sets, the evaluation metrics are computed and the comparison is done.
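To make branch (B) concrete, the sketch below strings the steps together in Python. It assumes python-igraph in place of the R igraph package used by the authors, and mine_rules() is a placeholder for an Apriori implementation (the authors used Christian Borgelt's apriori); names and data layout are illustrative.

```python
import igraph

def jaccard(tx, ty):
    """Equation 3: shared items over the union of items of two transactions."""
    return len(tx & ty) / len(tx | ty)

def community_rule_sets(transactions, mine_rules, min_sup, min_conf):
    """Process (B): similarity network -> communities -> rules mined per group."""
    n = len(transactions)
    edges, weights = [], []
    for i in range(n):
        for j in range(i + 1, n):
            s = jaccard(transactions[i], transactions[j])
            if s > 0:                      # similarity 0 means no connection
                edges.append((i, j))
                weights.append(s)
    g = igraph.Graph(n=n, edges=edges)
    g.es["weight"] = weights
    # Spinglass is used here as in the experiments; it expects a connected graph.
    communities = g.community_spinglass(weights="weight")
    ar_cd = set()
    for community in communities:          # each community is a list of vertex ids
        group = [transactions[i] for i in community]
        ar_cd |= set(mine_rules(group, min_sup, min_conf))
    return ar_cd
```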
Regarding the evaluation metrics, we used some of the metrics proposed in [6]. Some of these metrics evaluate the benefits obtained from the data grouping, which means that some measures need to consider both the original association rule set (AR) and an association rule set generated after the grouping (ARcl or ARcd). Besides, the authors use the concept of h-top best rules, which consists of selecting the h% best rules, based on an objective measure, from the AR set and the h% best rules from the grouped set to be analyzed (ARcl or ARcd). These are the rules considered the most interesting to the user according to the chosen objective measure. In this work, h was set to 1% of the total number of rules and the objective measure used was Lift. For details about objective measures see [7]. The metrics selected for this work are:
• MO-RSP: ratio of the rules in the AR set that were kept in ARcl or ARcd . The aim
is to analyze the amount of knowledge that was maintained; the higher the value
the better the result.
• MR-O-RSP: ratio of the rules in the AR set that were generated more than once
in ARcl or ARcd (the same rule can be extracted in different groups). The aim
is to analyze the amount of knowledge that was repeatedly generated in different
groups; the lower the value the better the result.
• MN-RSP: ratio of new rules in ARcl or ARcd . A rule is new if it is not in the
AR set. The aim is to analyze the number of rules that were generated in ARcl or
ARcd that were not generated in AR; the higher the value the better the result.
• MR-N-RSP: ratio of new rules that were generated more than once in ARcl or
ARcd (the same rule can be extracted in different groups). The aim is to analyze
the repetition of rules in ARcl or ARcd ; the lower the value the better the result.
• MN-I-RSP: ratio of new rules generated in ARcl or ARcd that are among the h-top
best rules of the set (ARcl or ARcd ). The aim is to analyze the number of new
rules that are considered as interesting; the higher the value the better the result.
• MO-I-N-RSP: ratio of rules among the h-top best rules of the AR set that are not contained in ARcl or ARcd. The aim is to analyze the number of interesting rules in AR that were lost in ARcl or ARcd; the lower the value the better the result.
• MC-I: ratio of rules that are among the h-top best rules both in AR and in ARcl or ARcd. The aim is to analyze the number of rules considered interesting both in AR and in ARcl or ARcd; for this measure, no consensus was reached by the specialists.
• MNC-I-RSP: ratio of clusters needed to contain all the h-top best rules in ARcl or ARcd. The aim is to analyze the percentage of groups that need to be explored in order to find all the h-top best rules; the lower the value the better the result.
It is important to highlight that, even though the metric descriptions always cite the grouped sets together (“ARcl or ARcd”), the measures are applied to only one grouped set at a time. This means that if we have 4 different algorithms, the metrics are calculated for each of these algorithms separately from the others. The two sets are always cited together to strengthen the notion that the metrics were calculated for all the grouped data sets.
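As a reading aid, the sketch below shows how a few of these ratios could be computed once AR and a grouped set (ARcl or ARcd) are available as Python sets of rule identifiers. It follows the metric descriptions above, not the reference implementation of [6], and the rule identifiers are invented for the example.

```python
def mo_rsp(ar, grouped_rules):
    """MO-RSP: fraction of the original AR set that is kept in the grouped set."""
    return len(ar & grouped_rules) / len(ar)

def mn_rsp(ar, grouped_rules):
    """MN-RSP: fraction of the grouped set made of rules absent from AR."""
    return len(grouped_rules - ar) / len(grouped_rules)

def mnc_i_rsp(groups_of_rules, h_top):
    """MNC-I-RSP: fraction of groups a user must explore to see every h-top rule
    (groups holding at least one h-top rule, assuming no repetition across groups)."""
    covering = [g for g in groups_of_rules if g & h_top]
    return len(covering) / len(groups_of_rules)

ar = {"r1", "r2", "r3", "r4"}
ar_cd = {"r2", "r3", "r5", "r6"}
print(mo_rsp(ar, ar_cd))   # 2 of the 4 original rules kept -> 0.5
print(mn_rsp(ar, ar_cd))   # 2 of the 4 grouped rules are new -> 0.5
```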
4. Experimental Setup
The assessment methodology was applied to six databases: balance-scale (bs), breast-cancer (bc), car, dermatology (der), tic-tac-toe (ttt) and zoo. All of them are available at the UCI repository (http://archive.ics.uci.edu/ml/datasets.html). These databases were processed and converted to an attribute/value format. The modified databases can be downloaded from http://sites.labic.icmc.usp.br/padua/basesENIAC2016.zip. To generate the association rules, the Apriori algorithm [2] was used, with the implementation available at Christian Borgelt's homepage (http://www.borgelt.net/apriori.html). To obtain the AR set, the minimum support and minimum confidence were empirically defined aiming to extract from 1,000 to 2,500 rules (due to the sensitivity of the tic-tac-toe data set, a greater number of rules was generated for it). These values can be seen in Table 1. The second column shows the number of transactions, that is, the number of available examples. The third column presents the number of attributes of each data set. It is important to note that, after the conversion to the attribute/value format, each attribute was divided into N attributes, N being the number of possible values the attribute had. For example, if a data set contains the attribute “color” with 3 possible values, “blue”, “yellow” and “red”, then the processed data set will have 3 attributes: “color=blue”, “color=yellow” and “color=red”. The fourth column presents the minimum support and minimum confidence used. The last column presents the number of generated rules.
Table 1. Databases details.

Data set              #Transac   #Attrib   Sup/Conf   #Rules
balance-scale (bs)    625        25        1%/5%      1056
breast-cancer (bc)    286        53        10%/15%    2220
car                   1728       25        1%/5%      758
dermatology (der)     366        135       70%/75%    2687
tic-tac-toe (ttt)     958        29        5%/10%     4116
zoo                   101        147       40%/50%    1637
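The attribute/value conversion described above can be reproduced in a few lines. The sketch below is a generic illustration, not the exact script used to prepare the six databases.

```python
def to_attribute_value(records, attribute_names):
    """Turn each record into a transaction of 'attribute=value' items,
    e.g. the attribute 'color' with value 'blue' becomes the item 'color=blue'."""
    return [{f"{a}={v}" for a, v in zip(attribute_names, record)} for record in records]

records = [("blue", "small"), ("red", "small"), ("yellow", "large")]
transactions = to_attribute_value(records, ["color", "size"])
print(transactions[0])   # items such as 'color=blue' and 'size=small'
```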
To split the transactions into groups, it is necessary to compute a similarity measure over all pairs of transactions. The similarity measure used was the Jaccard coefficient, presented in Equation 3, where Items(Tx) returns all the items contained in transaction Tx and #(Y) returns the number of items contained in Y. The dissimilarity between Tx and Ty is computed as 1 − Jaccard(Tx, Ty).

Jaccard(Tx, Ty) = #(Items(Tx) ∩ Items(Ty)) / #(Items(Tx) ∪ Items(Ty))    (3)
Regarding process (A) in Figure 1, two clustering algorithms were used: Partitioning Around Medoids (PAM) and Ward [10], both available in R (https://www.r-project.org/). We selected these algorithms because PAM is a partitional clustering method, which divides the data set according to a parameter K, while Ward is a hierarchical method, which creates a dendrogram that can be cut at different heights. In this paper, the number of generated groups was set according to the number of groups obtained from the community detection algorithms (it is important to use the same number of groups to compute the evaluation metrics).
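The paper runs PAM and Ward through their R implementations; as a rough Python equivalent of the Ward branch, the sketch below builds the Jaccard dissimilarity of Equation 3 and cuts a Ward dendrogram into k groups, with SciPy's Ward linkage standing in for the R routine and k taken from the community detection step. It is a sketch under those assumptions, not the authors' pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def jaccard_dissimilarity_matrix(transactions):
    """1 - Jaccard(Tx, Ty) for every pair of transactions (Equation 3)."""
    n = len(transactions)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            jac = len(transactions[i] & transactions[j]) / len(transactions[i] | transactions[j])
            d[i, j] = d[j, i] = 1.0 - jac
    return d

def ward_groups(transactions, k):
    """Cut a Ward dendrogram into k groups, mirroring the clustering step of process (A)."""
    condensed = squareform(jaccard_dissimilarity_matrix(transactions))
    dendrogram = linkage(condensed, method="ward")
    return fcluster(dendrogram, t=k, criterion="maxclust")   # one group label per transaction

transactions = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d", "e"}]
print(ward_groups(transactions, k=2))   # e.g. [1, 1, 2, 2]
```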
Regarding process (B) in Figure 1, which uses community detection algorithms, the data was modeled as a simple homogeneous network, each node being a transaction and the edge weight between two nodes being the similarity between them (a similarity equal to 0 means no connection between the transactions). Four different community detection algorithms were used, all of them available in igraph (http://igraph.org/): Modularity based [11], Leading Eigenvector [12], Spinglass [16] and Walktrap [15]. These community detection algorithms were selected due to their variety of characteristics. The modularity based algorithm does not need a parameter definition. The Leading Eigenvector needs the number of clusters; however, the igraph default configuration was used. The Walktrap algorithm has the number of steps as a parameter, which was set to 4 according to [15]. The Spinglass parameters were defined according to [16]. The Modularity and Spinglass algorithms start from a random division of the vertices into groups and go through the entire network changing the group of each vertex and recalculating the values. The Walktrap algorithm starts with every vertex in its own group and merges the closest groups at each iteration. A general overview of these algorithms is given below, followed by a sketch of how they can be invoked:
• Modularity based [11]: this community detection algorithm uses the modularity measure, which rewards connections among vertices in the same community and penalizes connections among vertices in different communities.
• Leading Eigenvector [12]: this algorithm uses the concepts of eigenvector and eigenvalue, calculated over the modularity matrix, to group the data.
• Spinglass [16]: this algorithm uses a measure divided into four parts. Two of them reward the connections among vertices in the same community and the lack of connections among vertices in different communities. The other two penalize the connections among vertices in different communities and the lack of connections among vertices in the same community.
• Walktrap [15]: this algorithm uses Ward's method, merging two communities at each step according to the squared distance between them.
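For reference, the four algorithms can be invoked roughly as follows in python-igraph (the paper used the R interface). The parameter choices mirror the description above, the fast greedy routine is one possible modularity-based choice (the paper does not name the exact routine), and weight handling may differ slightly between igraph versions.

```python
import igraph

# g is assumed to be the weighted transaction network built for process (B);
# a small built-in graph is used here only to keep the sketch runnable.
g = igraph.Graph.Famous("Zachary")
g.es["weight"] = [1.0] * g.ecount()

fastgreedy = g.community_fastgreedy(weights="weight").as_clustering()        # modularity-based
leading_ev = g.community_leading_eigenvector()                               # igraph defaults
spinglass  = g.community_spinglass(weights="weight")                         # parameters as in [16]
walktrap   = g.community_walktrap(weights="weight", steps=4).as_clustering() # steps=4 as in [15]

print(len(spinglass), "communities found by Spinglass")
```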
Finally, to define the minimum support and minimum confidence used inside the groups (as presented in Figure 1), the following strategy was adopted: the largest group, i.e., the one containing the highest number of transactions, was selected, and the minimum support and minimum confidence were empirically defined aiming to extract no more than 1,000 rules. The obtained values were then applied to all the other groups. This process was done for each data set. Table 2 presents the number of groups and the minimum support and minimum confidence that were used. The table presents the configurations only for the Spinglass community detection algorithm, since it obtained the best results among the community detection algorithms. All the results, containing all the configurations, can be seen at http://sites.labic.icmc.usp.br/padua/ResultadosENIAC2016.xlsx. It is important to note that for the dermatology data set the support and confidence used in the Ward clustered set had to be lowered to 90%/90%, since no rule was generated using 99%/99%. Likewise, for the zoo data set, the values were lowered to 40%/50% for Ward's algorithm for the same reason.

Table 2. Support and Confidence values used on the grouped data.

Data set              # of groups   Sup/Conf (groups)
balance-scale (bs)    3             5%/20%
breast-cancer (bc)    3             25%/50%
car                   9             10%/50%
dermatology (der)     3             99%/99%
tic-tac-toe (ttt)     9             20%/50%
zoo                   3             90%/90%
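The threshold selection strategy above can be expressed as a small search over support values on the largest group. The sketch below assumes the placeholder mine_rules() from the earlier pipeline sketch, fixes the confidence for simplicity (the authors tuned both thresholds empirically), and uses an arbitrary support grid.

```python
def tune_thresholds(groups, mine_rules, min_conf, max_rules=1000):
    """Pick the largest group and lower the minimum support until just before
    the number of extracted rules exceeds max_rules; the chosen value is then
    reused for every other group of the same data set."""
    largest = max(groups, key=len)                                 # group with most transactions
    chosen = None
    for min_sup in (0.9, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05, 0.01):     # illustrative support grid
        if len(mine_rules(largest, min_sup, min_conf)) > max_rules:
            break
        chosen = min_sup
    return chosen
```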
5. Results and Discussion
The values obtained for each of the metrics are presented in Tables 3 to 6. Table 3 presents the results regarding MO-RSP and MR-O-RSP. The first column presents the database name (see Table 1). From the second to the fourth column, the values of MO-RSP are presented for the Spinglass (SP), PAM and Ward (WD) algorithms, respectively. From the fifth to the seventh column, the values of MR-O-RSP are presented for the same algorithms. Only the results regarding Spinglass are presented since it obtained the best results among the community detection algorithms. All the other tables (Tables 4, 5 and 6) follow the same pattern. The best results in each data set are highlighted in gray.
As mentioned, Table 3 presents the results obtained by MO-RSP and MR-O-RSP. The first metric analyzes the amount of knowledge maintained from the AR set and the second the amount of this knowledge that is repeated. It can be seen that Spinglass presented a better performance on MO-RSP in 3 of the 6 data sets, while PAM performed better in 2 and Ward in 1 data set. The value obtained for the second metric, 0% of repetition, was the same for all algorithms and data sets, which is a very interesting result: it shows that no rule is repeated in ARcd or ARcl. Analyzing these measures together, it can be seen that the Spinglass algorithm is capable of generating new knowledge while maintaining part of the knowledge in the AR set. The PAM algorithm presents a similar behavior; however, it was not as effective as Spinglass. The Ward algorithm won in the zoo data set because its support and confidence had to be lowered to the same values used in the AR set due to the lack of generated rules.
Table 3. Results obtained by MO-RSP and MR-O-RSP metrics.

        |          MO-RSP            |        MR-O-RSP
Base    |   SP      PAM       WD     |   SP      PAM      WD
bs      | 64.39%   61.74%    9.19%   |  0.00%   0.00%    0.00%
bc      | 59.23%   55.32%   11.80%   |  0.00%   0.00%    0.00%
car     | 89.18%   99.47%   42.08%   |  0.00%   0.00%    0.00%
der     | 19.09%    7.03%    0.33%   |  0.00%   0.00%    0.00%
ttt     | 43.33%   60.42%    2.04%   |  0.00%   0.00%    0.00%
zoo     | 32.25%   31.77%     100%   |  0.00%   0.00%    0.00%
Table 4 presents the results obtained by MN-RSP and MR-N-RSP. The first metric analyzes the ratio of new knowledge generated and the second the ratio of new knowledge that is repeated in different groups. On MN-RSP there is a tie between Spinglass and PAM, each of them winning on 3 data sets. However, in the tic-tac-toe data set, the amount of new knowledge generated by the Spinglass algorithm is almost 40% higher than the amount generated by PAM, while the biggest difference in the cases PAM won is about 10%. On the second metric a tie also occurred: both PAM and SP had the best results in 4 data sets. However, a closer look at the results shows that Spinglass is more stable than PAM, as seen on the balance-scale data set: PAM obtained 100% in one case, which is the worst possible value for this metric, while the worst result obtained by SP was 11.97%.
Table 4. Results obtained by MN-RSP and MR-N-RSP metrics.

        |          MN-RSP            |        MR-N-RSP
Base    |   SP      PAM       WD     |   SP      PAM      WD
bs      | 10.29%    0.00%    0.00%   |  0.00%    100%     100%
bc      |  4.99%   14.84%    0.00%   |  0.00%   0.00%     100%
car     | 76.83%   76.82%    0.93%   | 10.99%   8.42%    0.00%
der     | 76.07%   85.88%    0.00%   |  0.00%   0.00%     100%
ttt     | 87.89%   48.88%    0.00%   | 11.97%   0.08%     100%
zoo     | 84.85%   86.62%   47.18%   |  0.00%   0.00%    4.32%
Table 5 presents the results obtained for MN-I-RSP and MO-I-N-RSP. The first metric analyzes the ratio of new rules in ARcl or ARcd that are among the h-top interesting rules of their sets, and the second the ratio of rules that were among the h-top rules in the AR set but are not found in ARcl or ARcd. Regarding the first metric, Spinglass won in most of the data sets, meaning that it generated more interesting new rules compared to the other algorithms. Even in the cases it lost (car and zoo), the difference is no more than 1%, while in the cases it won, the difference reaches almost 7%. On the second metric all the algorithms performed badly, with only 3 values different from 100% (2 for PAM and 1 for Ward). This means that, in almost all cases, the rules among the h-top best rules in the AR set were not found in ARcl or ARcd. Analyzing these two metrics together, it is possible to see that Spinglass brought more novelty to the generated knowledge compared to the others, not maintaining what was considered interesting before (100% on all data sets). On the other hand, the other algorithms generated less new interesting knowledge, but in some cases maintained some of the rules that were considered interesting in the AR set.
Table 5. Results obtained by MN-I-RSP and MO-I-N-RSP metrics.

        |         MN-I-RSP           |        MO-I-N-RSP
Base    |   SP      PAM       WD     |   SP      PAM       WD
bs      |  87.5%   85.71%   66.67%   |   100%    100%     100%
bc      | 93.75%   62.50%   85.71%   |   100%   77.27%    100%
car     | 97.44%   98.00%   35.71%   |   100%    100%    57.14%
der     | 95.65%   93.33%    0.00%   |   100%    100%     100%
ttt     | 99.44%   92.98%   66.67%   |   100%   92.68%    100%
zoo     | 97.22%   97.44%   98.25%   |   100%    100%     100%
Table 6 presents the results obtained for MC-I and MNC-I-RSP. The first metric analyzes the ratio of rules contained in both h-top rule sets, that is, the rules that are among the h-top best of the AR set and also among the h-top best of ARcl or ARcd. The second metric analyzes the percentage of clusters that needs to be explored in order to find all the h-top rules in ARcl or ARcd. For the first metric no cell was highlighted because the specialists did not reach a consensus in [6] regarding the interpretation of its values. A value of 0% means that none of the rules selected as the h-top best in ARcl or ARcd were selected as the h-top best in AR. This means that all the interesting knowledge in ARcl or ARcd is new, directing the user to explore an entirely new set of interesting rules compared to the AR set. On the second metric the Spinglass algorithm won again, concentrating all the interesting rules in fewer groups compared to the others. Spinglass won in 4 data sets, while PAM won in 2 (1 tie with SP); on the zoo data set there was a tie among all 3 algorithms.
Table 6. Results obtained by MC-I and MNC-I-RSP metrics.

        |           MC-I             |        MNC-I-RSP
Base    |   SP      PAM       WD     |   SP      PAM       WD
bs      |  0.00%    0.00%    0.00%   | 33.33%  66.67%     100%
bc      |  0.00%   22.73%    0.00%   |   100%  66.67%     100%
car     |  0.00%    0.00%   28.57%   | 11.11%  33.33%   88.89%
der     |  0.00%    0.00%    0.00%   | 33.33%  33.33%     100%
ttt     |  0.00%    7.32%    0.00%   | 11.11%  77.78%     100%
zoo     |  0.00%    0.00%    0.00%   | 33.33%  33.33%   33.33%
On the metrics that analyze the new knowledge (MN-RSP, MN-I-RSP and MO-I-N-RSP), the Spinglass algorithm performed better than PAM and Ward. Taken together, these measures indicate that the Spinglass algorithm can be used when the user wants to obtain new knowledge, without the need to maintain what was considered interesting in the non-clustered data set. On the metrics that analyze the maintained knowledge (MO-RSP and MC-I), the PAM algorithm obtained higher results. This means that the PAM algorithm was also capable of finding new knowledge, but part of the interesting knowledge from the original data set was maintained together with the newly obtained interesting results. Thus, the results indicate that the Spinglass algorithm can generate more new knowledge, while PAM and Ward are better at maintaining the knowledge that was previously considered interesting.
6. Conclusion
This paper presented a comparative study between two traditional clustering algorithms and four community detection algorithms in the context of association rule preprocessing, using six data sets. The study used 8 metrics, all of them proposed in [6], to analyze the results. We presented and compared the results obtained by the Spinglass algorithm with the two clustering algorithms because it was the community detection algorithm that obtained the best results among the four that were selected. The complete results, containing all six algorithms, can be seen at http://sites.labic.icmc.usp.br/padua/ResultadosENIAC2016.xlsx.
The results demonstrated that PAM and Spinglass had a very similar performance, both of them obtaining good results. However, the Spinglass algorithm performed better on the metrics that analyze the amount of new knowledge generated, indicating that it can generate more novelty than the clustering methods. On the metrics that analyze the amount of maintained knowledge, PAM performed better, indicating that PAM is better at maintaining the knowledge that would be generated in the AR set. Ward did not show good results. Therefore, the results indicate that Spinglass has the potential to discover new rules that bring novelty to the user. Moreover, this algorithm was able to concentrate all the interesting rules (the rules among the h-top best rules) in a small number of clusters, indicating that the partitioning is more concise.
This initial exploration shows interesting results. However, there are many improvements to be made. As seen, there is a need to propose a community detection algorithm focused on the context of association rule preprocessing. Such an algorithm must consider the intrinsic characteristics of the context, such as the high density, and must be capable of splitting the domain in a way that makes the interesting knowledge appear together. Besides, a deeper study needs to be done, considering more data sets and more algorithms, to better analyze the results obtained by each type of grouping.
Acknowledgment
We wish to thank CAPES and the São Paulo Research Foundation (FAPESP), Grant 2014/08996-0, for the financial support.
References
[1] Aggarwal, C., Procopiuc, C., and Yu, P. (2002). Finding localized associations in market basket data. IEEE Transactions on Knowledge and Data Engineering, 14(1):51-62.
[2] Agrawal, R., Imielinski, T., and Swami, A. (1994). Mining association rules between sets of items in large databases. Special Interest Group on Management of Data, 207-216.
[3] Alonso, A. G., Carrasco-Ochoa, J. A., Medina-Pagola, J. E., and Trinidad, J. F. M. (2011). Reducing the number of canonical form tests for frequent subgraph mining. Computación y Sistemas, 251-265.
[4] Berrado, A. and Runger, G. C. (2007). Using metarules to organize and group discovered association rules. Data Mining and Knowledge Discovery, 14(3):409-431.
[5] Carvalho, V. O., dos Santos, F. F., Rezende, S. O., and de Padua, R. (2011). PAR-COM: a new methodology for post-processing association rules. Proceedings of ICEIS, Springer, pp. 66-80.
[6] Carvalho, V. O., dos Santos, F. F., and Rezende, S. O. (2015). Metrics for association rule clustering assessment. Transactions on Large-Scale Data- and Knowledge-Centered Systems XVII, 97-127.
[7] Carvalho, V. O., Padua, R., and Rezende, S. O. (2016). Solving the problem of selecting suitable objective measures by clustering association rules through the measures themselves. SOFSEM 2016, 505-517.
[8] Koh, Y. S. and Pears, R. (2008). Rare association rule mining via transaction clustering. In Proceedings of the Seventh Australasian Data Mining Conference, 87-94.
[9] Maquee, A., Shojaie, A. A., and Mosaddar, D. (2012). Clustering and association rules in analyzing the efficiency of maintenance system of an urban bus network. International Journal of System Assurance Engineering and Management, 175-183.
[10] Murtagh, F. and Legendre, P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31:274-295.
[11] Newman, M. E. J. (2010). Networks: An Introduction. Oxford University Press.
[12] Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3).
[13] Özkural, E., Uçar, B., and Aykanat, C. (2011). Parallel frequent item set mining with selective item replication. IEEE Transactions on Parallel and Distributed Systems, 1632-1640.
[14] Plasse, M., Niang, N., Saporta, G., Villeminot, A., and Leblond, L. (2007). Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set. Computational Statistics & Data Analysis, 52(1):596-613.
[15] Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks (long version). Computer and Information Sciences - ISCIS 2005, 284-293.
[16] Reichardt, J. and Bornholdt, S. (2006). Statistical mechanics of community detection. Physical Review E, 74, 016110.
[17] Videla-Cavieres, I. F. and Ríos, S. A. (2014). Extending market basket analysis with graph mining techniques: a real case. Expert Systems with Applications, 1928-1936.