XIII Encontro Nacional de Inteligência Artificial e Computacional

Preprocessing data sets for association rules using community detection and clustering: a comparative study

Renan de Padua¹, Exupério Lédo Silva Junior¹, Laís Pessine do Carmo¹, Veronica Oliveira de Carvalho² and Solange Oliveira Rezende¹

¹ Instituto de Ciências Matemáticas e de Computação, USP - Universidade de São Paulo, São Carlos, Brasil
{padua,solange}@icmc.usp.br, {exuperio.silva, lais.carmo}@usp.br

² Instituto de Geociências e Ciências Exatas, UNESP - Univ Estadual Paulista, Rio Claro, Brasil
[email protected]

Abstract. Association rules are widely used to analyze correlations among items in databases. One of the main drawbacks of association rule mining is that the algorithms usually generate a large number of rules that are not interesting or are already known by the user. Finding new knowledge among the generated rules makes the exploration of association rules a challenge in itself. One possible solution is to raise the support and confidence values, resulting in the generation of fewer rules. The problem with this approach is that, as the support and confidence values come closer to 100%, the generated rules tend to be formed by dominant items and to be more obvious. Some research has been carried out on the use of clustering algorithms to prepare the database before extracting the association rules, so the grouped data can take into account the items that appear only in a part of the database. However, even with the good results that clustering methods have shown, they rely only on similarity (or distance) measures, which makes the clustering limited. In this paper, we evaluate the use of community detection algorithms to preprocess databases for association rules. We compare the community detection algorithms with two clustering methods, aiming to analyze the novelty of the generated patterns. The results show that community detection algorithms perform well regarding the novelty of the generated patterns.

1. Introduction

Association rules are widely used to extract correlations among items in a given database due to their simplicity. The rules have the pattern LHS ⇒ RHS, where LHS is the left-hand side (rule antecedent) and RHS is the right-hand side (rule consequent). The rule LHS ⇒ RHS holds with a probability of c%, the confidence of the rule, which, along with the support, is used to generate the association rules [1].

Association rule extraction is generally done in two steps: (i) frequent itemset mining and (ii) pattern extraction. Step (i) is done using the support, which measures the fraction of transactions in which an itemset occurs (an itemset is a set of items that occur together in the database). For example, given an itemset {a, b, c}, its support is the number of transactions that contain all three items divided by the total number of transactions. Step (ii) consists of combining all the frequent itemsets and calculating the confidence of each possible rule. The confidence measures the quality of the rule, i.e., the probability of c% that the RHS will occur provided that the LHS occurred. The confidence is calculated by dividing the support of the rule by the support of the LHS.
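To make steps (i) and (ii) concrete, the following minimal sketch computes the support of an itemset and the confidence of a rule over a toy transaction list; the data and function names are illustrative and not taken from the paper.

```python
# Minimal sketch of support and confidence, assuming each transaction is a set of items.
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of the whole rule divided by the support of its antecedent (LHS)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

if __name__ == "__main__":
    transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    print(support({"a", "b"}, transactions))       # 2 of 4 transactions -> 0.5
    print(confidence({"a"}, {"b"}, transactions))  # sup({a,b}) / sup({a}) = 0.5 / 0.75
```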
This process has two main drawbacks: (a) items that are not dominant in the database (i.e., have low support) but have interesting correlations are usually not found. This is the "local knowledge", the interesting knowledge that represents a portion of the database but is not very frequent; (b) the number of generated rules that are not interesting normally surpasses the user's ability to explore them.

Aiming to solve the first drawback (a), studies have investigated preprocessing the database before submitting it to association rule extraction. One way of doing this is to cluster the data used to generate the association rules, aiming to extract, inside each group, the local knowledge that would not be extracted from the original database [8, 9]. These clustering approaches obtained interesting results, since they can (i) extract the local knowledge from databases and (ii) organize the domain. However, traditional clustering methods rely only on similarity (or distance) measures, which can disregard important characteristics of the database.

One area that is growing in the literature is community detection, which groups data modeled as a complex network. These methods have shown good performance, but so far they have been used only to group the data, not to support association rule mining. Community detection algorithms are capable of finding structural communities in the database using networks [12]. Therefore, this characteristic may improve association rule mining [17, 13].

Based on the above, this paper analyzes community detection algorithms in the context of association rule mining, comparing the results to traditional clustering methods. The results show that community detection algorithms provide a different analysis of the data, exploring a different part of the knowledge and generating new and possibly interesting associations.

The paper is organized as follows. Section 2 describes the related research. Section 3 presents the assessment methodology used. Section 4 presents the experiment configurations. Section 5 discusses the results obtained in the experiments. Finally, conclusions and future work are given in Section 6.

2. Background

The preprocessing area aims to reduce the number of items to be analyzed by the association rule extraction algorithm in order to consider the items that compose the local knowledge of the database. In the literature, preprocessing studies normally use clustering algorithms to find the local knowledge, putting similar knowledge together. This way, items that have low support will be considered by the association rule extraction algorithms, and the dominant items will be split across the groups and become less dominant.

In [1] a clustering algorithm called CLASD (CLustering for ASsociation Discovery) was proposed to cluster the transactions of a database. The authors proposed a measure to calculate the similarity among transactions considering the items they have. The measure is shown in Equation 1, where CountT() returns the number of transactions that contain the items inside the parentheses, Tx ∩ Ty is the set of items shared by both transactions, and Tx ∪ Ty is the set of items that appear in at least one of the two transactions.

Sim(Tx, Ty) = CountT(Tx ∩ Ty) / CountT(Tx ∪ Ty)    (1)

The algorithm is hierarchical, using a bottom-up strategy, which means that each transaction starts as a unitary group and groups are merged until a parameter k, which represents the desired number of groups, is met. The merge between two groups is done using complete linkage.
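A minimal sketch of the CLASD similarity of Equation 1, under our reading of [1]: CountT() counts how many transactions of the whole database contain a given set of items. The helper names and the zero-division guard are our assumptions.

```python
# Sketch of the CLASD transaction similarity (Equation 1), assuming each
# transaction is a set of items and `database` is the full list of transactions.
def count_t(items, database):
    """Number of transactions in the database that contain all the given items."""
    return sum(1 for t in database if items <= t)

def clasd_similarity(tx, ty, database):
    """Sim(Tx, Ty) = CountT(Tx ∩ Ty) / CountT(Tx ∪ Ty)."""
    denom = count_t(tx | ty, database)
    return count_t(tx & ty, database) / denom if denom else 0.0
```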
The results showed that the proposed approach was capable of finding more frequent itemsets than both a random partition and the original data, creating new rules to be explored that would not be extracted in the other cases.

In [14] a different approach is proposed. It still clusters the data but, instead of clustering the transactions, it clusters the items of the database. The study calculates the similarity among items based on the transactions they appear in, as shown in Equation 2, where Trans(Ix) returns all the transactions that contain the item Ix, and ∩ and ∪ have the same meaning as in Equation 1. Besides this measure, the authors used three other measures to calculate the items' similarity and applied several clustering algorithms over the database. The results demonstrated that the proposed approach performs well on sparse data, where the general support is extremely low, extracting rules that would not be generated if the association rule extraction algorithm were applied over the non-clustered data set.

Sim(Ix, Iy) = CountT(Trans(Ix) ∩ Trans(Iy)) / CountT(Trans(Ix) ∪ Trans(Iy))    (2)

Both approaches aimed to find rules that would not be generated from the original database. However, the analysis performed by the authors considers only the number of new rules (or new itemsets) that were mined. They do not analyze the results in terms of the ratio of new rules, the number of maintained rules, and other aspects that verify the quality of the preprocessing algorithms (see Section 3). More work using clustering can be seen in [8] and [9].

Besides these approaches, some works use community detection algorithms to preprocess the data set. Community detection algorithms search for natural groups that occur in a network, regardless of their number and size, and are used as a primary tool to discover and understand the large-scale structure of networks [12]. The main goal of community detection algorithms is to split the database into groups according to the network structure. The main difference from clustering methods is that community detection algorithms do not use only the similarity among the items to do the split, but also the network structure to find the groups [12].

In [17] the authors used a structure called Product Network: each vertex of the network represents an item and the edge (link) between two vertices represents the number of transactions in which they occur together. First, the authors created the Product Network and applied a filter, reducing the number of connections among different products. Then, a community detection algorithm was applied to group the products. The authors obtained a large number of communities, around 28 or more on each data set they used, each one containing a mean of 7 items per group. The entire discussion is made on the amount of knowledge to be explored in each group, making the exploration and understanding of each group easier for the user. More works using community detection algorithms can be seen in [13] and [3].
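To make the Product Network of [17] concrete, the sketch below builds an item co-occurrence graph with python-igraph (neither the paper nor [17] prescribes this library); the minimum co-occurrence filter and the function name are illustrative assumptions.

```python
# Sketch of a Product Network as in [17]: vertices are items, edge weights count
# how many transactions contain both items. python-igraph and the filter value
# are our assumptions, not the exact procedure of [17].
from collections import Counter
from itertools import combinations

import igraph as ig

def product_network(transactions, min_cooccurrence=1):
    cooc = Counter()
    for t in transactions:
        for a, b in combinations(sorted(t), 2):
            cooc[(a, b)] += 1
    edges = [(a, b, w) for (a, b), w in cooc.items() if w >= min_cooccurrence]
    # TupleList builds a graph from (source, target, weight) tuples.
    return ig.Graph.TupleList(edges, directed=False, weights=True)

# A community detection step could then be, for example:
# communities = product_network(transactions).community_multilevel(weights="weight")
```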
3. Assessment Methodology

The approaches available in the literature have presented good results, as described in Section 2, but their analyses do not consider some important points. Moreover, traditional clustering methods rely only on similarity (or distance) measures, which can disregard important characteristics of the database. The use of complex networks to find groups can bring important features into the grouping process, such as the position of the elements and the data density.

Figure 1. The assessment methodology.

Based on this, this paper proposes an analysis of community detection algorithms, which use the network structure to find groups, to aid the association rule mining process. The proposed analysis is illustrated in Figure 1. It consists of comparing the results obtained by traditional clustering methods with the results obtained by community detection algorithms. For that, some metrics proposed in [6] are used (described below). To compute the metrics, two processes are executed: the traditional one (A) and the one that uses community detection (B). First, the original association rule set (AR) is mined from each database. This set is used to compute the evaluation metrics (see below). Then, for each database, the similarity among all transactions is calculated. After that, in (A), a traditional clustering algorithm is applied. Inside each obtained group, an AR extraction algorithm is executed, generating groups of rules. All of these rules form the clustered AR set (ARcl). In (B), based on the computed similarities, the data is modeled as a network: vertices are transactions and edges carry the computed similarity between them. Then, a community detection algorithm is applied. Inside each obtained group, an AR extraction algorithm is executed, generating groups of rules. All of these rules form the community detection AR set (ARcd). Considering the obtained AR sets, the evaluation metrics are computed and the comparison is done.

Regarding the evaluation metrics, we used some of the metrics proposed in [6]. Some of them evaluate the benefits obtained from the data grouping, which means that some measures need to consider both the original association rule set (AR) and an association rule set generated after the grouping (ARcl or ARcd). Besides, the authors use the concept of h-top best rules, which consists of selecting the h% best rules, based on an objective measure, from the AR set and the h% best rules from the grouped set to be analyzed (ARcl or ARcd). These are the rules considered the most interesting to the user according to the chosen objective measure. In this work, h was set to 1% of the total number of rules and the objective measure used was Lift. For details about objective measures see [7]. The metrics selected for this work are:

• MO-RSP: ratio of the rules in the AR set that were kept in ARcl or ARcd. The aim is to analyze the amount of knowledge that was maintained; the higher the value, the better the result.
• MR-O-RSP: ratio of the rules in the AR set that were generated more than once in ARcl or ARcd (the same rule can be extracted in different groups). The aim is to analyze the amount of knowledge that was repeatedly generated in different groups; the lower the value, the better the result.
• MN-RSP: ratio of new rules in ARcl or ARcd. A rule is new if it is not in the AR set. The aim is to analyze the number of rules generated in ARcl or ARcd that were not generated in AR; the higher the value, the better the result.
• MR-N-RSP: ratio of new rules that were generated more than once in ARcl or ARcd (the same rule can be extracted in different groups). The aim is to analyze the repetition of new rules in ARcl or ARcd; the lower the value, the better the result.
• MN-I-RSP: ratio of new rules generated in ARcl or ARcd that are among the h-top best rules of the set (ARcl or ARcd). The aim is to analyze the number of new rules that are considered interesting; the higher the value, the better the result.
• MO-I-N-RSP: ratio of rules among the h-top best rules of the AR set that are not contained in ARcl or ARcd. The aim is to analyze the number of interesting rules in AR that were lost in ARcl or ARcd; the lower the value, the better the result.
• MC-I: ratio of rules that are among the h-top best rules of both AR and ARcl or ARcd. The aim is to analyze the number of rules that are considered interesting both in AR and in ARcl or ARcd; for this measure, no consensus was reached by the specialists.
• MNC-I-RSP: ratio of clusters that contain all the h-top best rules of ARcl or ARcd. The aim is to analyze the percentage of groups that need to be explored in order to find all the h-top best rules; the lower the value, the better the result.

It is important to highlight that, even though the metric descriptions always cite the grouped sets together ("ARcl or ARcd"), the measures are applied over only one grouped set at a time. This means that if we have 4 different algorithms, the metrics are calculated for each of these algorithms separately from the others. The two sets are always cited together to strengthen the notion that the metrics were calculated for all the grouped data sets.
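As a rough illustration of how the ratio-style metrics above can be computed, the sketch below implements MO-RSP and MN-RSP under a set-based reading of [6]: a rule is represented as a hashable (LHS, RHS) pair and a grouped set is the union of the rules mined in each group. The exact denominators used in [6] may differ; this is our simplification.

```python
# Sketch of two of the assessment metrics of [6], under a set-based reading.
# `ar` is the set of rules mined from the original database; a rule is a hashable
# pair (frozenset_LHS, frozenset_RHS); `grouped_rule_sets` is the list of rule
# sets mined inside each group (their union plays the role of ARcl or ARcd).
def mo_rsp(ar, grouped_rule_sets):
    """MO-RSP: fraction of the original AR rules kept after grouping."""
    grouped = set().union(*grouped_rule_sets) if grouped_rule_sets else set()
    return len(ar & grouped) / len(ar)

def mn_rsp(ar, grouped_rule_sets):
    """MN-RSP: fraction of the grouped rules that are new with respect to AR."""
    grouped = set().union(*grouped_rule_sets) if grouped_rule_sets else set()
    return len(grouped - ar) / len(grouped) if grouped else 0.0
```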
4. Experimental Setup

The assessment methodology was applied to six databases: balance-scale (bs), breast-cancer (bc), car, dermatology (der), tic-tac-toe (ttt) and zoo. All of them are available at the UCI repository (http://archive.ics.uci.edu/ml/datasets.html). These databases were processed and converted to an attribute/value format. The modified databases can be downloaded from http://sites.labic.icmc.usp.br/padua/basesENIAC2016.zip.

To generate the association rules, the apriori algorithm [2] was used, with the implementation available at Christian Borgelt's homepage (http://www.borgelt.net/apriori.html). To obtain the AR set, the minimum support and the minimum confidence were empirically defined aiming to extract from 1,000 to 2,500 rules (due to the sensibility of the tic-tac-toe data set, a greater number of rules was generated for it). These values can be seen in Table 1. The second column shows the number of transactions, namely the number of available examples. The third column shows the number of attributes of each data set. It is important to say that, after the conversion to the attribute/value format, each attribute was divided into N attributes, N being the number of possible values the attribute had. For example, if a data set contains the attribute "color" with 3 possible values, "blue", "yellow" and "red", then the processed data set will have 3 attributes: "color=blue", "color=yellow" and "color=red". The fourth column shows the minimum support and minimum confidence used. The last column shows the number of generated rules.

Table 1. Databases details.
Data set             #Transac  #Attrib  Sup/Conf  #Rules
balance-scale (bs)        625       25   1%/5%      1056
breast-cancer (bc)        286       53   10%/15%    2220
car                      1728       25   1%/5%       758
dermatology (der)         366      135   70%/75%    2687
tic-tac-toe (ttt)         958       29   5%/10%     4116
zoo                       101      147   40%/50%    1637

To split the transactions into groups, it is necessary to compute a similarity measure over all transactions. The similarity measure used was Jaccard, presented in Equation 3, where Items(Tx) returns all the items contained in transaction Tx and #(Y) returns the number of items contained in Y. The dissimilarity between Tx and Ty is computed as 1 − Jaccard(Tx, Ty).

Jaccard(Tx, Ty) = #(Items(Tx) ∩ Items(Ty)) / #(Items(Tx) ∪ Items(Ty))    (3)

Regarding process (A) in Figure 1, two clustering algorithms were used: Partitioning Around Medoids (PAM) and Ward [10], both available in R (https://www.r-project.org/). We selected these algorithms because PAM is a partitional clustering method, which divides the data set according to a parameter K, while Ward is a hierarchical method, which creates a dendrogram that can be cut at different heights. In this paper, the number of generated groups was set according to the number of groups obtained from the community detection algorithms (it is important to use the same number of groups to compute the evaluation metrics).

Regarding process (B) in Figure 1, which uses community detection algorithms, the data were modeled as a simple homogeneous network, each node being a transaction and the edge weight between two nodes being the similarity between them (a similarity equal to 0 means no connection between the transactions). Four different community detection algorithms were used, all of them available in igraph (http://igraph.org/): Modularity based [11], Leading Eigenvector [12], Spinglass [16] and Walktrap [15]. These algorithms were selected due to their variety of characteristics. The modularity-based algorithm does not need a parameter definition. Leading Eigenvector needs the number of clusters; however, igraph has a default configuration, which was used. The Walktrap algorithm has the number of steps as a parameter, which was set to 4 according to [15]. The Spinglass parameters were defined according to [16]. The Modularity and Spinglass algorithms start from a random division of the vertices into different groups and go through the entire network, changing the group of vertices and recalculating the values. The Walktrap algorithm starts with every vertex in its own group and merges the closest groups at each iteration. A general overview of these algorithms is given below:

• Modularity based [11]: this community detection algorithm uses the modularity measure, which penalizes connections among vertices in different communities and rewards connections among vertices in the same community.
• Leading Eigenvector [12]: this algorithm uses the concepts of eigenvector and eigenvalue, calculated over the similarity matrix, to group the data.
• Spinglass [16]: this algorithm uses a measure divided into four parts. Two of them reward the connections among vertices in the same community and the lack of connections among vertices in different communities. The other two penalize the connections among vertices in different communities and the lack of connections among vertices in the same community.
• Walktrap [15]: this algorithm uses Ward's method, merging two communities at each step according to the squared distance between them.
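To make process (B) concrete, the sketch below computes the pairwise Jaccard similarities of Equation 3, builds the weighted transaction network and runs the four community detection algorithms with python-igraph. The paper used the R version of igraph, so this translation, the use of fastgreedy for the "modularity based" algorithm and the default Leading Eigenvector call are our assumptions; the 4-step Walktrap follows the text above.

```python
# Sketch of process (B): Jaccard similarities (Equation 3), a weighted transaction
# network, and the four community detection algorithms via python-igraph.
from itertools import combinations

import igraph as ig

def jaccard(tx, ty):
    """Equation 3: |Items(Tx) ∩ Items(Ty)| / |Items(Tx) ∪ Items(Ty)|."""
    return len(tx & ty) / len(tx | ty)

def transaction_network(transactions):
    """Vertices are transactions; edges carry the Jaccard similarity (0 = no edge)."""
    edges, weights = [], []
    for i, j in combinations(range(len(transactions)), 2):
        sim = jaccard(transactions[i], transactions[j])
        if sim > 0:
            edges.append((i, j))
            weights.append(sim)
    g = ig.Graph(n=len(transactions), edges=edges)
    g.es["weight"] = weights
    return g

def detect_communities(g):
    """Runs the four algorithms; Spinglass assumes a connected graph."""
    w = g.es["weight"]
    return {
        "modularity": g.community_fastgreedy(weights=w).as_clustering(),
        "eigenvector": g.community_leading_eigenvector(),  # igraph defaults
        "spinglass": g.community_spinglass(weights=w),     # default settings
        "walktrap": g.community_walktrap(weights=w, steps=4).as_clustering(),
    }
```

The number of communities found by, e.g., Spinglass can then be passed to PAM and Ward so that process (A) is run with the same number of groups, as required by the evaluation.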
Finally, to define the minimum support and minimum confidence to be used inside the groups (as presented in Figure 1), the following strategy was used: the largest group, i.e., the one containing the highest number of transactions, was selected, and the minimum support and minimum confidence were empirically defined aiming to extract no more than 1,000 rules. The obtained values were then applied to all the other groups. This process was done for each data set. Table 2 presents the number of groups and the minimum support and confidence that were used. The table presents the configurations only for the Spinglass community detection algorithm, since it obtained the best results among the community detection algorithms. All the results, containing all the configurations, can be seen at http://sites.labic.icmc.usp.br/padua/ResultadosENIAC2016.xlsx. It is important to say that for the dermatology data set the support and confidence used in the Ward clustered set had to be lowered to 90%/90%, since no rule was generated using 99%/99%. Likewise, for the zoo data set, the values were lowered to 40%/50% for Ward's algorithm for the same reason.

Table 2. Support and confidence values used on the grouped data.
Data set             # of groups  Sup/Conf Group
balance-scale (bs)             3      5%/20%
breast-cancer (bc)             3     25%/50%
car                            9     10%/50%
dermatology (der)              3     99%/99%
tic-tac-toe (ttt)              9     20%/50%
zoo                            3     90%/90%
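The per-group extraction described in this section can be sketched as follows: rules are mined separately inside each group, with the support/confidence pair tuned on the largest group, and their union forms ARcl or ARcd. The brute-force miner below is only a self-contained stand-in for Borgelt's apriori implementation used in the paper, and all helper names are ours.

```python
# Sketch of per-group rule extraction; `membership[i]` gives the group of
# transaction i. The brute-force miner replaces apriori only for illustration.
from itertools import combinations

def mine_rules(transactions, min_sup, min_conf, max_len=3):
    """Frequent itemsets up to `max_len` items, then all LHS => RHS rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    support = {}
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= min_sup:
                support[frozenset(cand)] = s
    rules = set()
    for itemset, sup in support.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                if lhs in support and sup / support[lhs] >= min_conf:
                    rules.add((lhs, itemset - lhs))
    return rules

def rules_per_group(transactions, membership, min_sup, min_conf):
    """Union of the rules mined inside each group (ARcd when groups are communities)."""
    ar_grouped = set()
    for g in set(membership):
        group = [t for t, m in zip(transactions, membership) if m == g]
        ar_grouped |= mine_rules(group, min_sup, min_conf)
    return ar_grouped
```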
5. Results and Discussion

The values obtained for each metric are presented in Tables 3 to 6. Table 3 presents the results regarding MO-RSP and MR-O-RSP. The first column presents the database name (see Table 1). The second to fourth columns present the MO-RSP values for Spinglass (SP), PAM and Ward (WD), respectively, and the fifth to seventh columns present the MR-O-RSP values for the same algorithms. Only the results regarding Spinglass are presented since it obtained the best results among the community detection algorithms. All the other tables (Tables 4, 5 and 6) follow the same pattern. The best results in each data set are highlighted in gray.

As mentioned, Table 3 presents the results obtained by MO-RSP and MR-O-RSP. The first metric analyzes the amount of knowledge maintained from the AR set and the second the amount of this knowledge that is repeated. It can be seen that Spinglass presented a better performance in 3 of the 6 data sets on MO-RSP, while PAM performed better in 2 and Ward in 1 data set. The value obtained in the second metric, 0% of repetition, was the same for all algorithms and data sets, which is a very interesting result: it shows that no rule is repeated in either ARcd or ARcl. By analyzing these measures together, it can be seen that the Spinglass algorithm is capable of generating new knowledge while maintaining part of the knowledge of the AR set. The PAM algorithm presents a similar behavior; however, it was not as effective as Spinglass. The Ward algorithm won in the zoo data set because the support and confidence had to be lowered to the same values used for the AR set, due to the lack of generated rules.

Table 3. Results obtained by MO-RSP and MR-O-RSP metrics.
                    MO-RSP                        MR-O-RSP
Base      SP        PAM       WD         SP       PAM      WD
bs        64.39%    61.74%     9.19%     0.00%    0.00%    0.00%
bc        59.23%    55.32%    11.80%     0.00%    0.00%    0.00%
car       89.18%    99.47%    42.08%     0.00%    0.00%    0.00%
der       19.09%     7.03%     0.33%     0.00%    0.00%    0.00%
ttt       43.33%    60.42%     2.04%     0.00%    0.00%    0.00%
zoo       32.25%    31.77%   100%        0.00%    0.00%    0.00%

Table 4 presents the results obtained by MN-RSP and MR-N-RSP. The first metric analyzes the ratio of new knowledge generated and the second the ratio of new knowledge that is repeated across different groups. In MN-RSP there is a tie between Spinglass and PAM, each winning in 3 data sets. However, in the tic-tac-toe data set the amount of new knowledge generated by Spinglass is almost 40% higher than the amount generated by PAM, whereas the biggest difference in the cases PAM won is about 10%. In the second metric a tie also occurred: both PAM and SP had the best results in 4 data sets. However, a closer look at the results shows that Spinglass is more stable than PAM, as seen in the balance-scale data set: PAM obtained 100% in one case, which is the worst possible value for this metric, while the worst result obtained by SP was 11.97%.

Table 4. Results obtained by MN-RSP and MR-N-RSP metrics.
                    MN-RSP                        MR-N-RSP
Base      SP        PAM       WD         SP       PAM      WD
bs        10.29%     0.00%    0.00%      0.00%   100%     100%
bc         4.99%    14.84%    0.00%      0.00%     0.00%  100%
car       76.83%    76.82%    0.93%     10.99%     8.42%    0.00%
der       76.07%    85.88%    0.00%      0.00%     0.00%  100%
ttt       87.89%    48.88%    0.00%     11.97%     0.08%  100%
zoo       84.85%    86.62%   47.18%      0.00%     0.00%    4.32%

Table 5 presents the results obtained from MN-I-RSP and MO-I-N-RSP. The first metric analyzes the ratio of new rules in ARcl or ARcd that are among the h-top interesting rules of their sets, and the second the ratio of rules among the h-top rules of the AR set that are not found in ARcl or ARcd. Regarding the first metric, Spinglass won in most of the data sets, meaning that it generated more interesting new rules than the other algorithms. Even in the cases it lost (car and zoo), the difference is no more than about 1%, while in the cases it won, the difference reaches almost 7%. In the second metric all the algorithms performed badly, obtaining only 3 values different from 100% (2 for PAM and 1 for Ward). This means that, in almost all the cases, the rules among the h-top best rules of the AR set were not found in ARcl or ARcd. Analyzing these two metrics together, it is possible to see that Spinglass brought more novelty in the generated knowledge than the others, not maintaining what was considered interesting before (100% on all data sets). On the other hand, the other algorithms generated less new interesting knowledge but, in some cases, maintained some of the rules that were considered interesting in the AR set.

Table 5. Results obtained by MN-I-RSP and MO-I-N-RSP metrics.
                    MN-I-RSP                      MO-I-N-RSP
Base      SP        PAM       WD         SP       PAM      WD
bs        87.50%    85.71%   66.67%     100%     100%     100%
bc        93.75%    62.50%   85.71%     100%      77.27%  100%
car       97.44%    98.00%   35.71%     100%     100%      57.14%
der       95.65%    93.33%    0.00%     100%     100%     100%
ttt       99.44%    92.98%   66.67%     100%      92.68%  100%
zoo       97.22%    97.44%   98.25%     100%     100%     100%

Table 6 presents the results obtained from MC-I and MNC-I-RSP. The first metric analyzes the ratio of rules contained in both h-top sets, that is, rules that are among the h-top best of the AR set and also among the h-top best of ARcl or ARcd. The second metric analyzes the percentage of clusters that need to be explored in order to find all the h-top rules of ARcl or ARcd. For the first metric no cell was highlighted because the specialists did not reach a consensus in [6] regarding the interpretation of its values. A value of 0% means that none of the rules selected as the h-top best in ARcl or ARcd were selected as the h-top best in AR. This means that all the interesting knowledge in ARcl or ARcd is new, directing the user to explore an entirely new set of interesting rules compared to the AR set. In the second metric the Spinglass algorithm won again, concentrating all the interesting rules in fewer groups than the others: Spinglass won in 4 data sets, PAM won in 2 (1 tie with SP), and the zoo data set had a tie among all 3 algorithms.

Table 6. Results obtained by MC-I and MNC-I-RSP metrics.
                    MC-I                          MNC-I-RSP
Base      SP        PAM       WD         SP       PAM      WD
bs         0.00%     0.00%    0.00%     33.33%    66.67%  100%
bc         0.00%    22.73%    0.00%    100%       66.67%  100%
car        0.00%     0.00%   28.57%     11.11%    33.33%   88.89%
der        0.00%     0.00%    0.00%     33.33%    33.33%  100%
ttt        0.00%     7.32%    0.00%     11.11%    77.78%  100%
zoo        0.00%     0.00%    0.00%     33.33%    33.33%   33.33%

In the metrics MN-RSP, MN-I-RSP and MO-I-N-RSP, which analyze the new knowledge, the Spinglass algorithm performed better than PAM and Ward. Together, these measures indicate that the Spinglass algorithm can be used when the user wants to obtain new knowledge, without the need to maintain what was considered interesting in the non-clustered data set. In the metrics MO-RSP and MC-I, which analyze the maintained knowledge, the PAM algorithm obtained higher results. This means that the PAM algorithm was also capable of finding new knowledge, but part of the interesting knowledge from the original data set was maintained together with the newly obtained interesting results. In summary, the results indicate that the Spinglass algorithm can generate more new knowledge, while PAM and Ward are better at maintaining the knowledge that was previously considered interesting.

6. Conclusion

This paper presented a comparative study of two traditional clustering algorithms and four community detection algorithms in the context of association rule preprocessing, using six data sets. The study used 8 metrics, all of them proposed in [6], to analyze the results. In this paper, we presented and compared the results obtained by the Spinglass algorithm against two clustering algorithms, because Spinglass obtained the best results among the four community detection algorithms selected. The complete results, containing all six algorithms, can be seen at http://sites.labic.icmc.usp.br/padua/ResultadosENIAC2016.xlsx. The results demonstrated that PAM and Spinglass had a very similar performance, both of them obtaining good results.
However, the Spinglass algorithm performed better in the metrics that analyzed the amount of new knowledge generated. This indicates that the Spinglass algorithm can generate more novelty compared to the clustering methods. On the metrics that analyzed the amount of knowledge maintained, PAM performed better. This indicates that PAM is better at maintaining the knowledge that would be generated in the AR set. Ward did not show good results. Therefore, the results indicate that Spinglass has the potential to discover new rules that bring novelty to the user. Also, this algorithm was able to generate all the interesting rules (the rules among the h-top best rules) in a small number of clusters, indicating that the partitioning is more concise.

This initial exploration shows interesting results. However, there are many improvements to be made. As seen, there is a need to propose a community detection algorithm focused on the context of association rule preprocessing. This algorithm must consider the intrinsic characteristics of the context, such as the high density, and must be capable of splitting the domain in a way that the interesting knowledge appears together. Besides, a deeper study needs to be done, considering more data sets and more algorithms, to better analyze the results obtained by each type of grouping.

Acknowledgment

We wish to thank CAPES and FAPESP (Grant 2014/08996-0, São Paulo Research Foundation) for the financial aid.

References

[1] Aggarwal, C., Procopiuc, C., and Yu, P. (2002). Finding localized associations in market basket data. IEEE Transactions on Knowledge and Data Engineering, 14(1):51–62.
[2] Agrawal, R., Imielinski, T., and Swami, A. (1994). Mining association rules between sets of items in large databases. Special Interest Group on Management of Data, 207–216.
[3] Alonso, A. G., Carrasco-Ochoa, J. A., Medina-Pagola, J. E., and Trinidad, J. F. M. (2011). Reducing the number of canonical form tests for frequent subgraph mining. Computación y Sistemas, 251–265.
[4] Berrado, A. and Runger, G. C. (2007). Using metarules to organize and group discovered association rules. Data Mining and Knowledge Discovery, 14(3):409–431.
[5] Carvalho, V. O., dos Santos, F. F., Rezende, S. O., and de Padua, R. (2011). PAR-COM: a new methodology for post-processing association rules. Proceedings of ICEIS, Springer, 66–80.
[6] Carvalho, V. O., dos Santos, F. F., and Rezende, S. O. (2015). Metrics for association rule clustering assessment. Transactions on Large-Scale Data- and Knowledge-Centered Systems XVII, 97–127.
[7] Carvalho, V. O., Padua, R., and Rezende, S. O. (2016). Solving the problem of selecting suitable objective measures by clustering association rules through the measures themselves. SOFSEM 2016, 505–517.
[8] Koh, Y. S. and Pears, R. (2008). Rare association rule mining via transaction clustering. In Proceedings of the Seventh Australasian Data Mining Conference, 87–94.
[9] Maquee, A., Shojaie, A. A., and Mosaddar, D. (2012). Clustering and association rules in analyzing the efficiency of maintenance system of an urban bus network. International Journal of System Assurance Engineering and Management, 175–183.
[10] Murtagh, F. and Legendre, P. (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31:274–295.
[11] Newman, M. E. J. (2010). Networks: An Introduction. Oxford University Press.
[12] Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3).
[13] Özkural, E., Uçar, B., and Aykanat, C. (2011). Parallel frequent item set mining with selective item replication. IEEE Transactions on Parallel and Distributed Systems, 1632–1640.
[14] Plasse, M., Niang, N., Saporta, G., Villeminot, A., and Leblond, L. (2007). Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set. Computational Statistics & Data Analysis, 52(1):596–613.
[15] Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks (long version). Computer and Information Sciences - ISCIS 2005, 284–293.
[16] Reichardt, J. and Bornholdt, S. (2006). Statistical mechanics of community detection. Physical Review E, 74:016110.
[17] Videla-Cavieres, I. F. and Ríos, S. A. (2014). Extending market basket analysis with graph mining techniques: a real case. Expert Systems with Applications, 1928–1936.