Diverse Subgroup Set Discovery using a Novel
Genetic Algorithm
Shanjida Khatun, Swakkhar Shatabda
Abstract—When the search space is too large or a small set of patterns must be selected from a large dataset, exhaustive search techniques do not perform well. Large data is challenging for most existing discovery algorithms because many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets while ignoring many potentially interesting results. These problems are particularly apparent with pattern set discovery and its generalization, exceptional model mining. To address this, we deal with the discriminative, or diverse, pattern set mining problem. In this paper, we investigate an approach that uses a genetic algorithm to mine diverse sets of frequent patterns. We propose a fast genetic algorithm with several novel components that outperforms state-of-the-art methods on a standard set of benchmarks and is capable of producing satisfactory results within a very short period of time on both large and small datasets. Our proposed genetic algorithm uses a relative encoding scheme for the patterns, an effective twin removal technique to ensure diversity throughout the search, and a random restart technique to avoid getting stuck in local optima.
Index Terms—Pattern set mining; large neighborhood search; genetic algorithm.

Shanjida Khatun is with the Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh (e-mail: [email protected]). Swakkhar Shatabda is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh (e-mail: [email protected]).
I. INTRODUCTION
In the area of pattern set mining, the process of frequent pattern extraction finds interesting information about the associations among the items in a transactional database. The notion of support is employed to extract the frequent patterns. Normally, a frequent pattern may contain items which belong to different categories of a particular domain. Existing approaches do not consider the notion of diversity while extracting the frequent patterns. For certain types of applications, it may be useful to distinguish between frequent patterns with items belonging to different categories and frequent patterns with items belonging to the same category. The major issue with frequent pattern mining is the generation of a huge number of patterns, many of which may be insignificant depending on the application or user requirement. In this connection, researchers have made efforts to mine constraint-based and user-interest-based frequent patterns using measures such as closed, maximal, periodic and top-k. Many algorithms have been proposed in the last few years to find such sets of patterns [1], and most of these algorithms perform some kind of greedy or local search, differing widely in the heuristics and search orders used. Constraint programming methods in a declarative framework [4], [6] have achieved significant success, but these algorithms perform very poorly on large datasets and require a huge amount of time, whereas local search methods have been very effective at finding satisfactory results efficiently.
In this paper, we propose a new interestingness measure by exploiting the fact that items in a frequent pattern may belong to different categories of a particular domain. The measure is an XOR-based dispersion score, obtained by analyzing the extent to which the items in the patterns belong to different categories. We investigate the possibilities for discovering diverse pattern sets, finding a small set of patterns within a short period of time using a genetic algorithm, even on large datasets, with minor modifications to the search technique. Given a set of transactions and a set of patterns in the dispersion-score setup, the genetic algorithm selects a small set of diverse patterns. Our genetic algorithm has several novel components, such as a relative encoding technique learned from the structures in the dataset, a twin removal technique to remove identical and redundant individuals in the population, and a random restart technique to avoid stagnation. We compared its performance with several other algorithms such as random walk, hill climbing and large neighborhood search. The key contributions of the paper are as follows:
• Perform a comparative analysis between various types of local search algorithms and analyze their relative strengths compared with each other.
• Demonstrate the overall strength of a genetic algorithm with novel components for finding small sets of diverse-frequent patterns.
The paper is organized as follows: Section II presents all the necessary definitions to understand the paper; Section III reviews related work; Section IV explains the algorithms used; Section V discusses and analyzes the experimental results; and Section VI presents our conclusions with a discussion and possible outlines for future work.
II. PRELIMINARIES
A. Pattern Constraints
In this section, we explain some concepts needed to understand the diverse pattern set mining problem. The notation is adopted from Guns et al. [6] and Khatun et al. [8].
TABLE I: A small dataset containing five items and six transactions.

| Transaction Id | ItemSet | A | B | C | D | E |
|----------------|---------|---|---|---|---|---|
| t1             | {A,B,D} | 1 | 1 | 0 | 1 | 0 |
| t2             | {B,C}   | 0 | 1 | 1 | 0 | 0 |
| t3             | {A,D}   | 1 | 0 | 0 | 1 | 0 |
| t4             | {A,C,D} | 1 | 0 | 1 | 1 | 0 |
| t5             | {B,C,D} | 0 | 1 | 1 | 1 | 0 |
| t6             | {C,D,E} | 0 | 0 | 1 | 1 | 1 |
We assume that we are given a set of items $\mathcal{I}$ and a database $D$ of transactions $\mathcal{T}$, in which all elements are either 0 or 1. The process of finding the set of patterns which satisfy all of the constraints is called pattern set mining. A pattern is a pair of variables $(I, T)$, where $I$ represents an itemset $I \subseteq \mathcal{I}$ and $T$ represents a transaction set $T \subseteq \mathcal{T}$, represented by means of Boolean variables $I_i$ and $T_t$ for every item $i \in \mathcal{I}$ and every transaction $t \in \mathcal{T}$.

The itemsets (or pattern sets) and the transaction sets are generally represented by binary vectors. The coverage $\varphi_D(I)$ of an itemset $I$ consists of all transactions in which the itemset occurs:
$$\varphi_D(I) = \{ t \in \mathcal{T} \mid \forall i \in I : D_{ti} = 1 \}$$
For example, consider the small dataset presented in Table I. The itemset $I = \{B, C\}$ is represented as ⟨0, 1, 1, 0, 0⟩, and its coverage is $\varphi_D(I) = \{t_2, t_5\}$, represented by ⟨0, 1, 0, 0, 1, 0⟩. The support of an itemset is the size of its coverage set, $Support_D(I) = |\varphi_D(I)|$; here, $Support_D(I) = 2$.
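To make the representation concrete, here is a minimal Java sketch (our own illustration; class and method names are hypothetical, not taken from the paper's implementation) that computes the coverage vector and support of an itemset over the binary matrix of Table I:

```java
import java.util.Arrays;

// Minimal sketch (hypothetical names): coverage and support over a binary
// transaction database D[t][i], where D[t][i] = 1 iff transaction t contains item i.
public class CoverageDemo {

    // Coverage vector: cover[t] is true iff every item of the itemset occurs in transaction t.
    static boolean[] coverage(int[][] D, boolean[] itemset) {
        boolean[] cover = new boolean[D.length];
        for (int t = 0; t < D.length; t++) {
            cover[t] = true;
            for (int i = 0; i < itemset.length; i++) {
                if (itemset[i] && D[t][i] == 0) { cover[t] = false; break; }
            }
        }
        return cover;
    }

    // Support = size of the coverage set.
    static int support(boolean[] cover) {
        int s = 0;
        for (boolean c : cover) if (c) s++;
        return s;
    }

    public static void main(String[] args) {
        int[][] D = {           // Table I, items ordered A, B, C, D, E
            {1, 1, 0, 1, 0},    // t1 = {A,B,D}
            {0, 1, 1, 0, 0},    // t2 = {B,C}
            {1, 0, 0, 1, 0},    // t3 = {A,D}
            {1, 0, 1, 1, 0},    // t4 = {A,C,D}
            {0, 1, 1, 1, 0},    // t5 = {B,C,D}
            {0, 0, 1, 1, 1}     // t6 = {C,D,E}
        };
        boolean[] I = {false, true, true, false, false}; // I = {B,C}
        boolean[] cover = coverage(D, I);                // covers t2 and t5
        System.out.println(Arrays.toString(cover) + ", support = " + support(cover));
    }
}
```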
The dispersion score scores a frequent pattern set based on the categories of the items within it. For example, suppose the pattern set contains three itemsets $I_1 = \{B, C\}$, $I_2 = \{C, D\}$ and $I_3 = \{E\}$, so the pattern set size is $k = 3$. Their coverages are $\varphi_D(I_1)$ = ⟨0, 1, 0, 0, 1, 0⟩, $\varphi_D(I_2)$ = ⟨0, 0, 0, 1, 1, 1⟩ and $\varphi_D(I_3)$ = ⟨0, 0, 0, 0, 0, 1⟩, respectively. XOR-ing the coverages pairwise and summing the bits of each result gives
$\varphi_D(I_1) \oplus \varphi_D(I_2)$ = ⟨0, 1, 0, 1, 0, 1⟩, which sums to 3,
$\varphi_D(I_1) \oplus \varphi_D(I_3)$ = ⟨0, 1, 0, 0, 1, 1⟩, which sums to 3, and
$\varphi_D(I_2) \oplus \varphi_D(I_3)$ = ⟨0, 0, 0, 1, 1, 0⟩, which sums to 2.
The resulting dispersion score is 3 + 3 + 2 = 8.
B. Pattern Set Constraints

In pattern set mining, we are interested in finding k-pattern sets [5]. A k-pattern set $\Pi$ is a set of $k$ tuples, each of type $\langle I^p, T^p \rangle$. The pattern set is formally defined as follows:
$$\Pi = \{\pi_1, \cdots, \pi_k\}, \quad \text{where } \forall p = 1, \cdots, k : \pi_p = \langle I^p, T^p \rangle$$
Diverse pattern sets: In pattern set mining, highly similar transaction sets can be found, which can be undesirable. To avoid this, many measures can be used to quantify the similarity between two patterns, such as the dispersion score [11]:
$$dispersion(T^i, T^j) = \sum_{t \in \mathcal{T}} (2T^i_t - 1)(2T^j_t - 1).$$
The term $(2T^i_t - 1)$ transforms a binary $\{0, 1\}$ variable into one in the range $\{-1, 1\}$. Computed this way, the dispersion score has a disadvantage: both when two patterns cover exactly the same transactions and when one pattern covers exactly the opposite transactions of the other, the score is maximized [6]. For example, if two patterns cover ⟨0, 1, 1, 0, 0, 1⟩ and ⟨1, 0, 0, 1, 1, 0⟩, or ⟨0, 1, 1, 0, 0, 1⟩ and ⟨0, 1, 1, 0, 0, 1⟩, the score is 6 in both cases. This is not desirable, because in the second case the score should be 0.
To address this issue, we define a new XOR-based dispersion score to calculate the diversity between two pattern sets, as shown below:
$$xorDispersion(T^i, T^j) = \sum_{t \in \mathcal{T}} T^i_t \oplus T^j_t.$$
Under this score, two identical coverages yield 0 while complementary coverages yield the maximum score $|\mathcal{T}|$, which matches the intended notion of diversity. To measure the diversity of a pattern set, we use the following expression, which is the objective function that we wish to maximize:
$$objDispersion = \sum_{i=1}^{k} \sum_{j=1}^{i-1} xorDispersion(T^i, T^j).$$
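As a sketch of how these definitions translate into code, the following Java fragment (hypothetical names; coverage vectors are assumed to be computed as in the coverage example above) evaluates xorDispersion and objDispersion, reproducing the worked example's score of 8:

```java
// Sketch of the XOR-based dispersion measures (hypothetical names).
public class DispersionDemo {

    // xorDispersion(T^i, T^j): number of transactions covered by exactly one of the two patterns.
    static int xorDispersion(boolean[] ti, boolean[] tj) {
        int score = 0;
        for (int t = 0; t < ti.length; t++) if (ti[t] ^ tj[t]) score++;
        return score;
    }

    // objDispersion: sum of xorDispersion over all unordered pairs of patterns.
    static int objDispersion(boolean[][] coverages) {
        int total = 0;
        for (int i = 0; i < coverages.length; i++)
            for (int j = 0; j < i; j++)
                total += xorDispersion(coverages[i], coverages[j]);
        return total;
    }

    public static void main(String[] args) {
        boolean[][] cov = { // coverages of I1={B,C}, I2={C,D}, I3={E} from Table I
            {false, true,  false, false, true,  false},
            {false, false, false, true,  true,  true},
            {false, false, false, false, false, true}
        };
        System.out.println(objDispersion(cov)); // prints 8 (= 3 + 3 + 2)
    }
}
```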
In the last few years, most algorithms for finding diverse-frequent patterns have struggled to produce good-quality solutions on large datasets within a short period of time. In this paper, to solve this problem, we propose a fast genetic algorithm with several novel components that works on large datasets.
III. RELATED WORK
In pattern set mining, finding patterns that are correlated [10], discriminative [12], contrast [5] and diverse [11] has become a promising task. Many algorithms have been proposed as general frameworks for pattern set mining [6], [4] in the last few years for discovering diverse pattern sets, and many languages have been developed, such as Zinc [9], Essence [3], Gecode [13] and Comet [6], [7]. To search and prune the solution space efficiently, most of these methods use exhaustive search, which takes a huge amount of time, and most of these algorithms perform poorly on large datasets.
In [4], k-pattern set mining tackled pattern mining directly at a global level rather than at a local one. Using the constraint programming (CP) framework, the researchers evaluated the feasibility of exhaustive search for pattern set mining. They proposed a one-step search method that is often more effective than a two-step method in which an exhaustive search is performed in both steps. CP uses exhaustive search, and the feasibility of the CP approach depends on the constraints involved. One open issue was whether other solvers can be used to solve the general k-pattern set mining problem given only its description in terms of constraints.
Guns et al. [6] investigated a technique that simplifies two pattern set mining tasks and their search strategies by putting them into a common declarative framework, in which large neighborhood search performed remarkably well. They limited their focus to exhaustive search without using the basic propagation principle of CP, considered only a limited number of pattern set mining tasks, and the algorithm they used only worked for small datasets. In a recent work, Khatun et al. [8] explored the use of genetic algorithms and other stochastic local search algorithms to solve the diverse pattern set mining problem on large and small datasets.
IV. OUR APPROACH
In this section, we first describe our proposed genetic algorithm with novel components for solving the diverse pattern set mining problem. We then describe the other algorithms that we implemented [8] in order to compare them with the GA.
A. Genetic Algorithm

Algorithm 1 geneticAlgorithm(int percentChange)
  p = populationSize
  P = generate p valid pattern sets
  Pb = {}
  while not timeout do
      Pm = simpleMutation(P)
      Pc = uniformCrossOver(P)
      P∗ = select best p individuals from (P ∪ Pm ∪ Pc)
      if stagnation then
          Π = findBest(P∗)
          Pb = Pb ∪ {Π}
          P∗ = changePopulation(percentChange, P∗)
      end if
      P = P∗
  end while
  Π∗ = findBest(Pb)
  return Π∗
Genetic algorithms are inspired by the natural selection process: the search improves from generation to generation of a population of individuals by means of mutation and crossover. We use the XOR operation to compute our objective score, as described in the preliminaries section. In Algorithm 1, we create two derived populations, Pm and Pc: Pm is created using mutation (shown in Algorithm 2) and Pc using crossover (shown in Algorithm 3). We then select the best individuals from P ∪ Pm ∪ Pc into P∗, whose size equals the population size, and iterate this procedure over many generations. If P∗ remains the same for at least 100 generations, we change P∗ using changePopulation (shown in Algorithm 4), so that the search does not get stuck in local optima; each time this happens, we save the diverse pattern set with the maximum value in Pb. We then copy the value of P∗ into P and obtain a new population in the next generation. We continue this procedure until timeout and finally return the best score from Pb. We describe the components of our GA (shown in Algorithm 1) in the following parts:
1) Objective function: To find the objective score of a pattern set, we calculate the coverage of each itemset, which yields one Boolean array per itemset. We then take all pairwise combinations of these arrays and calculate the XOR-based dispersion score for each combination.
2) Population initialization: We randomly generate p valid pattern sets and keep them in P. The itemsets must follow a particular structure to form a valid pattern set, so we use a constrained initialization for the representation to avoid invalid situations, since the datasets contain several mutually exclusive attributes that cannot be true at the same time.
3) Crossover technique: Using crossover (shown in Algorithm 3), we take two pattern sets from the population to create an offspring, and repeat this p times, where p is the population size, to obtain p offspring. We use uniform crossover: each item is randomly chosen from one of the two parent pattern sets and placed into the new pattern set, while making sure that no duplicate remains in the new population.
4) Mutation technique: In Algorithm 2, we create the new pattern sets of Pm using mutation, generating each one by flipping a single bit. In the datasets, the attributes are grouped exclusively, that is, exactly one bit is 1 in each group. We always keep this structural constraint satisfied while mutating, by making sure that no two bits are simultaneously on within the same group and that at least one bit is on (see the sketch after this list).
5) Twin removal: Our algorithm never allows twins in any population. Before entering any pattern set, if we find a twin, we reject it and create a new one, repeating until a distinct valid pattern set is found.
6) Handling stagnation: To avoid getting stuck in local optima, we use a random restart in our genetic algorithm. When the population does not change for a certain period, we restart the algorithm based on two parameters: when to restart, and how much of the population to change. changePopulation(percentChange, P) (shown in Algorithm 4) creates a new population, where P is the population to change and percentChange specifies how many patterns to change; for example, percentChange = 90 means 90% of the individuals are deleted and replaced by new ones. We experimented with different values of percentChange and found that percentChange = 90 consistently gives good results, as it keeps only the top 10% of scores and uses the remaining 90% to create the new population.
7) Population size: We checked the effect of the population size on the results using the tic-tac-toe dataset and found that it plays a pivotal role in generating results; we describe this in the analysis section.
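The sketch below illustrates items 4 and 5 in Java (our own illustration under the paper's stated group constraint; all class and method names are hypothetical). Mutation here is one reading of the paper's single-bit flip: it moves the single on-bit within a randomly chosen exclusive group, and the twin-removal loop retries until the mutant differs from every individual already accepted, mirroring the inner while loop of Algorithm 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch (hypothetical names): group-constrained mutation plus twin removal.
public class MutationDemo {
    static final Random RNG = new Random();

    // groups[g] lists the bit positions that form one exclusive attribute group.
    static boolean[] mutate(boolean[] individual, int[][] groups) {
        boolean[] child = individual.clone();
        int[] g = groups[RNG.nextInt(groups.length)]; // pick a random group
        for (int pos : g) child[pos] = false;         // clear the whole group...
        child[g[RNG.nextInt(g.length)]] = true;       // ...then set exactly one bit
        return child;
    }

    // Twin removal: keep mutating until the child matches nothing already accepted,
    // as in the inner while loop of Algorithm 2.
    static boolean[] mutateDistinct(boolean[] ind, int[][] groups, List<boolean[]> accepted) {
        boolean[] child = mutate(ind, groups);
        while (containsTwin(accepted, child)) child = mutate(ind, groups);
        return child;
    }

    static boolean containsTwin(List<boolean[]> pop, boolean[] x) {
        for (boolean[] p : pop) if (Arrays.equals(p, x)) return true;
        return false;
    }

    public static void main(String[] args) {
        int[][] groups = {{0, 1, 2}, {3, 4}};           // two toy exclusive groups
        boolean[] ind = {true, false, false, true, false};
        List<boolean[]> accepted = new ArrayList<>(List.of(ind));
        System.out.println(Arrays.toString(mutateDistinct(ind, groups, accepted)));
    }
}
```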
B. Large Neighborhood Search (LNS)

In large neighborhood search (LNS) (shown in Algorithm 5), we first create a valid pattern set and calculate its score. We then create its neighbors and find the best one. If the score of the best neighbor is greater than that of the current pattern set, we replace the current pattern set with that neighbor.
Algorithm 2 simpleMutation(PatternSets P)
  index = 0
  Pm = {}
  size = noOfPatternset(P)
  while index < size do
      Π = P[index]
      Πm = generate a valid neighbor of Π by flipping a single bit
      while Πm ∈ Pm do
          Πm = generate a valid neighbor of Π by flipping a single bit
      end while
      Pm = Pm ∪ {Πm}
      index++
  end while
  return Pm

Algorithm 3 crossOver(PatternSets P)
  index = 0
  Pc = {}
  size = noOfPatternset(P)
  while index < size do
      Πm = randomly take a pattern set from P
      Πf = randomly take a pattern set from P
      Πo = uniformCrossOver(Πm, Πf)
      while Πo ∈ Pc do
          Πo = uniformCrossOver(Πm, Πf)
      end while
      Pc = Pc ∪ {Πo}
      index++
  end while
  return Pc

Algorithm 4 changePopulation(perChange, PatternSets P)
  noOfChange = (perChange × sizeOf(P)) / 100
  remove the lowest-scoring noOfChange Π from P
  i = 1
  while i ≤ noOfChange do
      Π = randomly create a valid pattern set with k items
      while Π ∈ P do
          Π = randomly create a valid pattern set with k items
      end while
      P = P ∪ {Π}
      i++
  end while
  return P

Algorithm 5 largeNeighbourhoodSearch()
  noOfBitToChange = 1
  Π = randomly create a valid pattern set with k items
  while not timeout do
      P = create 2^noOfBitToChange neighbours of Π
      Π∗ = find best individual from P
      if getObjectiveScore(Π∗) > getObjectiveScore(Π) then
          Π = Π∗
      end if
      if Π remains the same for 100 iterations then
          noOfBitToChange++
      end if
  end while
  return Π
In our implementation, the number of neighbors created for a pattern set is 2^n, where n = noOfBitToChange. When generating neighbors, we first create 2^1 neighbors with n = 1. If this does not give good results for 100 iterations, we increment n by 1, and we repeat this whenever LNS is stuck for 100 iterations. To create the neighbors of a pattern set, we randomly choose an itemset from that pattern set and then randomly choose an item from that itemset, doing this n times, since each item is represented by a Boolean value. So if we create all possible neighbors for three items, the number of neighbors is 2^3; for n items, it is 2^n, as sketched below.
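The following Java sketch (hypothetical names; validity checking against the exclusive-group constraint is omitted, as it would be applied afterwards) shows one way to enumerate the 2^n neighbours for n chosen bit positions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch (hypothetical names): enumerate the 2^n LNS neighbours obtained by
// assigning every on/off combination to n chosen bit positions.
public class NeighbourDemo {

    static List<boolean[]> neighbours(boolean[] pattern, int[] positions) {
        int n = positions.length;
        List<boolean[]> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] nb = pattern.clone();
            for (int b = 0; b < n; b++) {
                nb[positions[b]] = ((mask >> b) & 1) == 1; // bit b of mask sets position b
            }
            result.add(nb);
        }
        return result; // 2^n candidate neighbours, including the pattern itself
    }

    public static void main(String[] args) {
        boolean[] p = {true, false, false, true, false};
        for (boolean[] nb : neighbours(p, new int[]{1, 3})) // n = 2 gives 4 neighbours
            System.out.println(Arrays.toString(nb));
    }
}
```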
Algorithm 6 hillClimbing()
  Π∗ = randomly create a valid pattern set with k items
  bestScore = getObjectiveScore(Π∗)
  while not timeout do
      Π = generate a valid neighbor from Π∗
      currentScore = getObjectiveScore(Π)
      if currentScore > bestScore then
          Π∗ = Π
          bestScore = currentScore
      end if
  end while
  return Π∗
C. Hill Climbing with Single Neighbor

For hill climbing (shown in Algorithm 6), we create a valid pattern set Π∗ and start a loop that runs for 1 minute. In each iteration we create a single neighbor Π of Π∗. If this new neighbor scores higher than Π∗, we copy the value of the new neighbor into Π∗ and then create a new neighbor of Π∗. The cycle goes on until the time is up.
D. Random Walk

In random walk (shown in Algorithm 7), we create a valid pattern set Π, then create another pattern set Π∗ and copy the value of Π into it. We then start a loop that runs for 1 minute. In each iteration we change Π by creating a new valid pattern set and compare its value with Π∗; if the score of Π is greater, we copy Π into Π∗, and then change Π again by creating another random pattern set. This procedure runs for 1 minute, after which we take the score of Π∗.
V. EXPERIMENTAL RESULTS

We implemented all algorithms in the Java language and ran our experiments on an Intel Core i3 2.27 GHz machine with 4 GB RAM running 64-bit Windows 7 Home Premium.
TABLE II: Description of datasets.

| Data Set      | Items | Transactions |
|---------------|-------|--------------|
| Tic-tac-toe   | 27    | 958          |
| Primary-tumor | 31    | 336          |
| Soybean       | 50    | 630          |
| Hypothyroid   | 88    | 3247         |
| Mushroom      | 119   | 8124         |
A. Dataset

The datasets that we use are taken from the UCI Machine Learning Repository [2] and were originally used in [6]. They can be downloaded freely from https://dtai.cs.kuleuven.be/CP4IM/datasets/ and are listed in Table II with their properties.
B. Results

We implemented four algorithms and tested them with various population sizes. We calculated the objective score of each algorithm for k-pattern sets with k = 2, 3, 6, 9, 10 and population size 100. For each algorithm, we used the five datasets whose transaction and item counts are given in Table II. We collected the scores by running all algorithms for 1 minute. For each test case, we ran the code five times and took the best score and the average score, which are given in Table III. For other population sizes, the performance of all algorithms remains the same.
Algorithm 7 randomWalk()
  bestScore = −∞
  Π∗ = ∅
  while not timeout do
      Π = randomly create a valid pattern set with k items
      currentScore = getObjectiveScore(Π)
      if currentScore > bestScore then
          Π∗ = Π
          bestScore = currentScore
      end if
  end while
  return Π∗

Fig. 1: Search progress of genetic algorithm for the tic-tac-toe dataset with pattern size k = 6. (a) Average; (b) Best. [Both panels plot the objective score against population sizes ranging from 10 to 2000.]
C. Analysis

From Table III, we find that the genetic algorithm almost always performs better than the other algorithms; in a few cases, LNS performs better than the genetic algorithm, while the performance of random walk and hill climbing is poor. We found that as the pattern set size increases, GA tends to work better compared to the other algorithms: when k = 2 or k = 3, LNS and random walk give the best values, but when k = 9 or k = 10, GA gives the highest values. We also found that when the number of itemsets becomes too large, the genetic algorithm performs poorly; thus, too small or too large a population size gives bad results because the calculation becomes too expensive. The genetic algorithm works best when changing 90% of the population using random restart.

Fig. 1 shows the performance of GA with respect to population size for the tic-tac-toe dataset. We tested our GA with different population sizes from 10 to 2000; for each population size, we ran the code five times and took the best and the average objective score. In Fig. 1(a), GA gives the best results when the population size is between 40 and 500; beyond 500, the objective score decreases. In Fig. 1(b), GA gives the best results when the population size is between 10 and 1000, but when it exceeds 1000, the objective score decreases. So the genetic algorithm works better with a large population size, but when the population becomes too big, it does not perform well in the allocated time because the calculations become too expensive.

Fig. 2 shows the performance of the search algorithms based on their average objective scores, shown as vertical bars. We ran all the algorithms for 1 minute on all the datasets with different pattern set sizes, k = 2, 3, 6, 9, 10, and population size 100. We found that the genetic algorithm always gives good results compared to the other algorithms, and sometimes LNS gives results nearly as good as the genetic algorithm.
TABLE III: Objective score achieved by different algorithms for various datasets with different sizes of pattern sets k (average and best over five 1-minute runs, population size 100).

| Data set      | k  | Random Walk Avg. | Random Walk Best | Hill Climbing Avg. | Hill Climbing Best | LNS Avg. | LNS Best | Genetic Algorithm Avg. | Genetic Algorithm Best |
|---------------|----|------------------|------------------|--------------------|--------------------|----------|----------|------------------------|------------------------|
| Tic-tac-toe   | 2  | 771              | 798              | 516.8              | 753                | 762      | 798      | 798                    | 798                    |
| Tic-tac-toe   | 3  | 1491.4           | 1690             | 1432.2             | 1593               | 1825.6   | 1916     | 1916                   | 1916                   |
| Tic-tac-toe   | 6  | 5355             | 5380             | 7004.4             | 7653               | 7758     | 7791     | 7938                   | 7938                   |
| Tic-tac-toe   | 9  | 17517.6          | 18224            | 15977.6            | 16972              | 18097.6  | 17858    | 18458.4                | 18624                  |
| Tic-tac-toe   | 10 | 11393.8          | 12764            | 19963              | 21496              | 22235.2  | 22748    | 22731.4                | 22816                  |
| Mushroom      | 2  | 3388             | 4936             | 0                  | 0                  | 1362.4   | 6812     | 8124                   | 8124                   |
| Mushroom      | 3  | 6889.6           | 14576            | 3249.6             | 16248              | 2070.4   | 10352    | 16248                  | 16248                  |
| Mushroom      | 6  | 27260            | 37440            | 0                  | 0                  | 0        | 0        | 58734                  | 64992                  |
| Mushroom      | 9  | 33955.2          | 43216            | 20960              | 63392              | 0        | 0        | 103932                 | 142452                 |
| Mushroom      | 10 | 34117.2          | 46584            | 28868.4            | 73116              | 0        | 0        | 107529.6               | 130944                 |
| Hypothyroid   | 2  | 439.6            | 562              | 324.4              | 1622               | 649.4    | 3247     | 2736.4                 | 3247                   |
| Hypothyroid   | 3  | 937.2            | 1484             | 0                  | 0                  | 0        | 0        | 5876                   | 6494                   |
| Hypothyroid   | 6  | 2277             | 3405             | 0                  | 0                  | 0        | 0        | 12549.4                | 16325                  |
| Hypothyroid   | 9  | 3732.8           | 5864             | 0                  | 0                  | 5193.6   | 25968    | 24234.8                | 27556                  |
| Hypothyroid   | 10 | 5916.6           | 9333             | 11689.2            | 29223              | 0        | 0        | 17629.8                | 21726                  |
| Soybean       | 2  | 624              | 624              | 0                  | 0                  | 374.5    | 624      | 630                    | 630                    |
| Soybean       | 3  | 1242.4           | 1248             | 260.4              | 1136               | 1168.8   | 1248     | 1260                   | 1260                   |
| Soybean       | 6  | 3155             | 3438             | 3304.2             | 5076               | 4023.8   | 4992     | 5642.8                 | 5664                   |
| Soybean       | 9  | 5246.8           | 5778             | 3770               | 5634               | 11113.6  | 12568    | 12547.2                | 12598                  |
| Soybean       | 10 | 6409             | 7597             | 9406.2             | 12000              | 7653.8   | 12090    | 15531.2                | 15696                  |
| Primary-tumor | 2  | 326.4            | 329              | 238                | 336                | 334.6    | 336      | 336                    | 336                    |
| Primary-tumor | 3  | 647.6            | 658              | 540.4              | 672                | 672      | 672      | 672                    | 672                    |
| Primary-tumor | 6  | 2115.8           | 2453             | 2944               | 3017               | 3001.4   | 3018     | 3013.6                 | 3024                   |
| Primary-tumor | 9  | 3833.2           | 4372             | 6616.4             | 6710               | 6682     | 6712     | 6715.2                 | 6720                   |
| Primary-tumor | 10 | 4539             | 4897             | 7576.2             | 8336               | 8343.4   | 8393     | 8351.4                 | 8376                   |

Fig. 2: Bar diagram showing comparison of average objective score achieved by different algorithms for k = 2, 3, 6, 9, 10. Panels: (a) Tic-Tac-Toe, (b) Mushroom, (c) Soybean, (d) Hypothyroid, (e) Primary-tumor. [Each panel plots the average objective score (vertical bars) of Random Walk, Hill Climbing, LNS and Genetic Algorithm against pattern set size k.]
For the mushroom and hypothyroid datasets, in a few cases the objective scores of LNS and hill climbing are zero because the number of items in these datasets (shown in Table II) is too large.
In Fig. 3, we show the performance of the different search algorithms for the tic-tac-toe dataset with pattern set size 6 and population size 100; the figure plots the average objective scores of the search algorithms over time. From the figure, we find that random walk performs poorly as usual, while hill climbing improves very quickly using a single neighbor, and LNS performs very well, coming close to the genetic algorithm. The genetic algorithm always gives the best result.
Fig. 3: Comparison of average objective score achieved by different algorithms for the tic-tac-toe dataset with pattern size k = 6. [The figure plots objective score against time (5–40 seconds) for Random Walk, LNS, Hill Climbing and Genetic Algorithm.]
VI. CONCLUSION
In this paper, we presented a new genetic algorithm that combines three different enhancement techniques: i) a relative encoding technique; ii) a twin removal technique; and iii) a random restart based stagnation recovery technique. We compared our results with state-of-the-art local search algorithms and found that our final GA, which uses a combination of all three enhancements, significantly outperforms the current local search approaches. The genetic algorithm almost always gives good results within a very short period of time compared to the other algorithms.
We also proposed an interestingness measure, an XOR-based dispersion score, obtained by analyzing the extent to which the items in the patterns belong to different categories, and we compared the different search strategies for the dispersion score on the pattern set mining tasks. It remains to be seen to how many other tasks, for example concept learning, this observation extends. Similarly, it remains to be seen how many pattern set mining tasks can be modelled in terms of constraints, for example learning decision lists. Finally, we restricted this study to pattern set mining; we believe there is a huge opportunity for general declarative tools for data mining and machine learning at large.
In the future, we would like to improve the performance of the search techniques of the GA for large population sizes within the GA framework, by designing a new genetic operator and applying it in a similar way to crossover and mutation.
Shanjida Khatun received her B.Sc. and M.Sc. degrees, both in Computer Science and Engineering, from Ahsanullah University of Science and Technology in June 2012 and United International University in September 2015, respectively. Since October 2012, she has been with the Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, as a Lecturer. Her research interests include artificial intelligence, data mining, meta-heuristic search, bioinformatics and computational biology.
Swakkhar Shatabda received his B.Sc. degree in Computer Science and Engineering in 2007 from Bangladesh University of Engineering and Technology (BUET) and the Ph.D. degree in Bioinformatics and Computational Biology in 2014 from Griffith University, Australia. He also worked as a graduate researcher at National ICT Australia (NICTA) from 2010 until 2014. He is currently an Assistant Professor and Undergraduate Program Coordinator in the Department of Computer Science and Engineering of United International University, Bangladesh. His research interests include bioinformatics, protein fold and structural class prediction problems, protein structure and function prediction problems, data mining, statistical learning theory, pattern recognition, graph theory, algorithms and machine learning.
REFERENCES
[1] B. Bringmann, S. Nijssen, N. Tatti, J. Vreeken, and A. Zimmerman, "Mining sets of patterns," Tutorial at ECML/PKDD, 2010.
[2] A. Frank, A. Asuncion et al., "UCI machine learning repository," 2010.
[3] A. M. Frisch, W. Harvey, C. Jefferson, B. Martínez-Hernández, and I. Miguel, "Essence: A constraint language for specifying combinatorial problems," Constraints, vol. 13, no. 3, pp. 268–306, 2008.
[4] T. Guns, S. Nijssen, and L. De Raedt, "Itemset mining: A constraint programming perspective," Artificial Intelligence, vol. 175, no. 12, pp. 1951–1983, 2011.
[5] T. Guns, S. Nijssen, and L. De Raedt, "k-pattern set mining under constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 2, pp. 402–418, 2013.
[6] T. Guns, S. Nijssen, A. Zimmermann, and L. De Raedt, "Declarative heuristic search for pattern set mining," in 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW). IEEE, 2011, pp. 1104–1111.
[7] P. V. Hentenryck and L. Michel, Constraint-Based Local Search. The MIT Press, 2009.
[8] S. Khatun, H. U. Alam, and S. Shatabda, "An efficient genetic algorithm for discovering diverse-frequent patterns," 2015, pp. 120–126.
[9] K. Marriott, N. Nethercote, R. Rafeh, P. J. Stuckey, M. G. De La Banda, and M. Wallace, "The design of the Zinc modelling language," Constraints, vol. 13, no. 3, pp. 229–267, 2008.
[10] F. Rossi, P. Van Beek, and T. Walsh, Handbook of Constraint Programming. Elsevier, 2006.
[11] U. Rückert and S. Kramer, "Optimizing feature sets for structured data," in Machine Learning: ECML 2007. Springer, 2007, pp. 716–723.
[12] P. Shaw, "Using constraint programming and local search methods to solve vehicle routing problems," in Principles and Practice of Constraint Programming — CP98. Springer, 1998, pp. 417–431.
[13] Gecode Team, "Gecode: Generic constraint development environment," 2006.