Diverse Subgroup Set Discovery using a Novel
Genetic Algorithm
Shanjida Khatun, Swakkhar Shatabda
Abstract—When the search space is too large or a small set of patterns must be selected from a large dataset, exhaustive search techniques do not perform well. Large data is challenging for most existing discovery algorithms because many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets while ignoring many potentially interesting results. These problems are particularly apparent with pattern set discovery and its generalization, exceptional model mining. To address this, we deal with the discriminative, or diverse, pattern set mining problem. In this paper, we investigate an approach that uses a genetic algorithm to mine diverse sets of frequent patterns. We propose a fast genetic algorithm with several novel components that outperforms state-of-the-art methods on a standard set of benchmarks and is capable of producing satisfactory results within a very short period of time on both large and small datasets. Our proposed genetic algorithm uses a relative encoding scheme for the patterns, an effective twin removal technique to ensure diversity throughout the search, and a random restart technique to avoid getting stuck in local optima.
Index Terms—Pattern set mining; large neighborhood search; genetic algorithm.

Shanjida Khatun is with the Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh (e-mail: [email protected]). Swakkhar Shatabda is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh (e-mail: [email protected]).
I. INTRODUCTION
In the area of pattern set mining, the process of frequent pattern extraction finds interesting information about the associations among the items in a transactional database. The notion of support is employed to extract the frequent patterns. Normally, a frequent pattern may contain items which belong to different categories of a particular domain. Existing approaches do not consider the notion of diversity while extracting the frequent patterns. For certain types of applications, it may be useful to distinguish between frequent patterns with items belonging to different categories and frequent patterns with items belonging to the same category. The major issue with frequent pattern mining is the generation of a huge number of patterns, many of which may be insignificant depending on the application or user requirement. In this connection, researchers have made efforts to mine constraint-based and user-interest-based frequent patterns using measures such as closed, maximal, periodic and top-k. Many algorithms have been proposed in the last few years to find such sets of patterns [1], and most of these algorithms perform some kind of greedy or local search, differing widely in the heuristics and search orders used. Constraint programming methods in a declarative framework [4], [6] have achieved significant success, but these algorithms perform very poorly on large datasets and require a huge amount of time, whereas local search methods have been very effective at finding satisfactory results efficiently.
In this paper, we propose a new interestingness measure by exploiting the fact that items in a frequent pattern may belong to different categories of a particular domain. The measure is an XOR-based dispersion score, obtained by analyzing the extent to which the items in the patterns belong to different categories. We investigate the possibilities for discovering diverse pattern sets, finding a small set of patterns within a short period of time using a genetic algorithm, even on large datasets, with minor modifications to the search technique. Given a set of transactions and a set of patterns in the dispersion-score setup, the genetic algorithm selects a small set of diverse patterns. Our genetic algorithm has several novel components, such as a relative encoding technique learned from the structures in the dataset, a twin removal technique to remove identical and redundant individuals in the population, and a random restart technique to avoid stagnation. We compared its performance with several other algorithms such as random walk, hill climbing and large neighborhood search. The key contributions of the paper are as follows:
• Perform a comparative analysis between various types of local search algorithms and analyze their relative strengths compared with each other.
• Demonstrate the overall strength of a genetic algorithm with novel components for finding small sets of diverse-frequent patterns.
The paper is organized as follows: Section II presents all the necessary definitions to understand the paper; Section III reviews related work; Section IV explains the algorithms used; Section V discusses and analyzes the experimental results; and Section VI presents our conclusions with a discussion and possible outlines for future work.
II. PRELIMINARIES
A. Pattern Constraints
In this section, we explain some concepts needed to understand the diverse pattern set mining problem. The notation is adopted from Guns et al. [6] and Khatun et al. [8].
TABLE I: A small dataset containing five items and six transactions.

| Transaction Id | ItemSet | A | B | C | D | E |
|----------------|---------|---|---|---|---|---|
| t1             | {A,B,D} | 1 | 1 | 0 | 1 | 0 |
| t2             | {B,C}   | 0 | 1 | 1 | 0 | 0 |
| t3             | {A,D}   | 1 | 0 | 0 | 1 | 0 |
| t4             | {A,C,D} | 1 | 0 | 1 | 1 | 0 |
| t5             | {B,C,D} | 0 | 1 | 1 | 1 | 0 |
| t6             | {C,D,E} | 0 | 0 | 1 | 1 | 1 |
We assume that we are given a set of items $\mathcal{I}$ and a database $D$ of transactions $\mathcal{T}$, in which all elements are either 0 or 1. The process of finding the set of patterns which satisfy all of the constraints is called pattern set mining. A pattern is a pair of variables $(I, T)$, where $I$ represents an itemset $I \subseteq \mathcal{I}$ and $T$ represents a transaction set $T \subseteq \mathcal{T}$, represented by means of Boolean variables $I_i$ and $T_t$ for every item $i \in \mathcal{I}$ and every transaction $t \in \mathcal{T}$.

The itemsets (or pattern sets) and the transaction sets are generally represented by binary vectors. The coverage $\varphi_D(I)$ of an itemset $I$ consists of all transactions in which the itemset occurs:
$$\varphi_D(I) = \{ t \in \mathcal{T} \mid \forall i \in I : D_{ti} = 1 \}$$
For example, consider the small dataset presented in Table I. The itemset $I = \{B, C\}$ is represented as ⟨0, 1, 1, 0, 0⟩, and its coverage is $\varphi_D(I) = \{t_2, t_5\}$, represented by ⟨0, 1, 0, 0, 1, 0⟩. The support of an itemset is the size of its coverage set, $Support_D(I) = |\varphi_D(I)|$; here, $Support_D(I) = 2$.
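To make the representation concrete, here is a minimal Java sketch (our own illustration; class and method names are hypothetical, not taken from the paper's implementation) that computes the coverage vector and support of an itemset over the binary matrix of Table I:

```java
import java.util.Arrays;

// Minimal sketch (hypothetical names): coverage and support over a binary
// transaction database D[t][i], where D[t][i] = 1 iff transaction t contains item i.
public class CoverageDemo {

    // Coverage vector: cover[t] is true iff every item of the itemset occurs in transaction t.
    static boolean[] coverage(int[][] D, boolean[] itemset) {
        boolean[] cover = new boolean[D.length];
        for (int t = 0; t < D.length; t++) {
            cover[t] = true;
            for (int i = 0; i < itemset.length; i++) {
                if (itemset[i] && D[t][i] == 0) { cover[t] = false; break; }
            }
        }
        return cover;
    }

    // Support = size of the coverage set.
    static int support(boolean[] cover) {
        int s = 0;
        for (boolean c : cover) if (c) s++;
        return s;
    }

    public static void main(String[] args) {
        int[][] D = {           // Table I, items ordered A, B, C, D, E
            {1, 1, 0, 1, 0},    // t1 = {A,B,D}
            {0, 1, 1, 0, 0},    // t2 = {B,C}
            {1, 0, 0, 1, 0},    // t3 = {A,D}
            {1, 0, 1, 1, 0},    // t4 = {A,C,D}
            {0, 1, 1, 1, 0},    // t5 = {B,C,D}
            {0, 0, 1, 1, 1}     // t6 = {C,D,E}
        };
        boolean[] I = {false, true, true, false, false}; // I = {B,C}
        boolean[] cover = coverage(D, I);                // covers t2 and t5
        System.out.println(Arrays.toString(cover) + ", support = " + support(cover));
    }
}
```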
The dispersion score scores a frequent pattern set based on the categories of the items within it. For example, suppose the pattern set contains three itemsets $I_1 = \{B, C\}$, $I_2 = \{C, D\}$ and $I_3 = \{E\}$, so the pattern set size is $k = 3$. Their coverages are $\varphi_D(I_1)$ = ⟨0, 1, 0, 0, 1, 0⟩, $\varphi_D(I_2)$ = ⟨0, 0, 0, 1, 1, 1⟩ and $\varphi_D(I_3)$ = ⟨0, 0, 0, 0, 0, 1⟩, respectively. XOR-ing the coverages pairwise and summing the bits of each result gives
$\varphi_D(I_1) \oplus \varphi_D(I_2)$ = ⟨0, 1, 0, 1, 0, 1⟩, which sums to 3,
$\varphi_D(I_1) \oplus \varphi_D(I_3)$ = ⟨0, 1, 0, 0, 1, 1⟩, which sums to 3, and
$\varphi_D(I_2) \oplus \varphi_D(I_3)$ = ⟨0, 0, 0, 1, 1, 0⟩, which sums to 2.
The resulting dispersion score is 3 + 3 + 2 = 8.
B. Pattern Set Constraints

In pattern set mining, we are interested in finding k-pattern sets [5]. A k-pattern set $\Pi$ is a set of $k$ tuples, each of type $\langle I^p, T^p \rangle$. The pattern set is formally defined as follows:
$$\Pi = \{\pi_1, \cdots, \pi_k\}, \quad \text{where } \forall p = 1, \cdots, k : \pi_p = \langle I^p, T^p \rangle$$
Diverse pattern sets: In pattern set mining, highly similar transaction sets can be found, which can be undesirable. To avoid this, many measures can be used to quantify the similarity between two patterns, such as the dispersion score [11]:
$$dispersion(T^i, T^j) = \sum_{t \in \mathcal{T}} (2T^i_t - 1)(2T^j_t - 1).$$
The term $(2T^i_t - 1)$ transforms a binary $\{0, 1\}$ variable into one in the range $\{-1, 1\}$. Computed this way, the dispersion score has a disadvantage: both when two patterns cover exactly the same transactions and when one pattern covers exactly the opposite transactions of the other, the score is maximized [6]. For example, if two patterns cover ⟨0, 1, 1, 0, 0, 1⟩ and ⟨1, 0, 0, 1, 1, 0⟩, or ⟨0, 1, 1, 0, 0, 1⟩ and ⟨0, 1, 1, 0, 0, 1⟩, the score is 6 in both cases. This is not desirable, because in the second case the score should be 0.
To address this issue, we define a new XOR-based dispersion score to calculate the diversity between two pattern sets, as shown below:
$$xorDispersion(T^i, T^j) = \sum_{t \in \mathcal{T}} T^i_t \oplus T^j_t.$$
Under this score, two identical coverages yield 0 while complementary coverages yield the maximum score $|\mathcal{T}|$, which matches the intended notion of diversity. To measure the diversity of a pattern set, we use the following expression, which is the objective function that we wish to maximize:
$$objDispersion = \sum_{i=1}^{k} \sum_{j=1}^{i-1} xorDispersion(T^i, T^j).$$
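As a sketch of how these definitions translate into code, the following Java fragment (hypothetical names; coverage vectors are assumed to be computed as in the coverage example above) evaluates xorDispersion and objDispersion, reproducing the worked example's score of 8:

```java
// Sketch of the XOR-based dispersion measures (hypothetical names).
public class DispersionDemo {

    // xorDispersion(T^i, T^j): number of transactions covered by exactly one of the two patterns.
    static int xorDispersion(boolean[] ti, boolean[] tj) {
        int score = 0;
        for (int t = 0; t < ti.length; t++) if (ti[t] ^ tj[t]) score++;
        return score;
    }

    // objDispersion: sum of xorDispersion over all unordered pairs of patterns.
    static int objDispersion(boolean[][] coverages) {
        int total = 0;
        for (int i = 0; i < coverages.length; i++)
            for (int j = 0; j < i; j++)
                total += xorDispersion(coverages[i], coverages[j]);
        return total;
    }

    public static void main(String[] args) {
        boolean[][] cov = { // coverages of I1={B,C}, I2={C,D}, I3={E} from Table I
            {false, true,  false, false, true,  false},
            {false, false, false, true,  true,  true},
            {false, false, false, false, false, true}
        };
        System.out.println(objDispersion(cov)); // prints 8 (= 3 + 3 + 2)
    }
}
```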
In the last few years, most algorithms for finding diverse-frequent patterns have struggled to produce good-quality solutions on large datasets within a short period of time. In this paper, to solve this problem, we propose a fast genetic algorithm with several novel components that works on large datasets.
III. RELATED WORK
In pattern set mining, finding patterns that are correlated [10], discriminative [12], contrast [5] and diverse [11] has become a promising task. Many algorithms have been proposed as general frameworks for pattern set mining [6], [4] in the last few years for discovering diverse pattern sets, and many languages have been developed, such as Zinc [9], Essence [3], Gecode [13] and Comet [6], [7]. To search and prune the solution space efficiently, most of these methods use exhaustive search, which takes a huge amount of time, and most of these algorithms perform poorly on large datasets.
In [4], k-pattern set mining tackled pattern mining directly at a global level rather than at a local one. Using the constraint programming (CP) framework, the researchers evaluated the feasibility of exhaustive search for pattern set mining. They proposed a one-step search method that is often more effective than a two-step method in which an exhaustive search is performed in both steps. CP uses exhaustive search, and the feasibility of the CP approach depends on the constraints involved. One open issue was whether other solvers can be used to solve the general k-pattern set mining problem given only its description in terms of constraints.
Guns et al. [6] investigated a technique that simplifies two pattern set mining tasks and their search strategies by putting them into a common declarative framework, in which large neighborhood search performed remarkably well. They limited their focus to exhaustive search without using the basic propagation principle of CP, considered only a limited number of pattern set mining tasks, and the algorithm they used only worked for small datasets. In a recent work, Khatun et al. [8] explored the use of genetic algorithms and other stochastic local search algorithms to solve the diverse pattern set mining problem on large and small datasets.
IV. OUR APPROACH
In this section, we first describe our proposed genetic algorithm with novel components for solving the diverse pattern set mining problem. We then describe the other algorithms that we implemented [8] in order to compare them with the GA.
A. Genetic Algorithm

Algorithm 1 geneticAlgorithm(int percentChange)
  p = populationSize
  P = generate p valid pattern sets
  Pb = {}
  while not timeout do
      Pm = simpleMutation(P)
      Pc = uniformCrossOver(P)
      P∗ = select best p individuals from (P ∪ Pm ∪ Pc)
      if stagnation then
          Π = findBest(P∗)
          Pb = Pb ∪ {Π}
          P∗ = changePopulation(percentChange, P∗)
      end if
      P = P∗
  end while
  Π∗ = findBest(Pb)
  return Π∗
Genetic algorithms are inspired by the natural selection process: the search improves from generation to generation of a population of individuals by means of mutation and crossover. We use the XOR operation to compute our objective score, as described in the preliminaries section. In Algorithm 1, we create two derived populations, Pm and Pc: Pm is created using mutation (shown in Algorithm 2) and Pc using crossover (shown in Algorithm 3). We then select the best individuals from P ∪ Pm ∪ Pc into P∗, whose size equals the population size, and iterate this procedure over many generations. If P∗ remains the same for at least 100 generations, we change P∗ using changePopulation (shown in Algorithm 4), so that the search does not get stuck in local optima; each time this happens, we save the diverse pattern set with the maximum value in Pb. We then copy the value of P∗ into P and obtain a new population in the next generation. We continue this procedure until timeout and finally return the best score from Pb. We describe the components of our GA (shown in Algorithm 1) in the following parts:
1) Objective function: To find the objective score of a pattern set, we calculate the coverage of each itemset, which yields one Boolean array per itemset. We then take all pairwise combinations of these arrays and calculate the XOR-based dispersion score for each combination.
2) Population initialization: We randomly generate p valid pattern sets and keep them in P. The itemsets must follow a particular structure to form a valid pattern set, so we use a constrained initialization for the representation to avoid invalid situations, since the datasets contain several mutually exclusive attributes that cannot be true at the same time.
3) Crossover technique: Using crossover (shown in Algorithm 3), we take two pattern sets from the population to create an offspring, and repeat this p times, where p is the population size, to obtain p offspring. We use uniform crossover: each item is randomly chosen from one of the two parent pattern sets and placed into the new pattern set, while making sure that no duplicate remains in the new population.
4) Mutation technique: In Algorithm 2, we create the new pattern sets of Pm using mutation, generating each one by flipping a single bit. In the datasets, the attributes are grouped exclusively, that is, exactly one bit is 1 in each group. We always keep this structural constraint satisfied while mutating, by making sure that no two bits are simultaneously on within the same group and that at least one bit is on (see the sketch after this list).
5) Twin removal: Our algorithm never allows twins in any population. Before entering any pattern set, if we find a twin, we reject it and create a new one, repeating until a distinct valid pattern set is found.
6) Handling stagnation: To avoid getting stuck in local optima, we use a random restart in our genetic algorithm. When the population does not change for a certain period, we restart the algorithm based on two parameters: when to restart, and how much of the population to change. changePopulation(percentChange, P) (shown in Algorithm 4) creates a new population, where P is the population to change and percentChange specifies how many patterns to change; for example, percentChange = 90 means 90% of the individuals are deleted and replaced by new ones. We experimented with different values of percentChange and found that percentChange = 90 consistently gives good results, as it keeps only the top 10% of scores and uses the remaining 90% to create the new population.
7) Population size: We checked the effect of the population size on the results using the tic-tac-toe dataset and found that it plays a pivotal role in generating results; we describe this in the analysis section.
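The sketch below illustrates items 4 and 5 in Java (our own illustration under the paper's stated group constraint; all class and method names are hypothetical). Mutation here is one reading of the paper's single-bit flip: it moves the single on-bit within a randomly chosen exclusive group, and the twin-removal loop retries until the mutant differs from every individual already accepted, mirroring the inner while loop of Algorithm 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Sketch (hypothetical names): group-constrained mutation plus twin removal.
public class MutationDemo {
    static final Random RNG = new Random();

    // groups[g] lists the bit positions that form one exclusive attribute group.
    static boolean[] mutate(boolean[] individual, int[][] groups) {
        boolean[] child = individual.clone();
        int[] g = groups[RNG.nextInt(groups.length)]; // pick a random group
        for (int pos : g) child[pos] = false;         // clear the whole group...
        child[g[RNG.nextInt(g.length)]] = true;       // ...then set exactly one bit
        return child;
    }

    // Twin removal: keep mutating until the child matches nothing already accepted,
    // as in the inner while loop of Algorithm 2.
    static boolean[] mutateDistinct(boolean[] ind, int[][] groups, List<boolean[]> accepted) {
        boolean[] child = mutate(ind, groups);
        while (containsTwin(accepted, child)) child = mutate(ind, groups);
        return child;
    }

    static boolean containsTwin(List<boolean[]> pop, boolean[] x) {
        for (boolean[] p : pop) if (Arrays.equals(p, x)) return true;
        return false;
    }

    public static void main(String[] args) {
        int[][] groups = {{0, 1, 2}, {3, 4}};           // two toy exclusive groups
        boolean[] ind = {true, false, false, true, false};
        List<boolean[]> accepted = new ArrayList<>(List.of(ind));
        System.out.println(Arrays.toString(mutateDistinct(ind, groups, accepted)));
    }
}
```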
B. Large Neighborhood Search (LNS)

In large neighborhood search (LNS) (shown in Algorithm 5), we first create a valid pattern set and calculate its score. We then create its neighbors and find the best one. If the score of the best neighbor is greater than that of the current pattern set, we replace the current pattern set with that neighbor.
Algorithm 2 simpleMutation(PatternSets P)
  index = 0
  Pm = {}
  size = noOfPatternset(P)
  while index < size do
      Π = P[index]
      Πm = generate a valid neighbor of Π by flipping a single bit
      while Πm ∈ Pm do
          Πm = generate a valid neighbor of Π by flipping a single bit
      end while
      Pm = Pm ∪ {Πm}
      index++
  end while
  return Pm

Algorithm 3 crossOver(PatternSets P)
  index = 0
  Pc = {}
  size = noOfPatternset(P)
  while index < size do
      Πm = randomly take a pattern set from P
      Πf = randomly take a pattern set from P
      Πo = uniformCrossOver(Πm, Πf)
      while Πo ∈ Pc do
          Πo = uniformCrossOver(Πm, Πf)
      end while
      Pc = Pc ∪ {Πo}
      index++
  end while
  return Pc

Algorithm 4 changePopulation(perChange, PatternSets P)
  noOfChange = (perChange × sizeOf(P)) / 100
  remove the lowest-scoring noOfChange Π from P
  i = 1
  while i ≤ noOfChange do
      Π = randomly create a valid pattern set with k items
      while Π ∈ P do
          Π = randomly create a valid pattern set with k items
      end while
      P = P ∪ {Π}
      i++
  end while
  return P

Algorithm 5 largeNeighbourhoodSearch()
  noOfBitToChange = 1
  Π = randomly create a valid pattern set with k items
  while not timeout do
      P = create 2^noOfBitToChange neighbours of Π
      Π∗ = find best individual from P
      if getObjectiveScore(Π∗) > getObjectiveScore(Π) then
          Π = Π∗
      end if
      if Π remains the same for 100 iterations then
          noOfBitToChange++
      end if
  end while
  return Π
In our implementation, the number of neighbors created for a pattern set is 2^n, where n = noOfBitToChange. When generating neighbors, we first create 2^1 neighbors with n = 1. If this does not give good results for 100 iterations, we increment n by 1, and we repeat this whenever LNS is stuck for 100 iterations. To create the neighbors of a pattern set, we randomly choose an itemset from that pattern set and then randomly choose an item from that itemset, doing this n times, since each item is represented by a Boolean value. So if we create all possible neighbors for three items, the number of neighbors is 2^3; for n items, it is 2^n, as sketched below.
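The following Java sketch (hypothetical names; validity checking against the exclusive-group constraint is omitted, as it would be applied afterwards) shows one way to enumerate the 2^n neighbours for n chosen bit positions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch (hypothetical names): enumerate the 2^n LNS neighbours obtained by
// assigning every on/off combination to n chosen bit positions.
public class NeighbourDemo {

    static List<boolean[]> neighbours(boolean[] pattern, int[] positions) {
        int n = positions.length;
        List<boolean[]> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] nb = pattern.clone();
            for (int b = 0; b < n; b++) {
                nb[positions[b]] = ((mask >> b) & 1) == 1; // bit b of mask sets position b
            }
            result.add(nb);
        }
        return result; // 2^n candidate neighbours, including the pattern itself
    }

    public static void main(String[] args) {
        boolean[] p = {true, false, false, true, false};
        for (boolean[] nb : neighbours(p, new int[]{1, 3})) // n = 2 gives 4 neighbours
            System.out.println(Arrays.toString(nb));
    }
}
```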
Algorithm 6 hillClimbing()
  Π∗ = randomly create a valid pattern set with k items
  bestScore = getObjectiveScore(Π∗)
  while not timeout do
      Π = generate a valid neighbor from Π∗
      currentScore = getObjectiveScore(Π)
      if currentScore > bestScore then
          Π∗ = Π
          bestScore = currentScore
      end if
  end while
  return Π∗
C. Hill Climbing with Single Neighbor

For hill climbing (shown in Algorithm 6), we create a valid pattern set Π∗ and start a loop that runs for 1 minute. In each iteration we create a single neighbor Π of Π∗. If this new neighbor scores higher than Π∗, we copy the value of the new neighbor into Π∗ and then create a new neighbor of Π∗. The cycle goes on until the time is up.
D. Random Walk

In random walk (shown in Algorithm 7), we create a valid pattern set Π, then create another pattern set Π∗ and copy the value of Π into it. We then start a loop that runs for 1 minute. In each iteration we change Π by creating a new valid pattern set and compare its value with Π∗; if the score of Π is greater, we copy Π into Π∗, and then change Π again by creating another random pattern set. This procedure runs for 1 minute, after which we take the score of Π∗.
V. EXPERIMENTAL RESULTS

We implemented all algorithms in the Java language and ran our experiments on an Intel Core i3 2.27 GHz machine with 4 GB RAM running 64-bit Windows 7 Home Premium.
TABLE II: Description of datasets.

| Data Set      | Items | Transactions |
|---------------|-------|--------------|
| Tic-tac-toe   | 27    | 958          |
| Primary-tumor | 31    | 336          |
| Soybean       | 50    | 630          |
| Hypothyroid   | 88    | 3247         |
| Mushroom      | 119   | 8124         |
A. Dataset

The datasets that we use are taken from the UCI Machine Learning Repository [2] and were originally used in [6]. They can be downloaded freely from https://dtai.cs.kuleuven.be/CP4IM/datasets/ and are listed in Table II with their properties.
B. Results

We implemented four algorithms and tested them with various population sizes. We calculated the objective score of each algorithm for k-pattern sets with k = 2, 3, 6, 9, 10 and population size 100. For each algorithm, we used the five datasets whose transaction and item counts are given in Table II. We collected the scores by running all algorithms for 1 minute. For each test case, we ran the code five times and took the best score and the average score, which are given in Table III. For other population sizes, the performance of all algorithms remains the same.
Algorithm 7 randomWalk()
  bestScore = −∞
  Π∗ = ∅
  while not timeout do
      Π = randomly create a valid pattern set with k items
      currentScore = getObjectiveScore(Π)
      if currentScore > bestScore then
          Π∗ = Π
          bestScore = currentScore
      end if
  end while
  return Π∗

Fig. 1: Search progress of genetic algorithm for the tic-tac-toe dataset with pattern size k = 6. (a) Average; (b) Best. [Both panels plot the objective score against population sizes ranging from 10 to 2000.]
C. Analysis

From Table III, we find that the genetic algorithm almost always performs better than the other algorithms; in a few cases, LNS performs better than the genetic algorithm, while the performance of random walk and hill climbing is poor. We found that as the pattern set size increases, GA tends to work better compared to the other algorithms: when k = 2 or k = 3, LNS and random walk give the best values, but when k = 9 or k = 10, GA gives the highest values. We also found that when the number of itemsets becomes too large, the genetic algorithm performs poorly; thus, too small or too large a population size gives bad results because the calculation becomes too expensive. The genetic algorithm works best when changing 90% of the population using random restart.

Fig. 1 shows the performance of GA with respect to population size for the tic-tac-toe dataset. We tested our GA with different population sizes from 10 to 2000; for each population size, we ran the code five times and took the best and the average objective score. In Fig. 1(a), GA gives the best results when the population size is between 40 and 500; beyond 500, the objective score decreases. In Fig. 1(b), GA gives the best results when the population size is between 10 and 1000, but when it exceeds 1000, the objective score decreases. So the genetic algorithm works better with a large population size, but when the population becomes too big, it does not perform well in the allocated time because the calculations become too expensive.

Fig. 2 shows the performance of the search algorithms based on their average objective scores, shown as vertical bars. We ran all the algorithms for 1 minute on all the datasets with different pattern set sizes, k = 2, 3, 6, 9, 10, and population size 100. We found that the genetic algorithm always gives good results compared to the other algorithms, and sometimes LNS gives results nearly as good as the genetic algorithm.
TABLE III: Objective score achieved by different algorithms for various datasets with different sizes of pattern sets k (average and best over five 1-minute runs, population size 100).

| Data set      | k  | Random Walk Avg. | Random Walk Best | Hill Climbing Avg. | Hill Climbing Best | LNS Avg. | LNS Best | Genetic Algorithm Avg. | Genetic Algorithm Best |
|---------------|----|------------------|------------------|--------------------|--------------------|----------|----------|------------------------|------------------------|
| Tic-tac-toe   | 2  | 771              | 798              | 516.8              | 753                | 762      | 798      | 798                    | 798                    |
| Tic-tac-toe   | 3  | 1491.4           | 1690             | 1432.2             | 1593               | 1825.6   | 1916     | 1916                   | 1916                   |
| Tic-tac-toe   | 6  | 5355             | 5380             | 7004.4             | 7653               | 7758     | 7791     | 7938                   | 7938                   |
| Tic-tac-toe   | 9  | 17517.6          | 18224            | 15977.6            | 16972              | 18097.6  | 17858    | 18458.4                | 18624                  |
| Tic-tac-toe   | 10 | 11393.8          | 12764            | 19963              | 21496              | 22235.2  | 22748    | 22731.4                | 22816                  |
| Mushroom      | 2  | 3388             | 4936             | 0                  | 0                  | 1362.4   | 6812     | 8124                   | 8124                   |
| Mushroom      | 3  | 6889.6           | 14576            | 3249.6             | 16248              | 2070.4   | 10352    | 16248                  | 16248                  |
| Mushroom      | 6  | 27260            | 37440            | 0                  | 0                  | 0        | 0        | 58734                  | 64992                  |
| Mushroom      | 9  | 33955.2          | 43216            | 20960              | 63392              | 0        | 0        | 103932                 | 142452                 |
| Mushroom      | 10 | 34117.2          | 46584            | 28868.4            | 73116              | 0        | 0        | 107529.6               | 130944                 |
| Hypothyroid   | 2  | 439.6            | 562              | 324.4              | 1622               | 649.4    | 3247     | 2736.4                 | 3247                   |
| Hypothyroid   | 3  | 937.2            | 1484             | 0                  | 0                  | 0        | 0        | 5876                   | 6494                   |
| Hypothyroid   | 6  | 2277             | 3405             | 0                  | 0                  | 0        | 0        | 12549.4                | 16325                  |
| Hypothyroid   | 9  | 3732.8           | 5864             | 0                  | 0                  | 5193.6   | 25968    | 24234.8                | 27556                  |
| Hypothyroid   | 10 | 5916.6           | 9333             | 11689.2            | 29223              | 0        | 0        | 17629.8                | 21726                  |
| Soybean       | 2  | 624              | 624              | 0                  | 0                  | 374.5    | 624      | 630                    | 630                    |
| Soybean       | 3  | 1242.4           | 1248             | 260.4              | 1136               | 1168.8   | 1248     | 1260                   | 1260                   |
| Soybean       | 6  | 3155             | 3438             | 3304.2             | 5076               | 4023.8   | 4992     | 5642.8                 | 5664                   |
| Soybean       | 9  | 5246.8           | 5778             | 3770               | 5634               | 11113.6  | 12568    | 12547.2                | 12598                  |
| Soybean       | 10 | 6409             | 7597             | 9406.2             | 12000              | 7653.8   | 12090    | 15531.2                | 15696                  |
| Primary-tumor | 2  | 326.4            | 329              | 238                | 336                | 334.6    | 336      | 336                    | 336                    |
| Primary-tumor | 3  | 647.6            | 658              | 540.4              | 672                | 672      | 672      | 672                    | 672                    |
| Primary-tumor | 6  | 2115.8           | 2453             | 2944               | 3017               | 3001.4   | 3018     | 3013.6                 | 3024                   |
| Primary-tumor | 9  | 3833.2           | 4372             | 6616.4             | 6710               | 6682     | 6712     | 6715.2                 | 6720                   |
| Primary-tumor | 10 | 4539             | 4897             | 7576.2             | 8336               | 8343.4   | 8393     | 8351.4                 | 8376                   |

Fig. 2: Bar diagram showing comparison of average objective score achieved by different algorithms for k = 2, 3, 6, 9, 10. Panels: (a) Tic-Tac-Toe, (b) Mushroom, (c) Soybean, (d) Hypothyroid, (e) Primary-tumor. [Each panel plots the average objective score (vertical bars) of Random Walk, Hill Climbing, LNS and Genetic Algorithm against pattern set size k.]
For the mushroom and hypothyroid datasets, in a few cases the objective scores of LNS and hill climbing are zero because the number of items in these datasets (shown in Table II) is too large.
In Fig. 3, we show the performance of the different search algorithms for the tic-tac-toe dataset with pattern set size 6 and population size 100; the figure plots the average objective scores of the search algorithms over time. From the figure, we find that random walk performs poorly as usual, while hill climbing improves very quickly using a single neighbor, and LNS performs very well, coming close to the genetic algorithm. The genetic algorithm always gives the best result.
Fig. 3: Comparison of average objective score achieved by different algorithms for the tic-tac-toe dataset with pattern size k = 6. [The figure plots objective score against time (5–40 seconds) for Random Walk, LNS, Hill Climbing and Genetic Algorithm.]
VI. CONCLUSION
In this paper, we presented a new genetic algorithm that combines three different enhancement techniques: i) a relative encoding technique; ii) a twin removal technique; and iii) a random restart based stagnation recovery technique. We compared our results with state-of-the-art local search algorithms and found that our final GA, which uses a combination of all three enhancements, significantly outperforms the current local search approaches. The genetic algorithm almost always gives good results within a very short period of time compared to the other algorithms.
We also proposed an interestingness measure, an XOR-based dispersion score, obtained by analyzing the extent to which the items in the patterns belong to different categories, and we compared the different search strategies for the dispersion score on the pattern set mining tasks. It remains to be seen to how many other tasks, for example concept learning, this observation extends. Similarly, it remains to be seen how many pattern set mining tasks can be modelled in terms of constraints, for example learning decision lists. Finally, we restricted this study to pattern set mining; we believe there is a huge opportunity for general declarative tools for data mining and machine learning at large.
In the future, we would like to improve the performance of the search techniques of the GA for large population sizes within the GA framework, by designing a new genetic operator and applying it in a similar way to crossover and mutation.
Shanjida Khatun received her B.Sc. and M.Sc. degrees, both in Computer Science and Engineering, from Ahsanullah University of Science and Technology in June 2012 and United International University in September 2015, respectively. Since October 2012, she has been with the Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, as a Lecturer. Her research interests include artificial intelligence, data mining, meta-heuristic search, bioinformatics and computational biology.
Swakkhar Shatabda received his B.Sc. degree in Computer Science and Engineering in 2007 from Bangladesh University of Engineering and Technology (BUET) and the Ph.D. degree in Bioinformatics and Computational Biology in 2014 from Griffith University, Australia. He also worked as a graduate researcher at National ICT Australia (NICTA) from 2010 until 2014. He is currently an Assistant Professor and Undergraduate Program Coordinator in the Department of Computer Science and Engineering of United International University, Bangladesh. His research interests include bioinformatics, protein fold and structural class prediction problems, protein structure and function prediction problems, data mining, statistical learning theory, pattern recognition, graph theory, algorithms and machine learning.
REFERENCES
[1] B. Bringmann, S. Nijssen, N. Tatti, J. Vreeken, and A. Zimmerman, "Mining sets of patterns," Tutorial at ECML/PKDD, 2010.
[2] A. Frank, A. Asuncion et al., "UCI machine learning repository," 2010.
[3] A. M. Frisch, W. Harvey, C. Jefferson, B. Martínez-Hernández, and I. Miguel, "Essence: A constraint language for specifying combinatorial problems," Constraints, vol. 13, no. 3, pp. 268–306, 2008.
[4] T. Guns, S. Nijssen, and L. De Raedt, "Itemset mining: A constraint programming perspective," Artificial Intelligence, vol. 175, no. 12, pp. 1951–1983, 2011.
[5] T. Guns, S. Nijssen, and L. De Raedt, "k-pattern set mining under constraints," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 2, pp. 402–418, 2013.
[6] T. Guns, S. Nijssen, A. Zimmermann, and L. De Raedt, "Declarative heuristic search for pattern set mining," in 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW). IEEE, 2011, pp. 1104–1111.
[7] P. V. Hentenryck and L. Michel, Constraint-Based Local Search. The MIT Press, 2009.
[8] S. Khatun, H. U. Alam, and S. Shatabda, "An efficient genetic algorithm for discovering diverse-frequent patterns," 2015, pp. 120–126.
[9] K. Marriott, N. Nethercote, R. Rafeh, P. J. Stuckey, M. G. De La Banda, and M. Wallace, "The design of the Zinc modelling language," Constraints, vol. 13, no. 3, pp. 229–267, 2008.
[10] F. Rossi, P. Van Beek, and T. Walsh, Handbook of Constraint Programming. Elsevier, 2006.
[11] U. Rückert and S. Kramer, "Optimizing feature sets for structured data," in Machine Learning: ECML 2007. Springer, 2007, pp. 716–723.
[12] P. Shaw, "Using constraint programming and local search methods to solve vehicle routing problems," in Principles and Practice of Constraint Programming — CP98. Springer, 1998, pp. 417–431.
[13] Gecode Team, "Gecode: Generic constraint development environment," 2006.