Improving Contrast Set Mining
Mondelle Simeon
Robert Hilderman
Department of Computer Science
University of Regina
Regina, Saskatchewan, Canada S4S 0A2
Email: {simeon2m, hilder}@cs.uregina.ca
Abstract
A fundamental task in exploratory data analysis is discerning the differences between contrasting groups. Contrast set mining has been developed as a data mining task that aims to identify the differences between these groups. This paper examines the algorithms, heuristics, and open issues of contrast set mining, and seeks to improve contrast set mining by addressing several of the open issues. It proposes four interestingness measures for ranking contrast sets: coverage, overall support, growth rate, and unusualness. It introduces a new method to discretize quantitative attributes. A new type of contrast set, called the jumping contrast set, is defined, and the contrast set mining process is modified to include mining both types of contrast sets on datasets containing both quantitative and categorical attributes. Finally, a simple visualization method is introduced to describe contrast sets to the end-user.

1 Introduction

Discovering the differences between groups is a fundamental problem in many disciplines. For example, a medical researcher might be interested in uncovering "What is the difference between males and females affected by a disease, such as diabetes?". He may discover that women who have a waist size greater than 40 inches are twice as likely to have diabetes as men who have the same waist size. Groups are defined by a selected property that distinguishes one group from the other. The search for group differences can be applied to patients, organizations, molecules, and even timelines. Contrast set mining was developed by Bay and Pazzani (1999) to automatically detect all the differences between contrasting groups from observational multivariate data. Contrast set mining seeks conjunctions of attributes and values, called contrast sets, that have different levels of support in different groups.

This paper examines the algorithms, heuristics, and open issues of contrast set mining, and seeks to improve contrast set mining by addressing four open issues. It proposes four interestingness measures for ranking contrast sets. It also examines three ways of comparing contrasting groups to determine which contrast sets are significant. Several methods for discretizing quantitative attributes are discussed, and a new discretization method based on statistical measures is proposed. This paper also defines a new type of contrast set, called the jumping contrast set, and expands the contrast set mining process to include mining both the original contrast sets and jumping contrast sets on datasets with both categorical and quantitative attributes. Finally, this paper introduces a simple visualization method to describe contrast sets to the end-user.

2 Contrast Set Mining

Contrast set mining is closely related to association rule mining, and utilizes some of the terminology and notation of association rule mining (Agrawal et al., 1993). Contrast sets were first defined by Bay and Pazzani (1999) as conjunctions of attributes and values that differ meaningfully in their distributions across groups. They further explored contrast set mining in Bay and Pazzani (2001).

2.1 Contrast Set Mining Algorithm
The STUCCO (Search and Testing for Understandable Consistent Contrasts) algorithm, which is based on the Max-Miner rule discovery algorithm (Bayardo, 1998), was introduced by Bay and Pazzani (1999). The algorithm discovers a set of contrast sets along with their supports on groups. Let A1, A2, ..., Ak be a set of k variables called attributes. Each Ai can take on values from the set {Vi1, Vi2, ..., Vim}. Then a contrast set, CSet, is a conjunction of attribute-value pairs defined on mutually exclusive groups G1, G2, ..., Gn, with no Ai occurring more than once and Gi ∩ Gj = ∅ for all i ≠ j. The support of a contrast set with respect to a group Gi, represented as support(CSet, Gi), is the percentage of examples in Gi for which the contrast set is true. STUCCO specifically finds significant contrast sets by testing the null hypothesis that contrast-set support is equal across all groups, or alternatively, that contrast-set support is independent of group membership. Formally, the sought deviation is defined as:

   ∃ i, j : P(CSet = True | Gi) ≠ P(CSet = True | Gj)   (1)

   max_{i,j} |support(CSet, Gi) − support(CSet, Gj)| ≥ δ   (2)

where δ is a user-defined threshold called the minimum support difference. Contrast sets for which Equation 1 is statistically valid are called significant, and contrast sets for which Equation 2 is met are called large. A potential contrast set is discarded if it fails the test for independence, measured with the chi-squared (χ²) test, with respect to the group variable, Gi. In order to limit the possibly high probability of false rejection, STUCCO employs a variant of the Bonferroni correction for multiple hypothesis tests, which uses more stringent critical values for the statistical tests as the number of conditions in the contrast set is increased.
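Both conditions can be checked directly from per-group counts. The following is a minimal sketch, not the authors' implementation: contrast sets are represented as hypothetical dictionaries of attribute-value pairs, groups as lists of records, and the chi-squared statistic is computed from the 2 × n contingency table of matches and non-matches per group (assuming all row and column totals are non-zero).

```python
from itertools import combinations

def support(cset, group):
    """support(CSet, Gi): fraction of examples in the group for which
    every attribute-value pair of the contrast set holds."""
    hits = sum(all(row.get(a) == v for a, v in cset.items()) for row in group)
    return hits / len(group)

def is_large(cset, groups, delta):
    """Equation 2: the maximum pairwise support difference meets the
    minimum support difference threshold delta."""
    sups = [support(cset, g) for g in groups]
    return max(abs(si - sj) for si, sj in combinations(sups, 2)) >= delta

def chi2_statistic(cset, groups):
    """Chi-squared statistic over the 2 x n table of (matches,
    non-matches) per group, used to test Equation 1."""
    counts = []
    for g in groups:
        hits = sum(all(r.get(a) == v for a, v in cset.items()) for r in g)
        counts.append((hits, len(g)))
    total = sum(n for _, n in counts)
    total_hits = sum(h for h, _ in counts)
    stat = 0.0
    for hits, n in counts:
        for observed, row_total in ((hits, total_hits), (n - hits, total - total_hits)):
            expected = row_total * n / total  # expected count under independence
            stat += (observed - expected) ** 2 / expected
    return stat
```

The resulting statistic would then be compared against a χ² critical value (with the Bonferroni-style adjustment described above) to decide significance.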
In the contrast set mining process, the search space for potential contrast sets can be represented in a tree structure having every possible combination of the attributes. For a dataset that has three categorical attributes, '1', '2', and '3', in addition to a class attribute, Figure 1 shows the search tree.

Figure 1: Example search tree for a dataset with three attributes

The '{}' at the root of the tree represents the entire dataset. The search process begins with the most general terms first, shown as 'Level 1' in Figure 1. In moving to 'Level 2', conjunctions of the attributes would then be examined. At this level, there are only three contrast sets to examine because, for instance, '{1,2}' represents the same instances in the dataset as '{2,1}'. More complex combinations would be examined in subsequent levels. The number of levels in the search tree is equal to the number of attributes. The number of contrast sets on each level is equal to the number of combinations of n attributes taken i at a time, represented as C(n, i), where n is the number of attributes and i is the level in the tree. The total number of contrast sets is equal to Σ_{i=1}^{n} C(n, i).

STUCCO prunes portions of the search space that are determined to produce contrast sets which fail to meet Equations 1 and 2. It utilizes three types of pruning methods: effect size pruning, statistical significance pruning, and interest based pruning. Effect size pruning requires that the maximum difference between the support of the two groups in the contrast set be greater than some predetermined threshold. Statistical significance pruning requires that there are enough data points to have a valid chi-squared test, or that the maximum value the chi-squared statistic can take is large enough to be significant. Interest based pruning requires that the contrast sets selected are interesting enough, where interest is measured as representing new information relative to what is already known from already selected contrast sets.

Figure 2 summarizes the algorithm. A search tree is constructed for all possible itemsets as in Figure 1, the minimum support is applied to generate candidate itemsets, filtering conditions are applied to produce candidate contrast sets, and independence tests are performed to discover the candidate contrast sets.

Figure 2: Procedure for mining contrast sets

2.2 Related Research on Contrast Set Mining

An algorithm to discover negative contrast sets, which can include negation of terms in the contrast set, was proposed by Wong and Tseng (2005). Their algorithm utilizes the same steps as in Figure 2 to discover negative contrast sets; however, they utilize Holm's sequential rejective method (Holm, 1979) for the independence test.

A different approach to contrast set mining called CIGAR (Contrasting Grouped Association Rules) was proposed by Hilderman and Peckham (2005). CIGAR not only considers whether the difference in support between groups is significant, it also specifically identifies which pairs of groups are significantly different and whether the attributes in a contrast set are correlated. CIGAR utilizes the same general approach as STUCCO; however, it uses a series of 2 × 2 contingency tables to determine whether a difference exists between contrast sets in two or more groups. In contrast to STUCCO, CIGAR focuses on controlling Type II error by increasing the significance level for the significance tests, and by not correcting for multiple comparisons. CIGAR utilizes a correlation pruning technique that compares the correlation difference, r, between parents and children in the search space and rejects candidate contrast sets where r is less than a predetermined threshold.

A general formulation of interesting contrast rules was proposed by Minaei-Bidgoli et al. (2004). They developed an algorithm to discover a set of contrast rules by investigating three statistical measures: difference of confidence, difference of proportion, and correlation and chi-square. Their algorithm allows for very low minimum support values, which allows for the discovery of rules that would normally have been overlooked.

The previous contrast set mining approaches focused on discrete data, which is not frequently the case in real-world data. However, few attempts have been made at discovering contrast sets on data which contain continuous attributes. A formal notion of a time series contrast set was introduced by Lin and Keogh (2006). They proposed an efficient algorithm to discover time series contrast sets on time series and multimedia data. The algorithm utilizes a SAX alphabet (Lin et al., 2003) to convert continuous data to discrete data (discretization).

A modified equal-width binning interval approach to discretizing continuous-valued attributes, where the approximate width of the intervals is provided as a parameter to the model, was proposed by Simeon and Hilderman (2007). These intervals are then used in generating contrast sets where the consequent in the rules contains up to two continuous-valued attributes. The approach used is similar to Figure 2, with the discretization process taking place before determining all possible itemsets. The Student's t-test and z-test are used to determine independence. Both tests require at least two items for each group, and this also serves as the minimum support requirement. An objective measure for identifying and ranking potentially interesting contrast sets, referred to as the distribution difference, was also proposed.

2.3 Open issues in Contrast Set Mining

A study to compare contrast set mining with existing rule-discovery techniques was undertaken by Webb et al. (2003). They compared a commercial rule-discovery system, called Magnum Opus (Webb, 2001), with STUCCO and concluded that contrast set mining is equivalent to a special case of a general rule-discovery task. They identified several open issues in contrast set mining which require further research: the development of filters to guard against spurious contrasts as well as preventing the removal of potentially valuable contrasts; the development of appropriate methods to describe the contrasts that are discovered; and the investigation of the impact of various independence tests on the contrast sets which are discovered. Since their publication, the work by Simeon and Hilderman (2007) is the only attempt to directly address one of these issues, through the development of an objective measure for identifying and ranking contrast sets.
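The level-wise search tree enumeration described in Section 2.1 can be sketched with the standard library; this is an illustrative sketch, not the STUCCO implementation, using a hypothetical three-attribute dataset.

```python
from itertools import combinations
from math import comb

def search_tree_levels(attributes):
    """Enumerate the nodes of the contrast set search tree level by
    level: level i holds every i-attribute combination, since order
    within a conjunction does not matter ({1,2} == {2,1})."""
    n = len(attributes)
    return {i: list(combinations(attributes, i)) for i in range(1, n + 1)}

levels = search_tree_levels(["1", "2", "3"])
for i, nodes in levels.items():
    # C(n, i) nodes on level i, e.g. 3 two-attribute conjunctions.
    assert len(nodes) == comb(3, i)
# The total over all levels, sum of C(n, i), is 2^n - 1.
assert sum(len(nodes) for nodes in levels.values()) == 2 ** 3 - 1
```

For the three-attribute tree of Figure 1 this gives 3 + 3 + 1 = 7 candidate conjunctions in total.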
3 Addressing the Open Issues in Contrast Set Mining

Several open issues in contrast set mining were identified in Section 2.3. These issues, among others, need to be addressed to further develop contrast set mining into a more useful data mining task. The most urgent issues which need to be addressed are the development of additional heuristics for ranking contrast sets, the comparison of contrasting groups for independence testing, the discretization and classification of quantitative data, and the visualization of contrast sets.

In order to distinguish amongst the contrast sets which are returned, four new interestingness measures are proposed to rank the contrast sets: overall support, an extension of the support measure by Bay and Pazzani (1999), growth rate, coverage, and unusualness. The latter three measures were adapted from the areas of emerging patterns (Dong and Li, 1999) and subgroup discovery (Klösgen, 1996; Kavsek and Lavrac, 2006; Lavrac et al., 2004), because of their provable compatibility with contrast set mining. The overall support for a contrast set, X, is defined as the sum of the supports of X over all the groups. The growth rate of a contrast set, X, is defined as the maximum ratio of the supports in Gi over Gj for all pairs of the groups. When the support of X in Gi is zero and the support of X in Gj is not equal to zero, this creates a new type of contrast set, called the jumping contrast set. The unusualness of a contrast set, X, is defined as the maximum conditional probability of a group Gi given that X is satisfied. The coverage of a contrast set, X, is defined as the percentage of instances in the dataset covered by X.

In the contrast set mining process, there are three ways of inducing rules in multi-class learning problems: learners either induce the rules that characterize one class compared to the rest of the data, referred to as one versus all; or they search for rules that discriminate between all pairs of classes, referred to as round robin (Kralj et al., 2007); or learners induce rules such that each class is represented separately and the classes are compared simultaneously (Bay and Pazzani, 1999), referred to as separate grouping. Initial analysis shows that for the same independence test and significance level, the same conclusion would not be drawn in all three approaches.

A new method is proposed to partition the values of quantitative attributes into intervals, referred to as discretization, in contrast set mining, based on the work of Aumann and Lindell (1999). Cut-points for the intervals are created based on the standard deviation of the values. Starting from the mean, left and right cut-points would be found by iteratively subtracting and adding, respectively, a user-defined factor of the standard deviation until the potential cut-points are outside of the range of the values. Each interval can be considered as a discrete value, whereby a statistical test such as the chi-squared test can be used to determine independence. However, the data points within each interval are continuous, and a statistical test such as the Student's t-test can also be used to determine independence. Can the test selection affect the contrast sets discovered? Initial analysis shows that the treatment of discretized values has significant effects on the contrast sets discovered.

Once the contrast sets have been discovered, a suitable method for describing them to end users is needed. One simple method is to use a bar chart to visualize the contrast sets. The first row would show areas or blocks which are in proportion to the distribution of the entire dataset. Each following row represents a single contrast set and its support, represented numerically and as a shaded area, in each group. The contrast sets would be ordered based on a ranking heuristic such as the growth rate. Figure 3 shows a model of this for four contrast sets and three groups with sample support values.

Figure 3: Bar graph visualization of contrast sets
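The four measures defined in Section 3 can be sketched as follows. This is one reading of the definitions, not the authors' code: supports are hypothetical per-group support fractions, group sizes are hypothetical counts, and the conditional probability for unusualness is estimated from the covered counts.

```python
def overall_support(sups):
    """Sum of the supports of X over all groups."""
    return sum(sups)

def growth_rate(sups):
    """Maximum ratio of supports over all ordered pairs of groups;
    infinity marks a jumping contrast set (zero support in one
    group, non-zero support in another)."""
    rates = []
    for i, si in enumerate(sups):
        for j, sj in enumerate(sups):
            if i == j:
                continue
            if sj == 0:
                rates.append(float("inf") if si > 0 else 0.0)
            else:
                rates.append(si / sj)
    return max(rates)

def unusualness(sups, group_sizes):
    """Maximum conditional probability P(Gi | X) over the groups,
    estimated from the number of covered instances per group."""
    counts = [s * n for s, n in zip(sups, group_sizes)]
    covered = sum(counts)
    return max(c / covered for c in counts)

def coverage(sups, group_sizes):
    """Fraction of instances in the whole dataset covered by X."""
    covered = sum(s * n for s, n in zip(sups, group_sizes))
    return covered / sum(group_sizes)
```

For example, a contrast set with supports 0.5 and 0.0 in two equal-sized groups is a jumping contrast set: its growth rate is infinite and its unusualness is 1.0.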
4 Re-defining the Contrast Set Mining Process

With the introduction of a new type of contrast set, and the expansion to quantitative attributes, the contrast set mining process in Figure 2 needs to be redefined to reflect these changes. Figure 4 shows the modified procedure for mining contrast sets.

Figure 4: Modified Contrast Set Mining Procedure

The total itemsets are produced from either or both of quantitative and categorical attributes, depending on the dataset, through the construction of a search tree. However, this tree is potentially larger than in Figure 1 because each quantitative attribute is discretized at every level in the tree, in order to create intervals that are most representative of the instances. Thus, each possible combination of the attributes has to be examined. This results in a maximum number of contrast sets to be examined of Σ_{i=1}^{n} P(n, i), where P(n, i) is the number of permutations of n attributes taken i at a time, n is the number of attributes, and i is the level in the tree. This maximum is achieved when all the attributes, except the class attribute, are quantitative. Figure 5, which uses the same notation as Figure 1, shows the modified search tree for the maximum.
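The inflated bound from Section 4 can be checked numerically. This sketch assumes the reading given above: because each quantitative attribute is re-discretized at every level, the order of expansion matters, so level i contributes P(n, i) nodes rather than C(n, i).

```python
from itertools import permutations
from math import perm

def quantitative_tree_size(n):
    """Maximum number of candidate contrast sets when all n non-class
    attributes are quantitative: level i contributes P(n, i) nodes,
    since attribute expansion order matters after re-discretization."""
    attrs = range(n)
    return sum(len(list(permutations(attrs, i))) for i in range(1, n + 1))

# For three quantitative attributes: P(3,1) + P(3,2) + P(3,3) = 3 + 6 + 6 = 15,
# versus 2^3 - 1 = 7 nodes in the purely categorical tree of Figure 1.
assert quantitative_tree_size(3) == sum(perm(3, i) for i in range(1, 4))
```

The gap between the two bounds widens quickly with n, which is why the modified process pays a real cost for handling quantitative attributes at every level.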
Figure 5: Example search tree for a dataset with three non-class attributes where all three are quantitative

From the total itemsets, those which do not meet the minimum support are removed. The remaining itemsets become the candidate itemsets. Itemsets which have, for example, the same support in each group would be removed as part of filtering. The remaining itemsets make up the candidate contrast sets. The minimum support requirement was changed to allow for the discovery of jumping contrast sets, where the growth rate is infinity. For the other contrast sets, an independence test is used to determine the significant contrast sets, where the support of the itemset in each group is above some threshold. The interestingness measures proposed, that is, overall support, growth rate, unusualness, and coverage, along with the distribution difference, would then be used to rank the contrast sets. It is certainly possible to use any of these measures as filtering methods as well. However, in making the contrast set mining process as applicable as possible to a wide variety of datasets, it may be prudent to limit the filtering and allow the ranking methods to guide the end-user to the most interesting contrast sets.

References

Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In SIGMOD '93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, New York, NY, USA, 1993. ACM.

Yonatan Aumann and Yehuda Lindell. A statistical theory for quantitative association rules. In Knowledge Discovery and Data Mining, pages 261–270, 1999.

Stephen D. Bay and Michael J. Pazzani. Detecting change in categorical data: mining contrast sets. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 302–306, New York, NY, USA, 1999. ACM.

Stephen D. Bay and Michael J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213–246, 2001.

Roberto J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD '98: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 85–93, New York, NY, USA, 1998. ACM.

Guozhu Dong and Jinyan Li. Efficient mining of emerging patterns: discovering trends and differences. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43–52, New York, NY, USA, 1999. ACM.

R.J. Hilderman and T. Peckham. A statistically sound alternative approach to mining contrast sets. In Proceedings of the 4th Australasian Data Mining Conference (AusDM'05), pages 157–172, December 2005.

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.

Branko Kavsek and Nada Lavrac. APRIORI-SD: Adapting association rule learning to subgroup discovery. Applied Artificial Intelligence, 20(7):543–583, 2006.

Willi Klösgen. Explora: a multipattern and multistrategy discovery assistant. In Advances in Knowledge Discovery and Data Mining, pages 249–271, 1996.

Petra Kralj, Nada Lavrac, Dragan Gamberger, and Antonija Krstacic. Contrast set mining for distinguishing between similar diseases. In Riccardo Bellazzi, Ameen Abu-Hanna, and Jim Hunter, editors, AIME, volume 4594 of Lecture Notes in Computer Science, pages 109–118. Springer, 2007.

Nada Lavrac, Branko Kavsek, Peter A. Flach, and Ljupco Todorovski. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5:153–188, 2004.

Jessica Lin and Eamonn J. Keogh. Group SAX: Extending the notion of contrast sets to time series and multimedia data. In PKDD, pages 284–296, 2006.

Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In DMKD '03: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 2–11, New York, NY, USA, 2003. ACM.

B. Minaei-Bidgoli, Pang-Ning Tan, and W.F. Punch. Mining interesting contrast rules for a web-based educational system. In Proceedings of the 2004 International Conference on Machine Learning and Applications, pages 320–327, December 2004.

Mondelle Simeon and Robert J. Hilderman. Exploratory quantitative contrast set mining: A discretization approach. In ICTAI '07: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence - Vol. 2, pages 124–131, Washington, DC, USA, 2007. IEEE Computer Society.

Geoffrey I. Webb, Shane Butler, and Douglas Newlands. On detecting differences between groups. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 256–265, New York, NY, USA, 2003. ACM.

G.I. Webb. Magnum Opus: Setting new standards in association rule discovery, 2001. URL www.rulequest.com.

Tzu-Tsung Wong and Kuo-Lung Tseng. Mining negative contrast sets from data with discrete attributes. Expert Systems with Applications, 29(2):401–407, 2005.