Universität des Saarlandes
Max-Planck-Institut für Informatik
Redescription Mining Over non-Binary
Data Sets Using Decision Trees
Masterarbeit im Fach Informatik
Master’s Thesis in Computer Science
von / by
Tetiana Zinchenko
angefertigt unter der Leitung von / supervised by
Dr. Pauli Miettinen
begutachtet von / reviewers
Dr. Pauli Miettinen
Prof. Dr. Gerhard Weikum
Saarbrücken, November 2014
Eidesstattliche Erklärung
Ich erkläre hiermit an Eides Statt, dass ich die vorliegende Arbeit selbstständig verfasst
und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.
Statement in Lieu of an Oath
I hereby confirm that I have written this thesis on my own and that I have not used any
other media or materials than the ones referred to in this thesis.
Einverständniserklärung
Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die
Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.
Declaration of Consent
I agree to make both versions of my thesis (with a passing grade) accessible to the public
by having them added to the library of the Computer Science Department.
Saarbrücken, November 2014
Tetiana Zinchenko
Acknowledgements
First of all, I would like to thank Dr. Pauli Miettinen for the opportunity to write my
Master's thesis under his supervision and for his support and encouragement during the
work on this thesis.
I would like to thank the International Max Planck Research School for Computer
Science for giving me the opportunity to study at Saarland University and for their
constant support throughout my studies.
Special thanks go to my husband for being the most supportive and inspiring person
in my life; he was the one who encouraged me to start and to finish this degree.
Abstract
Scientific data mining aims to extract useful information from huge data sets with the
help of computational methods. Scientists increasingly encounter an overload of data
describing domain entities from different sides. Many of these data sets provide
alternative means to organize information, and every alternative data set offers a
different perspective on the studied problem.
Redescription mining is a tool whose goal is to find various descriptions of the same
objects, i.e. to characterize entities from different perspectives. It is a knowledge
discovery tool which helps to reason uniformly across data of diverse origin and
integrates numerous forms of characterizing data sets.
Redescription mining has important applications, mainly in biology (e.g. finding
bioclimatic niches of species), bioinformatics (e.g. dependencies between genes can
assist in the analysis of diseases) and sociology (e.g. the exploration of statistical and
political data).
We initiate redescription mining with a data set consisting of two tables with Boolean
and/or real-valued attributes. In redescription mining we look for queries which
describe nearly the same objects in both tables.
Among the existing redescription mining algorithms there are approaches which exploit
alternating decision tree induction; so far, only Boolean variables have been involved
there. In this thesis we extend these approaches to non-Boolean data and adopt two
methods which allow redescription mining over non-binary data sets.
Contents

Acknowledgements
Abstract
Contents

1 Introduction
  1.1 Outline of Document

2 Preliminaries
  2.1 The Setting for Redescription Mining
  2.2 Query Languages
  2.3 Propositional Queries, Predicates and Statements
    2.3.1 Predicates
    2.3.2 Statements
  2.4 Exploration Strategies
    2.4.1 Mining and Pairing
    2.4.2 Greedy Atomic Updates
    2.4.3 Alternating Scheme

3 Related research
  3.1 Rule Discovery
  3.2 Decision Trees and Impurity Measures
  3.3 Redescription Mining Algorithms

4 Contributions
  4.1 Redescription Mining Over non-Binary Data Sets
  4.2 Algorithm 1
  4.3 Algorithm 2
  4.4 Stopping Criterion
  4.5 Extracting Redescriptions
  4.6 Extending to Fully non-Boolean Setting
    4.6.1 Data Discretization
  4.7 Quality of Redescriptions
    4.7.1 Support and Accuracy
    4.7.2 Assessing Significance

5 Experiments with Algorithms for Redescription Mining
  5.1 Finding Planted Redescriptions
  5.2 The Real-World Data Sets
  5.3 Experiments With Algorithms on Bio-climatic Data Set
    5.3.1 Discussion
  5.4 Experiments With Algorithms on Conference Data Set
    5.4.1 Discussion
  5.5 Experiments against ReReMi Algorithm

6 Conclusions and Future Work

Bibliography

A Redescription Sets from Experiments with Bio Data Set
B Redescription Sets from Experiments with DBLP Data Set
Chapter 1
Introduction
Nowadays we encounter massive amounts of data everywhere, and increasing computational
capabilities accelerate its generation and acquisition. This data can be of different
origins and describe diverse objects, which sets the stage for active data mining in the
scientific domain. Numerous techniques and approaches exist to find useful tendencies,
dependencies or underlying patterns in such data.
Data derived from scientific domains is usually less homogeneous and more massive than
data stemming from the business domain. Although many data mining techniques applied in
business also yield good results in science, more sophisticated and tailored methods are
needed to meet the needs arising in scientific applications.
According to Craford [12], there are two types of analytic tasks in science that can be
supported by data mining: first, discovery-driven mining, used to derive hypotheses;
second, verification-driven mining, used to support (or refute) hypotheses, i.e.
experiments. In this setting, hypothesis formation requires more refined approaches and
deeper domain-specific knowledge.
Facing such imposing data volumes, scientists experience an overload of data describing
domain entities. A related issue is that these data sets can offer alternative (and
sometimes even contradictory) perspectives on the studied phenomena. Thus, a universal
tool suitable for such data analysis is necessary to have on hand. Moreover, identifying
correspondences between interesting aspects of the studied data is a natural task in
many domains.
It is well known that viewing data from different perspectives is useful for a better
understanding of the whole concept. Redescription mining is aimed at exactly this. Its
ultimate goal is to find different ways of looking at data and to extract alternative
characterizations of the same (or nearly the same) objects. As the name suggests,
redescription mining learns a model from the data in order to describe it and to help
with the interpretability of the investigated results. A redescription is a way of
characterizing objects that can be described from at least two different sides. The
number of views can be larger than two, but the setting with two-sided data is the most
common.
The following example assists in understanding the concept of redescription mining:
Example 1. We consider a set of nine countries as our objects of interest, namely
Canada, Mexico, Mozambique, Chile, China, France, Russia, the United Kingdom and
the USA. A simple toy data set [48, 43, 63], consisting of four properties characterizing
these countries and represented as a Venn diagram in Figure 1.1, is also included.
Consider the couple of statements below:
1. Country outside the Americas with a land area of more than 8 million square kilometers.
2. Country is a permanent member of the UN Security Council and has a history of state
communism.
Figure 1.1: Geographic and geopolitical characteristics of countries represented as a
Venn diagram. Adapted from [48].
Blue - Located in the Americas
Green - History of state communism
Yellow - Land area above 8 million square kilometers
Red - Permanent member of the UN Security Council
Two countries (Russia and China) satisfy both statements. The statements give
alternative characterizations of the same subset of countries in terms of geographical
and geopolitical properties; thus, a redescription is formed. Its strength is given by
the symmetric Jaccard coefficient, which equals 1 here since both statements describe
exactly the same set of countries. The descriptors on either side of a derived
redescription can involve more than one property. This simple example provides an
intuition for the concept of redescription.
Thus, we are given a multi-view data set (in our case consisting of two sub-sets
describing the same objects with different features). For example, in the setting of the
niche-finding problem for mammals studied in [23, 49], one set records which species
live in particular regions, while the other set contains climatic data about the same
regions. A redescription mined for such a problem can be a statement that some mammal
resides in terrain where the average June temperature lies in a particular range.
Extracting such rules manually is very laborious, because it requires picking particular
species and investigating their peculiarities one by one.
In bioinformatics, redescription mining can be applied, for instance, to genes. In such
a case, finding the relevant dependencies without a suitable tool seems infeasible,
because the amount of data is enormous and very often incomplete. Redescriptions mined
with one of the existing methods are more informative and can reveal unexpected, useful
information in a domain. Of course, the usage of redescription mining techniques is not
limited to these two domains. However, to make use of the obtained redescriptions,
knowledge of the domain is highly recommended.
Some current redescription mining techniques are able to handle non-Boolean data without
pre-processing, which is claimed to be a better option than transforming the data sets
beforehand [18]. In a setting where one side of the data set is real-valued or
categorical, redescription mining has produced meaningful outcomes. However, when both
data sets contain real-valued entries, an exhaustive search is inevitable, which in turn
might impose an unwanted computational burden.
Besides this, redescription mining using decision trees, modified so that it can work
with numerical entries (at least on one side), might perform well and become a
competitive alternative to the aforementioned techniques. However, no such method has
been implemented so far; this is the starting point for the work conducted within this
thesis. A stretch goal of the project is an algorithm that allows both sides of the data
set to be non-binary.
Finally, the obtained outcomes are to be compared with redescription mining conducted by
existing methods. It is also useful to test the new method in a synthetic setting to
study the behavior and performance of the algorithms. After this, conclusions about the
quality of the method can be made.
1.1 Outline of Document
This Thesis is organized as follows:
• Chapter 1 provides an introduction to the topic.
• In Chapter 2 the problem of redescription mining is formalized. Sections 2.2 and 2.4
cover the query languages and exploration strategies that can be used within algorithms
for redescription mining.
• Chapter 3 is devoted to related research; namely, it covers other approaches which
share some features with redescription mining. Section 3.2 describes decision tree
induction methods together with impurity measures in detail. Section 3.3 is dedicated to
existing algorithms for mining redescriptions.
• Chapter 4 describes the contributions made within this thesis. In particular,
Sections 4.2 and 4.3 explain the two elaborated algorithms for redescription mining over
non-binary data sets using decision trees. In Section 4.7 we outline the way we evaluate
our results.
• In Chapter 5 all experiments are covered. In particular, Section 5.1 covers the
synthetic setting, and Sections 5.3 and 5.4 report the results and discussion of the
experiments on the real-world data sets, biological and bibliographic respectively. In
Section 5.5 we compare the results of our algorithms to the ReReMi algorithm [18].
• Finally, Chapter 6 contains the conclusions of the thesis.
Chapter 2
Preliminaries
2.1 The Setting for Redescription Mining
We denote by O a set of elementary objects and by A a set of attributes which
characterize properties of the objects or relations between them. The attributes
originate from different sources or terminologies, which are denoted as a set of views
V. A function v maps an attribute to the corresponding view: v : A → V. The data set can
be represented in the form of a triplet (O, A, v). Redescriptions are composed of
queries.
Definition 1. An expression formed with logical operators, expressed over attributes
in A and evaluated against the data set is called a query.
Q denotes the set of valid queries and is called the query language. In order to assess
a statement against a data set, one replaces the variables in the statement with objects
from the data set and identifies the substitutions for which the ground formula holds.
The support of a query q is the set of objects in the satisfying substitutions; we
denote it as supp(q). All feasible substitutions for queries in a query language are
called the set of entities and denoted as E. By att(q) we denote the set of attributes
that occur in a query q. The function v is extended to queries by taking the union of
the views of their attributes: v(q) = ∪A∈att(q) v(A).

To make sure that two queries describe the data from different views, their attribute
sets are required to come from disjoint sets of views. Similarity of supports is
captured by a symmetric binary relation ∼ acting as a Boolean indicator. Finally, a set
C denotes arbitrary constraints that can be applied to redescriptions; for example, to
ensure ease of interpretation, a maximal length of the set-theoretic expressions can be
imposed, or only conjunctions may be allowed. Having this formalism, a redescription can
be defined as follows:

Definition 2. Given a data set (O, A, v), a query language Q over A and a binary
relation ∼, a redescription is a pair of queries (qA, qB) ∈ Q × Q such that
v(qA) ∩ v(qB) = ∅ and supp(qA) ∼ supp(qB). Redescription mining is the process of
discovering such pairs.

The problem of redescription mining: given a data set (O, A, v) with query language Q
and the binary relation ∼, find all redescriptions that satisfy the constraints in C.
Example 2. (Based on Figure 1.1.) Here nine countries (UK, France, USA, Mexico, Chile,
Canada, China, Russia and Mozambique) form the set of objects. The attributes (Blue,
Yellow, Red, Green, or equivalently B, Y, R, G) are split into two views: G for
geography (which includes B and Y) and P for geopolitics (which includes R and G). Thus,
the set of attributes is written as A = {B, Y, R, G}; for example, v(B) = G. A first
query over the geographical attributes can be written as qG = ¬B ∧ Y. In our data set
this query is supported by two countries: supp(qG) = {Russia, China}.
The next step is a query over geopolitics, qP = R ∧ G. Again, when evaluated against our
data set, its support consists of the same two countries. Hence,
supp(qP) ∼ supp(qG). Moreover, v(qG) ∩ v(qP) = {G} ∩ {P} = ∅. Then, based on
Definition 2, (qG, qP) is a redescription.
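To make this concrete, the following short Python fragment evaluates the two queries of
Example 2 over an explicit encoding of the toy data. It is only an illustrative sketch:
the dictionary of truth values below approximates Figure 1.1 and, like the helper names,
is our own and not taken from the thesis.

```python
# Truth values approximating Figure 1.1 (B = located in the Americas,
# Y = land area above 8 million km^2, R = permanent member of the UN
# Security Council, G = history of state communism).
data = {
    "Canada":     {"B": 1, "Y": 1, "R": 0, "G": 0},
    "Chile":      {"B": 1, "Y": 0, "R": 0, "G": 0},
    "China":      {"B": 0, "Y": 1, "R": 1, "G": 1},
    "France":     {"B": 0, "Y": 0, "R": 1, "G": 0},
    "UK":         {"B": 0, "Y": 0, "R": 1, "G": 0},
    "Mexico":     {"B": 1, "Y": 0, "R": 0, "G": 0},
    "Mozambique": {"B": 0, "Y": 0, "R": 0, "G": 1},
    "Russia":     {"B": 0, "Y": 1, "R": 1, "G": 1},
    "USA":        {"B": 1, "Y": 1, "R": 1, "G": 0},
}

def support(query):
    """Set of countries satisfying the query."""
    return {c for c, a in data.items() if query(a)}

q_geo = lambda a: not a["B"] and a["Y"]   # qG = not B and Y (geographic view)
q_pol = lambda a: a["R"] and a["G"]       # qP = R and G     (geopolitical view)

s_geo, s_pol = support(q_geo), support(q_pol)
jaccard = len(s_geo & s_pol) / len(s_geo | s_pol)
print(sorted(s_geo), sorted(s_pol), jaccard)
# ['China', 'Russia'] ['China', 'Russia'] 1.0
```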
As can be derived from its name, redescription mining is an analysis focused on
describing. It is not meant to predict unknown data but rather to describe the available
data properly. In addition, the expressiveness and interpretability of the outcome
matter a great deal. Expressiveness can be determined by the variety of concepts that a
language can represent. Interpretability is more difficult to measure, since it refers
to the ease with which the associated meaning can be grasped. Nevertheless, simpler
queries facilitate the interpretability of an element of the language.
When solving any redescription mining task, a collection O of elementary objects
(samples) is considered. The attributes in A characterize the properties of these
objects. The set of views V denotes the various sources, domains or terminologies from
which the data originate.
Consider particular tasks, for example the biological niche-finding problem: climate
data on one side and fauna data on the other side create two fully diverse sets of
attributes that fit the setting of this problem. In a medically related problem, these
sets can be formed by personal information about patients' backgrounds, elements of
diagnoses and symptoms. Since redescriptions aim to find characterizations of the same
(or nearly the same) objects, we require that the attributes over which the two queries
of a redescription are expressed come from disjoint sets of views.
As already mentioned, we stick to the two-sided setting. This means there are two data
sets, denoted by L (for left) and R (for right), such that AL ∪ AR = A. When we have
multiple views, the correspondence between the elementary objects across the views might
not be available. This can happen because the sets of objects occurring in distinct
views do not coincide completely, or because some objects have many observations in one
view and a single observation in another. Setting up these correspondences is a
non-trivial task that forms a research question on its own [54].
The purpose of redescription mining is to find alternative characterizations of almost
the same objects. This means that the similarity of the supports of the queries
determines the quality of a derived redescription. A pair of queries is said to be
accurate if the queries have similar supports. More generally, the similarity relation
between support sets is determined by a similarity function f together with a threshold
σ such that the following holds:

Ea ∼ Eb ⟺ f(Ea, Eb) ≥ σ

The function f is usually chosen to be Jaccard's coefficient [27]. We use this
coefficient as our measure of choice for accuracy, but it can easily be replaced with
another set similarity function. We consider the similarity between the supports of the
queries of a redescription to be the main property of a redescription and call it
accuracy. Thus, a pair of queries is called accurate if their supports are similar,
meaning that their similarity passes the given threshold. The similarity coefficient
equals 1 when the two supports are identical; this means we have a perfect
redescription.
In practice, redescriptions with a similarity coefficient less than 1 are also useful in
many domains. A chain of such redescriptions can be used to connect seemingly
independent entities (applicable, for instance, in storytelling) or, in bioinformatics,
to help find genes responsible for a particular disease.
For a pair of queries (qL, qR), we denote the following subsets of entities:
1. E1,1 - entities that satisfy both queries (i.e. E1,1 = supp(qL) ∩ supp(qR))
2. E1,0 - entities that satisfy only the first query
3. E0,1 - entities that satisfy only the second query
4. E0,0 - entities that satisfy neither query.
Examples of similarity functions that can be applied include:
• matching number: |E1,1| + |E0,0|
• matching ratio: (|E1,1| + |E0,0|) / (|E1,1| + |E1,0| + |E0,1| + |E0,0|)
• Russell & Rao coefficient: |E1,1| / (|E1,1| + |E1,0| + |E0,1| + |E0,0|)
• Jaccard's coefficient: |E1,1| / (|E1,1| + |E1,0| + |E0,1|)
• Rogers & Tanimoto coefficient: (|E1,1| + |E0,0|) / (|E1,1| + 2(|E1,0| + |E0,1|) + |E0,0|)
• Dice coefficient: 2|E1,1| / (2|E1,1| + |E1,0| + |E0,1|)
The Jaccard coefficient is the most common choice for evaluating redescriptions. This is
due to its simplicity and its agreement with the symmetric approach adopted in
redescription mining: the Jaccard coefficient treats the supports of the two queries
equally. Moreover, it is scaled to the unit interval without involving the set E0,0 of
entities that satisfy neither query.
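These quantities are straightforward to compute from the two support sets; the following
Python fragment is a small illustration of our own (the function names are not from the
thesis).

```python
def confusion_sets(supp_left, supp_right, all_entities):
    """Split the entities into E11, E10, E01, E00 for a pair of supports."""
    e11 = supp_left & supp_right
    e10 = supp_left - supp_right
    e01 = supp_right - supp_left
    e00 = all_entities - (supp_left | supp_right)
    return e11, e10, e01, e00

def jaccard(supp_left, supp_right):
    """|E11| / (|E11| + |E10| + |E01|), i.e. intersection over union."""
    union = supp_left | supp_right
    return len(supp_left & supp_right) / len(union) if union else 1.0

def dice(supp_left, supp_right):
    """2|E11| / (2|E11| + |E10| + |E01|)."""
    denom = len(supp_left) + len(supp_right)
    return 2 * len(supp_left & supp_right) / denom if denom else 1.0

# Example: 10 entities, two queries with partially overlapping supports.
E = set(range(1, 11))
sL, sR = {1, 2, 3, 4}, {3, 4, 5}
print([len(s) for s in confusion_sets(sL, sR, E)])  # [2, 2, 1, 5]
print(jaccard(sL, sR), dice(sL, sR))                 # 0.4 0.571...
```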
2.2 Query Languages
The way we represent the results of redescription mining is determined by the query
language; it is an essential part of the whole redescription mining technique. Queries
are logical statements evaluated against the given data set. These statements are
obtained by combining distinct predicates using Boolean operators. Replacing the
predicate variables with objects from the given data set and verifying whether the
conditions of the predicates are satisfied returns a truth value. The objects which
satisfy a given query form the support of this query.
In this part we cover different types of query languages. In particular, we determine
the query structures which are used for redescription mining. They offer a
representation of logical combinations of constraints on the individual attributes.
Previous papers on redescription mining also discuss diverse formal representations of
queries and query languages [48, 20].
2.3 Propositional Queries, Predicates and Statements
The queries are logical statements evaluated against the data set. These statements are
built from atomic predicates over individual attributes, combined with Boolean
operators. Substituting predicate variables with objects from the data set and verifying
whether the conditions of the predicates are satisfied returns a truth value. The object
tuples in substitutions satisfying the statement form the support of the query.
We define a query language as the collection of acceptable queries, which depends on the
supported types of attributes, the principles for building predicates, and the syntactic
rules for combining them into statements.
In this thesis we focus on propositional data sets. They contain attributes
characterizing properties of individual objects. Sets of objects are deemed to be
homogeneous, i.e. every attribute applies to all objects.
A data set is called propositional if it contains attributes which characterize
properties of individual objects. In this setting, the values which the attributes from
A take form a matrix D. This matrix contains |O| rows, one row per object, and |A|
columns, one column per attribute. Thus, the value of an attribute Aj ∈ A is defined as
D(i, j) = Aj(oi) for objects oi ∈ O.
Let us consider an example from [17] to exemplify query languages. The data set in
Table 2.1 contains countries as objects. Each column represents some property of a
country (geographical details). This data can be expressed as a matrix G with 7 columns,
G = {G1, G2, ..., G7}, where Gn is a vector corresponding to some property, e.g. maximal
elevation, continent, etc.
Table 2.1: Example data set. World countries with their attributes.

Country         G1   G2   G3   G4   G5             G6     G7
Canada          0    1    0    1    N.America      9.98   5959
Chile           1    1    0    1    S.America      0.76   6893
China           0    0    0    1    Asia           9.71   8850
France          0    1    0    0    Europe         0.64   4810
Great Britain   0    1    0    0    Europe         0.24   1343
Mexico          0    1    0    1    N.America      1.96   5636
Mozambique      1    0    1    0    Africa         0.79   2436
Russia          0    0    0    1    Asia, Europe   17.1   5642
USA             0    1    0    1    N.America      9.63   6194
Here we have 7 vectors, constituting the following features:
1. G1 - Location in the Southern Hemisphere
2. G2 - Border with the Atlantic Ocean
3. G3 - Border with the Indian Ocean
4. G4 - Border with the Pacific Ocean
5. G5 - Location on a continent
6. G6 - Land area (in millions of km²)
7. G7 - Maximal elevation of the surface in meters
2.3.1 Predicates
Attributes take values from a range. By restricting the values to a selected subset of
that range we construct a predicate from an attribute. Consider some attribute Aj ∈ A
with range R. Having fixed a subset RS ⊆ R, it is possible to transform the associated
data column into a truth value assignment, i.e. to turn it into a Boolean vector which
indicates which values fall within the fixed subset. This predicate is denoted as
[Aj ∈ RS]. It selects the subset of objects whose value of attribute Aj lies in RS.
Membership in such a subset can be written as s(Aj, RS) = {oi ∈ O : Aj(oi) ∈ RS}. Based
on their range, attributes can be segregated into types: Boolean, nominal and
real-valued.

Boolean predicates. Boolean attributes take only two values: true or false, or,
equivalently, 1 or 0. A Boolean variable naturally forms a predicate. For simplicity, a
true value assignment (i.e. [A = true]) is written simply as A. Thus, [A = false] is the
complementary assignment, which can be written with negation (i.e. ¬A). From the example
above, vector G3 is a Boolean attribute corresponding to a predicate with the following
truth assignment for this data:

⟨0, 0, 0, 0, 0, 0, 1, 0, 0⟩
Thus, one country (namely Mozambique), which borders the Indian Ocean, is selected.

Nominal predicates. An attribute A is called a nominal attribute when its range is a
non-ordered set C or its power set. The elements of C are considered the categories of
the attribute A. To obtain a truth value assignment, a subset of the categories CS ⊆ C
is chosen, or, alternatively, a single category c ∈ C is selected. Thus, nominal
predicates are written as [A ∈ CS] and [A = c]. In practice, we consider only those
nominal attributes which take a single value. Nominal attributes with multiple values
are represented with the help of multiple Boolean attributes, i.e. one attribute for
each category.

From the above example, six countries border the Pacific Ocean; the predicate

G4 ∈ {Pacific Ocean}

is satisfied by the truth assignment ⟨1, 1, 1, 0, 0, 1, 0, 1, 1⟩. If we look at the
continent vector (G5), the attribute becomes multi-valued, because Russia falls into two
categories: Asia and Europe. In practice, multi-valued attributes are expressed via
several Boolean attributes, one per category.
Real-valued predicates. An attribute A is considered a real-valued attribute if its
range is formed from real numbers, R ⊆ ℝ. A truth value assignment can be derived from
any subset of R.

Nevertheless, for ease of interpretation, the truth value assignment is usually based on
a contiguous subset of R. That is, we use A ∈ [a, b] to denote an interval [a, b] ⊆ R.
For any given real-valued attribute there are infinitely many possible intervals, and
several of them may result in the same truth value assignment. Thus, the query language
must also involve a criterion to select one among such equivalent intervals. For
exemplification, consider the following:

G7 = ⟨5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194⟩

For a pair (a, b) with a ∈ (2000, 2200) and b ∈ (5000, 5500), the truth value assignment
will look like:

[a ≤ G7 ≤ b] = ⟨0, 0, 0, 1, 0, 0, 1, 0, 0⟩

Thus, we get several equivalent intervals for this truth value assignment, for example
[2200 ≤ G7 ≤ 5000] and [2436 ≤ G7 ≤ 4810]. The decision depends on whether rounded
bounds are considered to have better interpretability or not, which in turn depends on
the task or problem we work with. For instance, rounded bounds can be adopted when we
work with big data sets involving many countries, where the range of values is large
enough. For smaller data sets (e.g. like the one we consider here with 9 countries),
exact bounds might be more desirable, because they provide a more precise description of
each studied country.
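The construction of such truth value assignments is easy to sketch in code. The
following Python fragment is our own illustration (the variable handling, e.g. the
simplification of Russia's multi-valued continent entry, is an assumption for the sketch
and not from the thesis); it builds one predicate of each type over columns of
Table 2.1.

```python
# Columns of Table 2.1 (object order: Canada, Chile, China, France,
# Great Britain, Mexico, Mozambique, Russia, USA).
G3 = [0, 0, 0, 0, 0, 0, 1, 0, 0]    # Boolean: border with the Indian Ocean
G5 = ["N.America", "S.America", "Asia", "Europe", "Europe",
      "N.America", "Africa", "Asia", "N.America"]   # nominal: continent
                                                    # (Russia simplified to Asia)
G7 = [5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194]   # real-valued

# Boolean predicate [G3 = true]
boolean_pred = [int(v == 1) for v in G3]
# Nominal predicate [G5 in {Europe, Asia}]
nominal_pred = [int(v in {"Europe", "Asia"}) for v in G5]
# Real-valued (interval) predicate [2200 <= G7 <= 5000]
interval_pred = [int(2200 <= v <= 5000) for v in G7]

print(boolean_pred)    # [0, 0, 0, 0, 0, 0, 1, 0, 0]
print(interval_pred)   # [0, 0, 0, 1, 0, 0, 1, 0, 0]
```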
2.3.2 Statements
The predicates discussed previously are used as pieces to construct statements.
Propositional predicates are joined with the help of Boolean operators:

1. Negation - '¬';
2. Conjunction - '∧';
3. Disjunction - '∨'.

The truth assignment for a query is derived by combining the truth assignments of the
individual predicates. The resulting subset of objects is the support of the query.
Namely, the support of a query q on D, suppD(q), is the set {o ∈ O : q is true for o}.

For example, the query satisfied by countries that border the Atlantic Ocean but not the
Pacific Ocean and have a maximal elevation of less than 4500 meters looks as follows:

q1 = G2 ∧ ¬G4 ∧ (G7 < 4500)

The size of the support of this query is 1, since only Great Britain from our data set
is characterized by these features.
Let us now move to possible query languages which deploy the predicates and statements
from above. One of the most limited and restricted query languages is that of monotone
conjunctions: all predicates are allowed to be combined only with the conjunction
operator.
For example, the following query from the running example is a monotone conjunctive
query:

q2 = G1 ∧ G4 ∧ [2000 ≤ G7 ≤ 5000]

The first query q1 is not a member of this query language because it is not monotone.
These types of queries (monotone conjunctions) correspond to itemsets in which every
predicate represents an item. Itemsets are vigorously studied in the literature, and
algorithms to mine frequent itemsets have received considerable interest [24, 11].
For example, it is possible to partially order the queries by inclusion to exploit the
downward closure property: if some query qi is a subset of some query qj, then the
support of qi is a superset of qj's support. Thus, the search space can be explored more
efficiently. Monotone conjunctions are easy to find and interpret; at the same time, the
restriction on disjunctions and negations affects the expressiveness of the mined
queries.
The opposite of monotone conjunctions are unrestricted queries. Here predicates are
allowed to be combined using any of the above-mentioned operators without restriction.
This extreme case provides full expressiveness for the queries. Examples of unrestricted
queries look as follows:

q3 = (G2 ∧ G4 ∧ G1) ∨ ¬(G3)
q4 = G2 ∧ [G6 < 1.9]
q5 = (¬G1) ∧ ([G5 = Asia] ∧ G3) ∧ [1.9 ≤ G6 ≤ 7.6] ∧ ¬G4
q6 = [2000 ≤ G7] ∧ G1 ∧ [1200 ≤ G7 ≤ 8000]

Both queries mentioned before belong to this query language as well. The expressiveness
of queries without restrictions is maximal, but the queries can become more difficult to
interpret. For example, they can contain deeply nested structures, i.e. queries
involving numerous attributes in a complex arrangement. Even if the support of such a
query matches the support of another query very well (so that the redescription formed
by these queries is highly accurate), the interpretation of this redescription will be
obstructed by the many entangled conditions. As a consequence, the redescription loses
its interestingness. Moreover, the space formed by such redescriptions looks disordered
and becomes difficult to search. Here we observe a rich structure of queries and full
expressiveness, while nested structures make queries hard to interpret. Hence, a balance
between expressiveness and interpretability is the most desirable feature.
A compromise between these two languages is a linearly parsable query language, where
queries are formed with the help of a simple formal grammar. Moreover, to ensure ease of
interpretability, it is possible to apply some moderate restrictions, for example
allowing every attribute to appear only once.
In theory, the selection of a query language should be performed before adopting the
algorithm, but practical constraints very often influence the choice. That is, the
adopted algorithm might naturally result in a particular query language. For example,
linearly parsable queries are more natural for algorithms with iterative atomic
extensions which append a new literal to a query on each iteration [20].
In this thesis we exploit decision tree induction to mine redescriptions, which affects
the query language we use. We stick to data sets with Boolean predicates on the one side
and real-valued ones on the other. We avoid the usage of negations by flipping the
comparison: for a Boolean predicate, instead of ¬G1 we would have G1 < 0.5, meaning '0'
(i.e. 'false'), and G1 ≥ 0.5, meaning 'true' or '1'. But, if necessary, negations can be
used as well. Also, we allow both conjunctions and disjunctions to provide
expressiveness of the resulting queries, and there is no restriction forcing a predicate
to appear only once in a query.
2.4 Exploration Strategies
There exist several strategies for redescription mining, i.e. a few approaches to
finding redescriptions given a query language and a space of possible queries. Different
constraints on the redescriptions can be used as well, and the combination of these
parameters results in different search spaces. Some properties (such as
anti-monotonicity) make the redescription mining process more effective. There are three
main generic exploration strategies for redescription mining.
2.4.1 Mining and Pairing
This simple strategy includes two main steps. First, individual queries are mined from
the different data sets. Second, these queries are combined into pairs based on the
similarity of their supports. Thus, a redescription is formed from two similar queries
from different data sets. In recent years several authors have devised algorithms to
mine queries over a fixed set of propositional predicates [6, 11, 62].
This approach has traits which make it suitable for data sets with a small number of
views, because finding separate queries and pairing them later can be performed very
effectively. In contrast, when data sets contain an imposing number of views, this
exploration strategy results in queries over all predicates pooled together; when
pairing them, one additionally has to ensure that queries with similar supports use
disjoint sets of predicates.
This scheme is advantageous because it allows the adaptation of frequent itemset mining
algorithms for mining redescriptions. As an extension of this independent mining and
pairing, the second step can be replaced with a splitting procedure: all predicates are
pooled together for the first mining step and the queries are split afterwards depending
on views. Nevertheless, the fact that a query exists does not guarantee that it can be
split into several smaller ones.
When the data come from two different views, we can mine monotone conjunctive
redescriptions in a level-wise fashion, similar to the Apriori algorithm [6, 38]. The
supports of queries and their intersections can be used safely for pruning, since they
are anti-monotonic. Finally, this exploration strategy finds its best application when
an exhaustive search is feasible, i.e. when the sets are not big enough to cause an
undesirable computational burden.
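The pairing step of this strategy is simple to sketch. The fragment below is our own
toy illustration in Python (the function and label names are assumptions, and the
candidate queries are assumed to be mined already): it pairs queries from the two sides
whenever the Jaccard similarity of their supports exceeds a threshold.

```python
from itertools import product

def pair_queries(left_candidates, right_candidates, min_jaccard=0.8):
    """Pair pre-mined queries from the two sides by support similarity.

    Both arguments map a query label to its support (a set of entity ids).
    """
    redescriptions = []
    for (ql, sl), (qr, sr) in product(left_candidates.items(),
                                      right_candidates.items()):
        union = sl | sr
        jac = len(sl & sr) / len(union) if union else 1.0
        if jac >= min_jaccard:
            redescriptions.append((ql, qr, jac))
    return sorted(redescriptions, key=lambda t: -t[2])

# Toy example over entities 1..8.
left = {"qL1": {1, 2, 3, 4}, "qL2": {5, 6}}
right = {"qR1": {1, 2, 3}, "qR2": {5, 6, 7, 8}}
print(pair_queries(left, right, min_jaccard=0.5))
# [('qL1', 'qR1', 0.75), ('qL2', 'qR2', 0.5)]
```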
2.4.2 Greedy Atomic Updates
The next exploration strategy is based on an iterative search for the best atomic update
to the current query pair. That is, one tries to apply atomic operations to one of the
queries so that the resulting redescription becomes better. This process is continued
until no further improvement is possible.
Atomic updates are operations which include the addition, deletion and editing of
predicates; hence, a new predicate can be added, removed or changed (for example,
negated). In order to prevent the algorithm from cycling, it is possible to remember the
queries which have already been explored. As a starting point, a pair of perfectly
matching queries from distinct views can be selected. This approach was first proposed
by Gallo et al. [20] and used only addition operations to update the query. Later it was
extended to the non-Boolean setting with the ReReMi algorithm [18], which also handles
missing entries, a highly relevant aspect when working with real data.
2.4.3 Alternating Scheme
One more approach to building redescriptions is the alternating scheme. We use it as the
main exploration strategy in this thesis, because the algorithms we elaborate are based
on decision tree induction. The main idea behind this strategy is to find one query and
then find another one which matches it well. Then the first query is replaced with a new
one, which makes a better match. The alternations are continued until no better match
can be made or a stopping criterion is met.
For example, we start with a query qL^(0) from the left-hand side and search for a
well-matching query qR^(1) from the right-hand side. Now we proceed again to the
left-hand side and try to find another query qL^(2) that matches the one derived from
the right. The algorithm runs in this manner until termination.
If one side of the redescription is fixed, the task of finding an optimal query for the
other side can be defined as a binary classification task: entities that belong to the
support of the fixed query are positive examples, while the entities not in the support
are negative examples. Hence, the redescription mining task can potentially be solved
with the help of any feature-based classification technique compatible with the query
language.
Finding a proper starting point for the alternating scheme is a question of method
quality on its own. The simplest option is to randomly split the data into positive and
negative examples and use this partition for initialization, or to start with queries
which consist of only one predicate.
Having fixed the number of starting points and the number of allowed alternations, the
complexity of such an approach depends mainly on the complexity of the classification
algorithm chosen for the alternations. In this thesis we focus on the alternating scheme
for the redescription mining task, and as the classification algorithm we use decision
tree induction.
This idea is not new: it was first adopted by the CARTwheels algorithm [48], which is
able to process binary data sets and mines redescriptions by matching the terminal nodes
(leaf nodes) of trees into pairs of queries.
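To make the alternating scheme concrete, here is a minimal Python sketch of one
alternation loop using a decision tree classifier. It is an illustration under our own
simplifying assumptions (scikit-learn as a stand-in CART implementation, one tree per
side, and queries represented only by their support vectors); it is not the algorithm
elaborated later in the thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def jaccard(a, b):
    """Jaccard coefficient of two Boolean support vectors."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def alternate(D_left, D_right, init_support, max_iter=10, max_depth=2):
    """One run of the alternating scheme (sketch).

    D_left, D_right: attribute matrices over the same entities (one row each).
    init_support:    Boolean vector, support of the initial left-hand query.
    """
    supports = {"left": init_support.astype(bool), "right": None}
    target, best = supports["left"], 0.0
    for it in range(max_iter):
        side = "right" if it % 2 == 0 else "left"       # view to redescribe next
        D = D_right if side == "right" else D_left
        tree = DecisionTreeClassifier(criterion="gini", max_depth=max_depth)
        tree.fit(D, target)                             # fixed side gives class labels
        supports[side] = tree.predict(D).astype(bool)   # support of the new query
        acc = jaccard(supports["left"], supports["right"])
        if acc <= best:                                 # stop: accuracy no longer improves
            return supports, best
        best, target = acc, supports[side]              # new query becomes the next target
    return supports, best
```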
Chapter 3
Related research
3.1 Rule Discovery
The main feature inherent to redescription mining is the 'multi-view' aspect, i.e.
describing entities with the help of different sets of variables. Nevertheless, this
multi-view feature is not unique to redescription mining. One of the most common similar
approaches is supervised classification [57], even though it is not always perceived as
such: in classification, entities are characterized by the observations on the one hand
and by the class label on the other hand.
The idea of viewing the same objects from different angles was introduced by Yarowsky
[60], who initiated the aforementioned multi-view learning approaches. This was followed
by Blum and Mitchell [7] and evoked great interest in the topic.
Mining a single query can be treated as a classification task: when we fix one query, we
get binary class labels and look for a good classifier for them. A particular example
with Boolean attributes and targets is Logical Analysis of Data [8], whose purpose is to
find an optimal classifier of a pre-determined form (e.g. DNF, CNF, a Horn clause,
etc.).
Multi-label classification bears some resemblance to redescription mining as well [55].
Here classifiers are learned for conjunctions of labels. The restriction to conjunctions
and the focus on prediction (not description) are the main differences between this
approach and redescription mining.
Moreover, there are several more approaches that are somewhat similar to redescription
mining. Emerging Patterns [41] targets Boolean data and itemsets (monotone conjunctive
queries); it tries to detect those itemsets whose presence depends statistically on the
negative or positive label assignment of the objects. In the case of a perfect outcome,
the itemset resides solely in the positive examples and constitutes a perfect classifier
for the given data set.
One more approach that can be related to redescription mining is Contrast Set Mining
[41]. It can be used to detect monotone conjunctive queries which best discriminate some
distinct class from all other objects in the data. Subgroup Discovery [56] can also be
mentioned here; it aims to find a query such that the objects in the determined subgroup
possess atypical values for a target attribute compared to the other objects.
Taking everything into account, the main differences between redescription mining and
these approaches can be summarized as follows: the goal of redescription mining is to
find, simultaneously, multiple descriptions for a subset of entities that was not
determined beforehand, selecting only a few relevant variables from a large variety.
Moreover, we have a single redescription mining problem even though there are two sets
of describing attributes: queries constructed over one set of attributes determine
subgroups whose quality is measured by their ability to be described by queries over the
second set of attributes.
3.2 Decision Trees and Impurity Measures
Decision trees. Regardless of the domain where decision trees are used, their aim is to
use a given set of attributes to classify data into a set of predefined classes. First,
a training data set is used to let the tree learn about the specific data: the algorithm
splits the source set into several subsets based on an attribute value. This process is
repeated on each resulting subset in a recursive manner and is called recursive
partitioning. The recursion is considered complete when the objects which fall into the
same node carry the same class label, or when further splitting does not add value to
the predictions.
Second, test data sets are used to evaluate the accuracy of the built tree, to determine
whether it is able to classify data properly. By properly, we mean placing each object
into the correct class (i.e. minimizing the number of misclassifications). A decision
tree with multiple discrete class labels is called a classification tree. Tree-based
models have a variety of uses, from spam filtering [16] to astrophysics [28].
The concept of decision trees is not new; it was invented in 1966 by Hunt, Marin, and
Stone [13]. In this thesis we mainly concentrate on the classification-tree aspect,
because trees are used to derive redescriptions of the same (or nearly the same)
objects. For example, in the biological niche-finding problem we do not focus on
predicting the climatic conditions for a species; on the contrary, the idea is to find
specific information about mammals which already live in particular surroundings.
Decision trees were one of the earliest methods used to build classifiers [34]. They
have several advantages: they are easily interpretable by human experts, they provide
effective induction and accuracy, and they are comparatively easy to build. When using
decision trees, it is important to determine the algorithm used to actually build the
tree. This includes investigating the different splitting rules used, e.g. Information
Gain, Entropy or Gini [34, 10], because the quality of the result might depend strongly
on the choice of these parameters. There exist numerous implementations which are
scalable and effective [9]; some of them are more suitable for smaller data and vice
versa. Thus, the mechanism used to build a decision tree should be studied in detail in
order to provide strong support for redescription mining based on this approach.
In general, given a set of attributes, exponentially many decision trees can be
constructed from it, and the resulting trees will differ in their accuracy. Finding the
optimal tree is usually infeasible, since the search space is of exponential size.
Nevertheless, there are numerous efficient algorithms that produce decision trees of
reasonable quality within an acceptable time span. They mainly use a greedy strategy
that deepens the tree by making a succession of locally optimal decisions.
One of the best-known algorithms of this type is Hunt's algorithm [42]. It is used as
the basis of many common algorithms, e.g. ID3 [46], C4.5 [47], and CART [34].
Hunt's algorithm [13] grows a tree in a recursive fashion by partitioning the training
set into successively purer subsets. Any algorithm used for decision tree induction must
deal with two main aspects. The first is how to split the training set: on each
recursive step of growing the tree, the algorithm must split the training data into
smaller subsets. To this end, the algorithm must provide a method to specify the test
condition for attributes of diverse types; in addition, a way of measuring the goodness
of every test condition should be defined. These 'goodness measures' are commonly called
impurity measures and are discussed further below.
The second aspect is the stopping criterion. The easiest approach is to terminate tree
growing whenever all of the entries in a node belong to the same class (i.e. the node is
pure) or all entries have identical attribute values. These two conditions are enough to
terminate any algorithm which builds decision trees; however, earlier termination has
some advantages.
In this thesis we focus on the most famous algorithm for decision tree induction, called
CART [34].
Classification and Regression Trees (CART)
CART was first introduced by Breiman et al. [34]. CART was invented independently within
the same time span as ID3 [46]; both use a similar approach for learning a decision tree
from training tuples. CART is a non-parametric decision tree learning technique which
returns classification or regression trees.
CART is one of the most popular data mining techniques for classification purposes. It
revolutionized the field of advanced analytics and allowed data mining to move to a new
level [1]. It is a statistical approach that allows selecting, from a huge number of
explanatory variables, those which are most important for determining the response
variable to be explained. Decision trees partition (split) the data into mutually
exclusive nodes (groups); the nodes are supposed to be maximally pure. The building
process begins with a root node, which contains all objects. These are then split into
nodes by recursive binary splitting, where each split is determined by a simple rule
based on a single explanatory variable. The steps done in CART to grow a classifier can
be expressed as follows [34] (a short code sketch is given after the list):
1. All objects are assigned to a root node;
2. All possible splits over the explanatory variables and their attribute values
(splitting rules) are generated;
3. For each split from the previous stage, the objects of the parent node are separated
into two child nodes based on the value (lower or higher than the split value);
4. The variable and value from step 2 which give the highest reduction of impurity are
picked (impurity measures are discussed later in this section);
5. The split into two child nodes is conducted according to the selected splitting rule;
6. Steps 2-5 are repeated, treating all child nodes as parents, until the tree has
maximal size;
7. The tree is pruned with the help of cross-validation [31] to return a tree of optimal
size. The pruning algorithm attempts to correct the optimistic estimate of the empirical
risk by adding a complexity term which penalizes bigger sub-trees. In cross-validation,
some objects are randomly removed from the data and then used to assess the predictive
power of the tree.
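As a minimal illustration of steps 1-6, the following Python sketch grows a CART-style
classification tree on synthetic data. The thesis implementation relies on the R
packages rpart and rattle; scikit-learn is used here only as an analogous, readily
available stand-in, and the data and feature names are our own assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: 200 objects, 4 real-valued explanatory variables; the class
# label depends on two of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = ((X[:, 0] > 0) & (X[:, 2] < 0.5)).astype(int)

# Steps 1-6: grow the tree to (near) maximal size, choosing at every node the
# splitting rule with the largest reduction of the Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=1)
tree.fit(X, y)
print(export_text(tree, feature_names=["x0", "x1", "x2", "x3"]))
```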
One common idea is to stop building the tree early (early termination), but this can
result in insufficient coverage of the interactions between explanatory variables [51].
That is why CART chooses to let the tree grow to its maximal size. In these maximal
trees all leaf nodes are either small (they contain a single object or a predetermined
number of objects) or pure (no further split is needed). Such a tree is overfitted: it
fits not only the data but also the noise and idiosyncrasies of the training set.
Hence, the next steps are dedicated to pruning. The branches whose removal leads to the
smallest decrease in accuracy, compared to pruning the other branches, are pruned first.
For each sub-tree T, the cost-complexity measure Rα(T) is defined as [44]:

Rα(T) = R(T) + α|T|

Here |T| is the number of terminal nodes (the complexity of the sub-tree), α is the
complexity parameter, and R(T) is the overall misclassification rate for classification
trees or the total residual sum of squares for regression trees. Every value of α has a
corresponding unique smallest tree that minimizes the cost-complexity measure. As the
complexity parameter increases from 0, this yields a nested sequence of trees of
decreasing size [44].
Each tree in this sequence is the best tree of its size, so selecting the best tree can
be turned into the problem of selecting the best size. Cross-validation determines this
optimal size. The data set is randomly divided into N subsets (commonly N = 10). One of
these subsets is used as a test set, while the other N − 1 subsets are grouped together
and used as the learning data set. The tree is grown and pruned N times, each time with
a different subset playing the role of the test set.
A prediction error (the sum of squared differences between predictions and observations)
is calculated for every size of the decision tree. It is then averaged over all subsets
and matched with the sub-trees of the complete data set using the values of α. The
optimal-sized tree is the one with the lowest cost-complexity measure [31].
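A sketch of step 7, cost-complexity pruning with a cross-validated choice of the
complexity parameter α, is shown below. Again, scikit-learn is used only as a stand-in
for the R implementation (rpart) used in the thesis, and the synthetic data is the same
as in the previous sketch.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = ((X[:, 0] > 0) & (X[:, 2] < 0.5)).astype(int)

# Fully grown tree and its cost-complexity pruning path: each alpha on the
# path corresponds to one nested sub-tree minimizing R_alpha(T) = R(T) + alpha*|T|.
full_tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Select the alpha whose pruned tree has the best 10-fold cross-validated
# accuracy (N = 10 as mentioned above), then refit at that alpha.
scores = [cross_val_score(DecisionTreeClassifier(criterion="gini", ccp_alpha=a),
                          X, y, cv=10).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())
```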
CART assumes that the samples are independent when computing the classification rules
[36]. Models produced by CART have positive features: the input data does not have to
follow a normal distribution, and the predictor variables do not have to be independent.
It is possible to model non-linear relations between the predictor variables and the
observed data. CART also enables evaluation of the importance of the diverse explanatory
variables used to define a splitting rule and splitting value; the technique used for
this is the 'variable ranking method' [45]. Variables which do not show up in the
resulting tree can be considered less important for describing the data set.
CART has numerous implementations that undergo continuous changes, extensions and
improvements, and new modules are written to make it more convenient or specific to
distinct domains. In the implementations of our algorithms for redescription mining we
use the available packages in R, namely rpart [52] and rattle [59].
Impurity measures. The main aspect of decision tree building is deciding how to split
the data set. The 'goodness' of a split is evaluated by impurity measures, which are
functions assessing how well a particular split separates the data into classes. The
impurity measure is the objective to be minimized at each intermediate stage of building
a decision tree. In general, an impurity measure should satisfy:
1. It should be largest when the data is split evenly over the attribute values;
2. It should be 0 when all data belongs to the same class.
Quinlan's information measure (Entropy). Originally, Quinlan proposed to measure this
'goodness' based on a classic formula from information theory:

Entropy = −Σi pi log(pi)

with pi the probability of the i-th message. Thus, the outcome depends entirely on the
probabilities of the possible messages. If the probabilities are equal, there is the
greatest amount of uncertainty, and the gained information will be the greatest.
Consequently, if they are not uniform, less information will be gained. The value of
this objective function also depends on the number of messages. The entropy of a pure
node is zero, because then the probability becomes 1 and log(1) = 0; vice versa, entropy
is maximal when all classes have equal probability of appearing.
Information gain. One of the most common impurity measures used while building decision
trees in various implementations is Information Gain (IG), which is in fact a difference
in entropy (i.e. it also involves computing the entropy of the nodes).
Information Gain, popularized by Quinlan in [46], is the expected reduction in entropy
caused by partitioning the objects according to a given attribute. Let C be a set which
has p objects of one class (P) and n objects of another class (N). If the decision tree
is accurate, it is supposed to classify the objects in the same proportion as they are
present in C.
As the root of the decision tree an attribute A (which takes the values
{A1, A2, ..., Av}) is picked, so that it partitions C into {C1, C2, ..., Cv}. If Ci
contains pi objects of class P and ni objects of class N, and the expected information
required for the sub-tree of Ci is denoted as I(pi, ni), then the expected information
needed for the tree with A as root is defined as the weighted average:

E(A) = Σi=1..v ((pi + ni) / (p + n)) · I(pi, ni)

The information gained by using A as the root is defined as follows:

IG(A) = I(p, n) − E(A)
Whenever Information Gain is used as the impurity measure in decision tree algorithms,
all candidate attributes are investigated and the one which maximizes the information
gain is chosen. The process is then continued with the residual subsets
{C1, C2, ..., Cv}.
Classification Error. The classification error can also be used as an impurity measure.
It likewise determines the 'goodness' of a split by looking at the entries which go to
the child nodes. It measures the misclassification error made by a node and, for a node
t, looks as follows:

Error(t) = 1 − maxi P(i|t)

Thus, the classification error is maximal (1 − 1/N for N classes) when all entries are
evenly distributed across the classes; this means we gain the least interesting
information. The classification error becomes minimal (Error = 0) when all entries
represent the same class.
The Gini index (Gini). A measure very similar to Quinlan's impurity measure was
presented by Breiman [34] and is called the Gini index.
Gini measures how often a randomly chosen element from the data would be erroneously
labeled if it were labeled randomly according to the distribution of labels in the
subset. The Gini impurity is computed as the sum, over the items, of the probability of
each item being chosen multiplied by the probability of a mistake in categorizing this
item. Thus, it equals zero when all entries of a node belong to the same class.
Formally, Gini looks as follows:

Gini = 1 − Σi pi²

with pi the probability of each class appearing. In case we have a pure class in a node,
the probability becomes 1 (Gini = 1 − 1² = 0). Similarly to entropy, Gini becomes
maximal when all classes carry equal probability. Originally, Gini measures the
probability of misclassification of a set of objects, rather than the impurity of a
split.
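The impurity measures above are simple to compute from class-probability vectors. The
following Python fragment is a small illustration of our own (the function names are not
from the thesis).

```python
import math

def entropy(p):
    """Entropy = -sum_i p_i * log2(p_i); zero for a pure node."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def classification_error(p):
    """Error = 1 - max_i p_i; zero for a pure node, 1 - 1/N for a uniform one."""
    return 1 - max(p)

def gini(p):
    """Gini = 1 - sum_i p_i^2; zero for a pure node."""
    return 1 - sum(pi * pi for pi in p)

def information_gain(parent_counts, children_counts):
    """IG = I(parent) - weighted average of the children's entropies."""
    total = sum(parent_counts)
    probs = lambda counts: [c / sum(counts) for c in counts]
    weighted = sum(sum(c) / total * entropy(probs(c)) for c in children_counts)
    return entropy(probs(parent_counts)) - weighted

# A node with 5 positive and 5 negative objects, split into (4, 1) and (1, 4).
print(entropy([0.5, 0.5]), gini([0.5, 0.5]), classification_error([0.5, 0.5]))
# 1.0 0.5 0.5
print(information_gain([5, 5], [[4, 1], [1, 4]]))   # ~0.278
```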
The Gini index together with Information Gain are the most commonly used measures in
classifiers built with the help of decision trees. However, the Gini index behaves
somewhat differently on the data. As mentioned before, information gain tries to split
the data into distinct classes, whereas Gini looks for the largest class and extracts it
first; then, in the residual data, it looks for the next attribute which helps to
extract the next largest class. This continues until the final tree is built.
If the data is such that the split into classes is quite clear, the tree will end up
with pure nodes (i.e. leaf nodes that contain objects of only one class). In practice,
pure decision trees are attainable only in very rare circumstances. In our algorithms
for redescription mining we use Information Gain and the Gini index as impurity
measures. We discuss their effect on different real-world data sets in Subsection 5.3.1.
3.3 Redescription Mining Algorithms
Since redescription mining was first introduced by Ramakrishnan et al. [48], there have been several other contributions to the topic. In particular, both Kumar [33] and Ramakrishnan et al. [48] worked on redescription mining using decision trees. They introduced an alternating approach to growing decision trees, which are later used to derive redescriptions.
Besides decision trees, redescription mining was presented through such ideas as Karnaugh maps in [63], where it is depicted as a simple game with only two rules. A pair of identical maps carries the variables on its sides, and the blocks inside the maps represent intersections of these variables. Blocks can be uncolored (if there is no intersection of the corresponding variables in the data set) or colored, if they are non-empty. The rules are: a colored cell can be removed as long as it is removed from both maps, and an uncolored cell can be removed from either (or both) maps. If one thinks of objects as transactions and descriptors as items, then a colored cell in the Karnaugh map corresponds to a closed item set from the association mining literature [61].
Redescription mining algorithms based on frequent itemsets are covered in [20]. Here redescription mining is considered as the task of finding subgroups having several descriptions. The authors offer algorithms, based on heuristic methods, that produce pairs of formulae that are almost equivalent on the given data set. The methods also use different pruning strategies to avoid useless paths in the search space.
Another algorithm for redescription mining, based on co-clusters, was presented in [43]. The redescription mining task is viewed there as a form of conceptual clustering, where the goal is to identify clusters that afford dual characterizations, i.e. the mined clusters are required to have two meaningful descriptions.
The greedy search algorithm used in [20] to mine redescriptions was extended with efficient on-the-fly discretization by the ReReMi algorithm, introduced in [18]. Here the algorithm defines initial pairs of variables from a data set and updates them until no further improvement can be made. Updates can include addition, deletion and editing of predicates.
Since in this thesis we use decision tree induction to elaborate algorithms for redescription mining over non-binary data sets, the existing approach which exploits the same idea needs to be covered. Previously, CART was incorporated into the CARTwheels algorithm [48] to mine redescriptions.
CARTwheels. The main contribution to making redescription mining a relevant research topic was made by Ramakrishnan et al. in 2004 [48]. There the CARTwheels algorithm was presented, which derives redescriptions with the help of decision trees grown in opposite directions and then matched at their leaves. Further, the authors of [43] explicitly showed applications for redescriptions and structured the formalism of the topic.
Moreover, Kumar in his PhD thesis [33] exploits decision trees as well to characterize sets of genes. He also extended CARTwheels by presenting a theoretical framework which allowed systematic exploration of the redescription space, mining of redescriptions across domains, etc.
CARTwheels was the first algorithm for redescription mining which involves decision tree induction [48]. It uses two binary data sets to grow two decision trees which are eventually matched at their leaves. When they are matched, the paths which lead to matching leaf nodes can be written as queries to form redescriptions.
In the original paper [48] the authors use a data set consisting of two matrices, where they assign class labels based on a greedy set covering of the objects using entities of the left-hand side. The decision conditions from one tree are combined with the corresponding decision conditions from the second tree. Hence, the paths which lead to the same class can be treated as queries to form a redescription. The algorithm returns as many redescriptions as there are matching paths in a pair of resulting trees.
This approach selects paths in the grown trees and combines the splitting rules of the corresponding terminal nodes via Boolean operators. Negations are also involved whenever a path belongs to the 'no' side of the decision tree.
As a visual example, a pair of resulting trees which are used to form redescriptions is depicted in Figure 3.1.
Figure 3.1: Tree growing with alternations by CARTwheels. (left) The top tree defines set-theoretic expressions to be matched. (middle) The bottom tree is grown to match the first one. (right) The bottom tree is fixed and the top tree is re-grown again to match leaves. Arrows represent matching paths which form redescriptions. (Following [48])
Fig. 3.1 shows three frames of the tree-growing process. The right-most frame depicts the final version of the trees that form redescriptions. The matching paths can be written as the following redescriptions:
(X3 ∩ X1 ) ∪ (X4 − X3 ) ←→ Y4
(O − X3 − X4 ) ←→ (Y3 − Y4 )
(X3 − X1 ) ←→ (O − Y3 − Y4 )
These alternations can be continued until the leaves match well enough or until the maximal number of unsuccessful alternations is reached. However, it is important to notice that in this approach the authors fix the depth as a constant (in this example d = 2) and re-grow trees of the same depth in each iteration.
The CARTwheels algorithm uses the duality between path partition and class partition. Thus, the crucial issue here is to combine paths into redescriptions only when they lead to the same class label. Further evaluation of this partition determines the quality of the result. CARTwheels results in a single pair of trees, which are re-grown with the same depth, previously fixed by the user, and cover the whole data set.
In this thesis, inspired by the idea of using decision trees for redescription mining, we elaborate two algorithms which grow decision trees to match at their leaves while gradually increasing the depth. We also enable them, firstly, to work with real-valued data on one side and, secondly, to process real-valued attributes on both sides, using the data discretization routine described in Section 4.6.
Chapter 4
Contributions
4.1 Redescription Mining Over non-Binary Data Sets
Redescription mining techniques based on decision tree induction were previously able to handle solely Boolean data and could not handle other cases without data pre-processing. However, techniques based on other redescription mining approaches, for example the one presented by Galbrun et al. [18], are able to handle numerical and categorical data directly.
In this thesis we extend redescription mining techniques which exploit decision tree induction to the non-Binary setting, apply them to real-world data, test their ability to find planted redescriptions and compare them with existing redescription mining techniques.
In particular, we work with two methods which both have decision tree induction as a basis. As a result we expect our algorithms to return interesting and informative redescriptions which are useful in a particular domain or can assist in solving an existing problem. Besides, these methods can be applied and tested in other domains where redescription mining might be useful; for example, a good choice is bioinformatics. As long as the data sets have the required form, our approaches can be exploited in any domain.
Very often domain knowledge is essential to make conclusions regarding the outcomes. For example, one possible domain is the biological niche-finding problem. Here we are looking for rules which determine in detail the specific conditions for a species. It is comparatively easy for a layman to assess the quality of a redescription mined in such a domain. For example, if we get a rule which says that a Polar Bear lives in places where the average January temperature is below 2 degrees Celsius, this statement is quite understandable even for a person without profound knowledge in biology.
Nevertheless, the user might encounter more specific cases where background knowledge of the domain becomes crucial. Also, the configuration of parameters, which very often is a key to success in data mining, might require some consideration of the data and the domain we work with. Redescriptions are aimed at bringing new interesting insight into data. Thus, it is crucial for the method to deliver not only intuitively expected rules, but also to reveal some specific traits which assist in the niche-finding problem or any other one.
We introduce two algorithms for redescription mining over non-binary data sets. Both
of them involve decision tree induction. In particular, we grow trees in opposite directions to match in the end by gradually increasing the depth.
As an input, a data set (O; A; v) consisting of two matrices is used. One side contains binary attributes, the other side is composed of real-valued attributes. Not all real-world data sets meet these requirements, thus in Section 4.6 we discuss a way to overcome this restriction.
The target vector needed for the first step is formed from the binary data set; further target vectors are formed based on the previous split result. Thus, every next iteration is adjusted based on the previous one. In the end we get pairs of decision trees grown in parallel to match at their leaves.
Then queries are derived from the resulting trees for further analysis. As the accuracy of a redescription we use Jaccard's coefficient, chosen due to its computational simplicity and its ability to provide a reasonable assessment of the similarity of the two queries that form a redescription. Statistical significance of the result is determined with the help of a p-value computation, since we want the results not only to be informative, but also to carry statistically meaningful information.
4.2 Algorithm 1
Algorithm 1 extends redescription mining based on decision tree induction to the non-Boolean world. As already mentioned, the starting point of an algorithm with an alternation scheme is an important aspect to be defined. The algorithm expects two data arrays (e.g. matrices) as input. The left matrix (L) is expected to contain Boolean data, the right matrix (R) contains numerical data.
To initialize tree induction, the algorithm needs a data set with a target vector (the vector based on which the tree is built). A target vector consists of the entries of one column of the left matrix; namely, each column from the left-hand side is used as a target vector for one run of Algorithm 1. Here we initiate tree induction (CART) on the right data set and build a tree of depth 1. Thus, we have a short classifier which uses some attribute from the right-hand side matrix as a splitting rule. Further, we form a new target vector based on the first split. After dividing the data we get two child nodes with class labels which correspond to the majority class in them, in our case 0 and 1.
Having that, we proceed to grow the second tree to match the first one. To do so, the new target vector is formed based on the right-hand side split, and the algorithm is run on the left side with depth 2. This process of forming new targets and building deeper trees is continued until one of the stopping criteria is met.
Algorithm outline. Figure 4.1 represents the steps undertaken by the first algorithm. As the initialization stage, we use a target vector from the left (binary) matrix and perform a split of depth 1 on the right matrix, which possibly contains real-valued data.
Figure 4.1 depicts trees (left and right) with maximal depth 2. The enumeration of the nodes is used in the following manner: every parent node is marked as n, its left child is enumerated as 2n and its right child as 2n+1. This holds for both trees and both algorithms.
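For illustration, the children of a node n under this numbering can be obtained as follows (a trivial R helper written only for this example):

children <- function(n) c(left = 2 * n, right = 2 * n + 1)
children(1)   # the root's children are nodes 2 and 3
children(3)   # node 3 has children 6 and 7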
The first frame (d=1) depicts the initial split of both data arrays with depth 1, where the split of the right array is made with a target vector from the left (an arrow). Further, the algorithm forms a target vector based on the right split and proceeds to split the initial left matrix (but with the newly modified target vector) with depth 2. Thus, every time the tree is re-grown from scratch by the CART algorithm using a target vector, which in turn is formed based on the previous split result (i.e. class labels are assigned to objects depending on the leaf nodes they fall into). New targets are formed and the depth is increased until termination.
As a result we get a pair of trees: the left tree classifies binary data, the right tree classifies real-valued data. For instance, if we work with the biological niche-finding problem, one includes attributes from the animal data and the other consists of climatic data.
At each terminal node the algorithm picks a splitting parameter and a splitting value (together called a splitting rule) which maximize the purity of the resulting nodes (i.e. the purity measure). The actual impurity function used for this does not play a crucial role here. The splitting rules at the terminal nodes of both trees will further be used to build redescriptions.
Figure 4.1: Tree-growing process in Algorithm 1
Algorithmic framework of Algorithm 1. Table 1 describes Algorithm 1's algorithmic framework in detail. Firstly, the data set suitable for CART induction is formed. construct_tree creates a decision tree with the provided parameters using one part of the data set, either left or right. These parameters are the target vector, formed based on the previous split result, and the min_bucket parameter, which is responsible for the minimal size of the tree nodes.
min_bucket is an important tunable parameter which controls a trade-off between redundancy and interpretability. We pay attention to this parameter since it prevents overfitting and helps to terminate the tree induction earlier. This makes the resulting queries less massive and more interpretable. In particular, in problems related to biological niche finding the user might be interested in nodes which include the majority of a particular species, because for a reasonable redescription we expect the majority of the animals to share similar living conditions. If set too high, it might not give any insight for animals rare in Europe such as Polar Bear or Moose. In other cases the min_bucket parameter is also crucial: it helps to adjust CART to split a data set in such a way that every node contains at least a defined number of entities.
The function construct_target_vector forms a vector based on the result of the previous split, to be given to the next split of the data. In the end, the list of redescriptions is formed and each of them is evaluated by Jaccard's coefficient.
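A minimal R sketch of how these two helpers could be realized on top of rpart is given below; the function names and exact arguments are ours and serve only to make the framework concrete:

library(rpart)
construct_tree <- function(data, tv, depth, min_bucket, split = "gini") {
  # fit a CART classifier of the given maximal depth on one side of the data;
  # split = "information" would select Information Gain instead of Gini
  rpart(tv ~ ., data = data.frame(tv = factor(tv), data), method = "class",
        parms = list(split = split),
        control = rpart.control(maxdepth = depth, minbucket = min_bucket, cp = 0))
}
construct_target_vector <- function(tree) {
  # the class of the leaf each object falls into becomes the next target vector
  as.integer(as.character(predict(tree, type = "class")))
}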
Algorithm 1: Algorithmic framework
Data: Descriptor sets {L_i}, {R_i}
Result: redescriptions Rd, Θ - Jaccard's coefficients
Parameters:
    d - maximal depth of the trees
    min_bucket - minimal number of entries in a node
Initialization:
    Set answer set Rd = {}
    Set Jaccard's coefficients set Θ = {}
    Set left matrix L = {L_i}, right matrix R = {R_i}
Alternations:
foreach column i in L do
    Set paths_tl = {}, paths_tr = {}
    Set target vector tv = construct_target_vector(L_i)
    Set tree tr = construct_tree(R, tv, max_depth = 1, min_bucket)
    tv = construct_target_vector(tr)
    if all entries in tv are of the same class then
        Rd_i = NULL; Θ_i = NULL; flag = false
    else
        flag = true; depth = 2
    end
    while flag do
        if depth ≤ d then
            tl = construct_tree(L, tv, depth, min_bucket)
            tv = construct_target_vector(tl)
            tr = construct_tree(R, tv, depth, min_bucket)
            tv = construct_target_vector(tr)
            if tl_current ≠ tl_previous and tr_current ≠ tr_previous then
                depth = depth + 1
            else
                flag = false
            end
        else
            flag = false
        end
    end
    Rd_i = paths_tl + '←→' + paths_tr
    Θ_i = Jaccard(tl, tr)
end
4.3 Algorithm 2
Configuration of Algorithm 2.
Algorithm 2 is also based on alternating decision tree induction and starts with the same initialization. It has some specific features: instead of increasing the depth with every step and re-building the trees from scratch, Algorithm 2 keeps the trees and makes them deeper on every iteration. Every new depth of a tree is built on either the left or the right matrix using the target vectors from the right or the left tree, respectively. This procedure is continued until a stopping criterion is met.
Thus, we start again with an initial target vector taken from the left-hand side data and build a decision tree of depth 1 using the CART algorithm, which picks a splitting rule that maximizes the purity of the nodes. After this, a new target is formed based on the class label assignment from the first split. The left-hand side data is then split with depth 1 using the formed target. The trees are grown in a level-wise fashion until a stopping criterion is met.
As a result we get two tree structures (the left one classifies the binary matrix, the right one the real-valued matrix) where each new depth is built based on the previous split result from the other tree. The final trees are used to extract and evaluate redescriptions.
Outline of Algorithm 2. Figure 4.2 depicts a sequence of frames representing the steps undertaken by the algorithm. Firstly, an initial split is performed with Target vector 1 on the right matrix. The split is performed with the CART algorithm; the features used (impurity measure, node size) are to be set based on preferences.
Further, the algorithm proceeds to a split on the left matrix with Target vector 2, which is formed from the previous split (on the right part). This process continues until no further split can be performed by the CART algorithm (i.e. no split results in purer nodes) or any other termination criterion is met. Hence, each branch of the trees receives a target vector formed after the split at the previous depth.
As an outcome, we get two tree-like structures. In practice these structures are collections of decision trees built with the CART algorithm (several decision trees of depth 1) which can be assembled into final trees. At the end, we move to query extraction from them: each tree in a pair returns one query. The extent of their correspondence is evaluated via Jaccard's coefficient, which is computed between two resulting vectors formed based on the final trees (Figure 4.3). Finally, the two queries mined from the final trees form a redescription. All redescriptions mined from the given data set are to be analyzed further.
Figure 4.2: Tree-growing process in Algorithm 2
Algorithmic framework of Algorithm 2. Table 2 describes Algorithm 2's algorithmic framework in detail. As previously, we form a data set consisting of two views and construct decision trees in a step-wise manner. The maximal depth is a parameter defined by the user.
After the whole data set is processed, the redescription set is formed and returned.
Each of the redescriptions is to be evaluated for future interpretation.
Algorithm 2: Algorithmic framework
Data: Descriptor sets {L_i}, {R_i}
Result: redescriptions Rd, Θ - Jaccard's coefficients
Parameters:
    d - maximal depth
    min_bucket - minimal number of entries in a node
Initialization:
    Set answer set Rd = {}
    Set Jaccard's coefficients set Θ = {}
    Set left matrix L = {L_i}, right matrix R = {R_i}
Alternations:
foreach column i in L do
    Set paths_tl = {}, paths_tr = {}
    Set tl = {}, tr = {}
    Set target vector tv = construct_target_vector(L_i)
    flag = true; count = 0
    tr = construct_tree(R, tv, max_depth = 1, min_bucket)
    while (count ≤ depth(tl)) and (count ≤ depth(tr)) and (count ≤ d) do
        foreach leaf in the tree tr do
            tv_leaf = construct_target_vector(tr, leaf)
            tl.add(construct_tree(L_leaf, tv_leaf, max_depth = 1, min_bucket))
        end
        foreach leaf in the tree tl do
            tv_leaf = construct_target_vector(tl, leaf)
            tr.add(construct_tree(R_leaf, tv_leaf, max_depth = 1, min_bucket))
        end
        count = count + 1
    end
    Rd_i = paths_tl + '←→' + paths_tr
    Θ_i = Jaccard(tl, tr)
end
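To make the per-leaf step concrete, the following R sketch (toy data, hypothetical variable names) shows how, after a depth-1 split, the objects can be grouped by the leaf they fall into; Algorithm 2 then grows a separate depth-1 tree on each group using the target induced by the other tree:

library(rpart)
set.seed(7)
R  <- data.frame(x1 = runif(40), x2 = runif(40))   # toy right-hand side data
tv <- factor(rbinom(40, 1, 0.5))                   # toy target vector
fit <- rpart(tv ~ ., data = data.frame(tv = tv, R), method = "class",
             control = rpart.control(maxdepth = 1, minbucket = 5, cp = 0))
leaf_of <- fit$where                           # leaf id of every object in the fitted tree
groups  <- split(seq_len(nrow(R)), leaf_of)    # object indices grouped by leaf
# each group would now receive its own depth-1 split on the other matrix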
4.4 Stopping Criterion
While building decision trees a very common issue is over-fitting, when trees grow too large and tend to make the whole redescription mining process ineffective. In the end we might get huge trees, and the redescriptions derived from them will be massive and will contain a big variety of variables. Interpretation of such long queries is difficult and not desirable. We adopted several mechanisms to support early termination of the process.
Firstly, the user is able to determine the maximal depth of the resulting trees. This enables the algorithm to be tailored to many domains. The given flexibility allows building different trees so that the user can compare the results and discover a suitable depth parameter. However, in practical experiments users rarely need trees deeper than maximal depth 3, since redescriptions derived from deeper trees would be difficult to interpret.
Secondly, min_bucket is the parameter responsible for the minimal number of entries in a node. Limiting this parameter is very useful: usually, the lower the minimal number of entries in a node, the bigger the returned tree will be. The data set we work with should be taken into consideration when setting this parameter. Thus, we suggest using a small min_bucket at the beginning and gradually increasing it until the resulting redescriptions have an optimal size, depending on the data set and the problem being solved.
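Since the individual trees are grown with R's rpart (whose impurity implementations we mention in Subsection 5.3.1), these two limits correspond directly to rpart's control options; a small illustrative sketch, with example values only:

library(rpart)
ctrl <- rpart.control(maxdepth  = 3,   # user-defined maximal depth of the trees
                      minbucket = 50,  # minimal number of entries in a leaf node
                      cp = 0)          # let depth and node size, not cp pruning, stop the growth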
Moreover, a logical choice is to stop the tree-building process when further splitting will not provide any changes. Thus, we check whether the performed split (at the next depth) has resulted in a reorganization of the data across the nodes compared with the previous result. If yes, we continue to split the data until no change occurs. Both our algorithms contain this check as a built-in feature. However, in practice this stopping criterion sometimes produces quite deep trees (i.e. it does not prevent overfitting entirely).
The choice of the impurity measure used within the algorithm is not essential. In our experiments with real-world data sets (discussed in Section 5.2) the Gini index and Information Gain were used. Hence, the user is able to pick the one which is more suitable or try all of them to select the most prolific. As an outcome we get the final trees, namely a set of tree pairs which are used to derive redescriptions.
4.5 Extracting Redescriptions
We use trees to mine one-dimensional rules and combine them with each other to form a redescription. Figure 4.3 exemplifies a pair of trees derived after the algorithms are run. To extract a redescription from it we use the splitting parameters (with their splitting values) and the Boolean operators 'OR' and 'AND'.
To combine the paths of any of these trees into a query, we join the splitting rules within one path via the 'AND' operator, and the paths are joined with each other via the 'OR' operator. Labels which correspond to the 'yes' and 'no' assignments are also taken into consideration by flipping the sign. For instance, a query corresponding to the left-most given tree would look like (1 ∧ 2 < 0.5) ∨ (1 < 0.5 ∧ 3 < 0.5), or, if we use negations, (1 ∧ ¬2) ∨ (¬1 ∧ ¬3).
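As a toy illustration of this joining rule in R (the attribute names and thresholds below are made up and do not come from real data):

path_to_clause <- function(rules) paste0("(", paste(rules, collapse = " AND "), ")")
paths <- list(c("x1 >= 0.5", "x2 < 0.5"),   # splitting rules along one path
              c("x1 < 0.5",  "x3 < 0.5"))   # rules along another matching path
paste(sapply(paths, path_to_clause), collapse = " OR ")
# "(x1 >= 0.5 AND x2 < 0.5) OR (x1 < 0.5 AND x3 < 0.5)"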
Figure 4.3 also depicts an example of how we assess mined redescriptions with Jaccard's coefficient derived from two trees grown by either of the two presented algorithms. In fact, having one side with Boolean data and the other with real-valued data, after processing with Algorithm 1 or 2 we get nodes which belong to class '0' or '1'. The leaf nodes of the resulting trees are then grouped into two binary vectors (left and right) and Jaccard's coefficient is computed.
Figure 4.3: Redescription extraction and evaluation
Jaccard's coefficient is equal to 1 when we have a perfect match. In theory, we should be interested only in such redescriptions; in practice, however, redescriptions with a lower Jaccard coefficient are also of interest. In addition, the support of the two queries (i.e. E1,1, the entries where both queries hold) is especially important, since we do not want to get a redescription which covers all objects: this would mean that the redescription does not provide any interesting insight. And vice versa, if the support is really low, the redescription holds for almost no entries of the data set.
4.6 Extending to Fully non-Boolean Setting
So far we have been considering data sets which contain one binary and one real-valued matrix. However, this setting poses quite an imposing constraint when solving real-world redescription mining problems, because many domains produce real-valued data. Thus, data discretization is to be performed.
This issue has been studied in the context of Association Rule Discovery by Srikant and Agrawal in [50]. Their methods are based on a priori bucketing, but they are very specific to association rule discovery, making them inappropriate for redescription mining. Thus, we adopt a discretization routine for the real-valued side of our data set based on clustering. Further, this binarized matrix is used for the initialization of both of our algorithms. Having used each of its columns as a target vector for the very first split of the data set, we can afterwards use the initial (before binarization) left-hand side matrix, because CART requires only the target vectors to be binary.
4.6.1 Data Discretization
It is possible to apply Algorithms 1 and 2 to a fully non-Boolean data set as well. To enable them to work with real-valued data on both sides, we apply a binarization routine to one of the sides. This routine can be considered as a pre-processing step which prepares the data set to look exactly the way the algorithms expect it to be.
This kind of pre-processing is applied to real-valued matrices before the algorithms are run. To implement it we use three available clustering techniques; however, the list is not limited to only those three.
A good example of data that can be transformed from real-valued to binary is the DBLP data [2], which contains information about a computer science bibliography (more details in Section 5.4). Here the left matrix corresponds to the conferences and the number of papers published by each author in them; for example, author N has submitted 4 papers to the FOCS conference. The right-hand side matrix contains the same authors and the number of co-authorships between them. Thus, it describes how often each author co-worked on a paper with another author. The left-hand side matrix can be transformed into a binary one with the help of the clustering techniques covered in Section 5.4 to allow the application of the elaborated methods for redescription mining.
Regardless of the clustering method used, the binarization routine is conducted with the following steps.
1. Select the first column as the initial point;
2. Perform clustering of the values of this column using one of the available clustering techniques;
3. Split the taken column into several columns based on the clustering result (the initial attribute values are split into several intervals);
4. Assign to the new attributes the values '0' or '1', according to the initial values;
5. Repeat the procedure until all columns of the initial matrix are split into several intervals and filled with '0' or '1'.
The algorithmic framework for data discretization is given in Table 3.
Algorithm 3: Algorithmic framework for data discretization
Data: Real-valued descriptor set {L} of size i × j
Result: Boolean descriptor set {L_new} of size i × (n · j)
Parameters:
    Cluster - function to perform clustering, one of {DBSCAN, hclust, k-means}
    parms - parameters for the selected clustering method
    n - number of clusters
    Range_cluster - range of values which fall into a given cluster
Algorithm:
Set {L_new} = {}
foreach column L_j in {L} do
    Cluster_parms(L_j) into n clusters
    Split L_j into n columns according to Range_cluster
    foreach entry L_{i,j} do
        if value L_{i,j} ∈ Range_cluster then
            set L_new_{i,j} = 1
        else
            set L_new_{i,j} = 0
        end
    end
end
Return {L_new}
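As an illustration, a minimal R sketch of this routine for a single column, using k-means as the clustering technique (the choice of k-means and the number of clusters are assumptions made for the example only):

binarize_column <- function(x, n_clusters = 3) {
  cl <- kmeans(x, centers = n_clusters)$cluster               # cluster id of every value
  sapply(sort(unique(cl)), function(k) as.integer(cl == k))   # one 0/1 column per cluster
}
set.seed(1)
x <- c(rnorm(50, 0), rnorm(50, 5), rnorm(50, 10))   # toy real-valued attribute
head(binarize_column(x))                            # indicator columns replacing x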
As a result we get a binarized matrix with an increased number of columns. This matrix can easily be used with both methods to find redescriptions. The parameters used within the clustering routine are mostly determined by the user and the data at hand. Some traits and peculiarities are discussed in Section 5.4 on the real-world data experiments.
4.7 Quality of Redescriptions
The quality of a redescription is a rather abstract term. It is a compound of several characteristics which we try to evaluate with objective criteria. For example, a good redescription can be one which is easy to interpret, has reasonable support, and is statistically significant.
4.7.1 Support and Accuracy
One of the most defining features of a redescription is its support (E1,1). There are no strict bounds on the support which make a redescription good or bad; this depends on the data set we work with. Intuitively, we are not interested in redescriptions which are supported by either one row or by almost all rows of the data set. It might be desirable to fix lower or upper bounds on the support cardinality of the queries, and possibly on that of the individual predicates involved, in each case individually.
In our experiments we adopt the Jaccard measure [27] to assess the accuracy of mined redescriptions. It provides a nice balance between simplicity of computation and agreement with the symmetric approach: Jaccard's coefficient takes the support of both queries into account equally. It is also scaled to the unit interval and does not involve entities that are not supported by either of the two queries.
Jaccard's coefficient is computed analogously for both methods (Algorithms 1 and 2). Two resulting vectors are formed based on the final structure of the decision trees: the entities which fall into the corresponding nodes are arranged into the l-vector for the left tree and the r-vector for the right tree. Since the rows of both sides of the data are keyed by id, it becomes convenient to compute the indexes we need for the final assessment. This is depicted in Figure 4.3, where green arrows indicate paths in the trees that match; namely, they compose a redescription mined from a particular pair of trees. Then, based on the resulting vectors, we compute the following quantities to be plugged into Jaccard's similarity function:
1. E1,1 - the number of entries where both queries hold (i.e. paths leading to the '1' class assignment);
2. E1,0 - the number of entries where only the first (left) query holds;
3. E0,1 - the number of entries where only the second (right) query holds.
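These three counts give the Jaccard coefficient J = E1,1 / (E1,1 + E1,0 + E0,1); a minimal R sketch of this computation from the two binary result vectors could look as follows:

jaccard <- function(l, r) {
  E11 <- sum(l == 1 & r == 1)   # both queries hold
  E10 <- sum(l == 1 & r == 0)   # only the left query holds
  E01 <- sum(l == 0 & r == 1)   # only the right query holds
  E11 / (E11 + E10 + E01)
}
jaccard(c(1, 1, 0, 1, 0), c(1, 0, 0, 1, 0))   # 2/3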
Depending on the purpose, the user is able to determine the minimal Jaccard coefficient a redescription must reach to be relevant for further analysis. In many domains, queries with similarity lower than 1 are also desirable and of scientific interest.
The quality of the queries involved in a redescription also determines its expressiveness and interestingness. For instance, long and nested expressions are hard to interpret and hence carry minor interest for data mining tasks. Nevertheless, very strong restrictions on the syntactic complexity of queries might severely limit the expressiveness. Thus, a balance between these two partly conflicting characteristics, which at the same time are difficult to assess, is needed.
The expressiveness of the language and the extent of interpretability of its individual elements are largely defined by the syntactic restrictions applied during the construction of queries (rules). One way to keep queries interpretable is to limit their maximal length. In our algorithms we limit them by adopting the maximal depth of the decision trees. We combine paths in the trees to form one side of a redescription and avoid negations by flipping the sign of the splitting rule of a node which is connected to its child with a 'no' label on the tree's edge.
4.7.2 Assessing Significance
It is important to be able to determine how significant the mined redescriptions are. Statistical significance is a crucial feature for assessing the quality of the returned results. The present-day concept of statistical significance, originating from R. Fisher [32], is widely used in statistical analysis and we exploit it in our experiments as well.
The simplest constraint applied to mined redescriptions can be accuracy, leaving out the redescriptions which do not exceed a given accuracy threshold. Nevertheless, the statistical significance of the redescriptions is also important. That is, the support of a redescription (q_l, q_r) should carry some information, given the supports of the individual queries. To measure this, we test against the null model in which the two queries are independent [21].
Statistical significance plays a vital role in statistical hypothesis testing, where it is used to determine whether a null hypothesis should be rejected or retained [39]. The intuition is as follows: a redescription should not be likely to appear at random from the underlying data distribution. That is, the accuracy of a redescription should not be readily deducible from the supports of its queries. In particular, if both queries that form a redescription cover almost all objects, the overlap of their supports is necessarily large as well, and the high accuracy of this redescription is naturally predictable.
The p-value is computed to represent the probability that two random queries with marginal probabilities equal to those of q_L and q_R have an intersection of size equal to or greater than |supp(q_L, q_R)|. The binomial distribution [58] is used for this probability, given as follows:

pval_M(q_L, q_R) = \sum_{s = |supp(q_L, q_R)|}^{|E|} \binom{|E|}{s} (p_R)^s (1 − p_R)^{|E| − s}

with p_R = |supp(q_L)| · |supp(q_R)| / |E|^2. This is the probability of obtaining a set of cardinality |E_{1,1}| or greater, if each element of a set of size |E| is selected with a probability equal to the product of the marginals p_L and p_R, according to the independence assumption. The authors of [18] used the same approach to evaluate the statistical significance of redescriptions.
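A small R sketch of this computation via the upper tail of the binomial distribution (the function name and the numbers below are ours and purely illustrative):

redescription_pval <- function(E11, supp_l, supp_r, n) {
  pR <- (supp_l * supp_r) / n^2                        # product of the marginal probabilities
  pbinom(E11 - 1, size = n, prob = pR, lower.tail = FALSE)   # P(X >= E11), X ~ Binomial(n, pR)
}
redescription_pval(E11 = 36, supp_l = 40, supp_r = 45, n = 2575)   # practically 0, i.e. highly significant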
The higher the p-value, the more likely it is to encounter the same support for two independent queries. Thus, the null hypothesis cannot be rejected and the redescription becomes less significant.
This theoretical p-value computation relies on an assumption about the underlying data distribution, namely that all elements of the population can be sampled with equal probability from a pre-defined distribution. The sampling distribution is calculated based only on past expectations, while the future relies on the stronger assumption of fixed marginals. Real data sets can deviate from these assumptions, which makes these significance tests weaker. These questions are not the main focus of our contribution, so we do not discuss them here in detail; instead we refer the reader to the relevant literature [35, 14].
Chapter 5
Experiments with Algorithms for Redescription Mining
5.1 Finding Planted Redescriptions
To assess the power of any elaborated method or technique it is essential to study its behavior on synthetic data, where we have complete control over the data format and parameters. Thus, we create synthetic data sets imitating the real-world setting in order to assess our algorithms' ability to find previously planted redescriptions, which gives an insight into the performance of the algorithms.
To implement this, it is necessary to make sure that the planted redescriptions consist of queries on both sides of the data set which have a perfect correspondence, i.e. their Jaccard coefficient is 1.
For Algorithm 1 the size of each matrix is set to 300 × 5. Two queries involving 3 parameters each are planted in this pair so as to form an exact correspondence. For Algorithm 2 the size of each matrix is set to 300 × 10, since it builds several decision trees of depth 1 and every new depth is restricted from picking a splitting rule which has already been used. Thus, planting a redescription involving 6 variables is vital for the algorithm to be able to reach the maximal allowed depth. Planting queries into such a large data array, especially when using a randomized procedure to turn the right-hand side into real values, results in a noisy data set. However, to study the behavior and the ability of the algorithm to deal with noise, we can track the accuracy in the same manner as we did for Algorithm 1.
In total we planted differently looking redescriptions with support from 30 to 50 rows for both algorithms. Later, random noise is added with densities ranging between 0.01 and 0.1. The noise can be both constructive (not interfering with the actual query) and destructive (damaging the queries). To generate the real-valued side of a data set we substitute values in one matrix: each 0 is substituted by a value uniformly distributed on the interval [0, 0.25], and each 1 by a value on the interval [0.75, 1].
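A small R sketch of this generation step (the noise model shown here, independent flips of roughly a density fraction of the entries, is only one possible reading of the procedure; the planted query itself is omitted):

to_real <- function(B) {
  R <- B
  R[B == 0] <- runif(sum(B == 0), 0.00, 0.25)   # zeros become small values
  R[B == 1] <- runif(sum(B == 1), 0.75, 1.00)   # ones become large values
  R
}
add_noise <- function(B, density) {
  flip <- matrix(runif(length(B)) < density, nrow = nrow(B))
  B[flip] <- 1 - B[flip]                        # flip a density fraction of the entries
  B
}
B <- matrix(rbinom(300 * 5, 1, 0.2), nrow = 300)   # toy Boolean left-hand side matrix
R <- to_real(add_noise(B, density = 0.05))         # noisy real-valued right-hand side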
In data sets without noise, Algorithm 1 was able to find the planted redescriptions with the highest accuracy. With constructive noise applied, Algorithm 1 was able to find the planted redescription up to density 0.03. In the other cases it returned redescriptions which had better accuracy than the planted one in the 'noisy' matrices. This confirms the anticipated behavior of the algorithm.
Figure 5.1 compares the Jaccard coefficients for Algorithm 1 (a) and Algorithm 2 (b). The red line in each chart shows the Jaccard coefficient of the planted redescription in the matrices with noise (the x-axis gives the density of the applied noise, from 0.01 to 0.1), while the blue line represents the Jaccard coefficients of the mined redescriptions returned by the algorithms on the noisy data. It can be seen that the Jaccard coefficient of the planted redescription is lower than (or equal to) that of the mined redescription. This happens because the applied noise happened to form a better match, which was naturally mined by the algorithms. Thus, the redescriptions found in the 'noisy' data possess greater accuracy than the planted one, so the algorithms are not to be blamed.
(a) Algorithm 1
(b) Algorithm 2
Figure 5.1: Jaccard’s coefficients of planted and mined redescriptions on ’noisy’ data
Note that, due to the character of the input data, Algorithm 2 was able to mine the redescription with a Jaccard coefficient of 0.67. The reason is that the generated synthetic matrix with a query involving 6 variables naturally contains noise. Hence, more detailed tests are needed to overcome this issue.
With destructive noise we destroyed the planted redescriptions gradually and were able to find them up to density 0.09. For example, having planted a redescription of the form (with a support of 30 rows and Jaccard 1):
(x1 ≥ 0.5 ∧ x3 < 0.5) ←→ (x3 ≥ 0.7602 ∧ x2 < 0.4984)
with destructive noise of density 0.09 Algorithm 1 mined (with support 26 and Jaccard 0.838):
(x1 ≥ 0.5 ∧ x3 < 0.5) ←→ (x3 < 0.7602 ∧ x3 ≥ 0.2132 ∧ x3 < 0.2137)
∨(x3 ≥ 0.7602 ∧ x2 < 0.4984)
Note that CART is allowed to select the same splitting parameter several times within one decision tree, uncovering the context dependency of the effects of certain variables [3]. Thus, the discovered redescription sometimes involved additional branches which were composed of the same splitting rule. The red part of the mined redescription is formed by an additional branch of the tree induced by CART; here the variable x3 is picked several times, as allowed in the CART algorithm [3]. Yet, the planted redescription was mined accurately, taking the noise level into account. These kinds of 'additional branches' arise in experimental runs and are caused solely by peculiarities of CART. The tree-building procedure in Algorithm 2 prevents CART from using the same splitting rule twice. Thus, it managed to mine the planted redescriptions precisely up to noise level 0.4 (depending on the particular run). For both algorithms additional tests would be advantageous for a more profound assessment. In Section 5.5 we provide a comparison to the ReReMi algorithm to give better insight into the performance of our contributions.
5.2 The Real-World Data Sets
In order to actually test and evaluate the devised methods in practical conditions it is important to apply them to real-world data. For this we use two data sets: Bio, for the biological niche-finding problem, and DBLP, for mining redescriptions from Computer Science bibliography data.
Both our algorithms were initially implemented in R. In this section we exemplify the mined results using an available tool for interactive redescription mining called Siren [19]; its plotting capabilities provide a visual impression.
As an input data set we use two matrices. The data is composed from publicly available sources: the Bio¹ data set uses data from the European mammals atlas [40] and climatic data from [26]. The DBLP² data set is formed from [2], where the left matrix contains the conferences and the number of papers published by each author in them, and the right matrix contains the authors and the frequency of their cooperation with each other.
Table 5.1 describes the data sets used in the experiments with real-world data.
Table 5.1: Real-world data sets used in experiments

Data set | Descriptions          | Dimensions   | Type
Bio      | Locations × Mammals   | 2575 × 194   | Boolean
         | Locations × Climate   | 2575 × 48    | Real values
DBLP     | Authors × Conferences | 2345 × 19    | Integer
         | Authors × Authors     | 2345 × 2345  | Integer

5.3 Experiments With Algorithms on Bio-climatic Data Set
Algorithm 1.
Firstly, experiments are run on the biological data from Table 5.1, called Bio. In biology, for a species to survive, the terrain where it lives should maintain certain bio-climatic constraints which form that species' bioclimatic envelope (or niche) [23]. Finding these constraints with algorithms for redescription mining assists in determining bio-climatic envelopes.
In the Bio data set the left side is represented by a matrix which contains locations in Europe and the mammals living there. If an animal is present in a particular area, there is a 1, and vice versa: a 0 for the places where this animal does not live. Thus, the left matrix contains only Boolean data.
The right side (matrix R) consists of the same locations (keyed by IDs) and climatic data. In particular, we take into consideration the minimal, maximal and average temperature in each month and the average rainfall measurements (in millimeters per month).
Algorithm 1 was run on the Bio data with different parameters (impurity measures and min_bucket) and returned a redescription set for each of them. Example redescriptions are shown in Tables 5.2 and 5.3; each table has been composed of several redescriptions mined by Algorithm 1 with particular parameters (indicated in the tables' headers).
1 http://www.worldclim.org
2 http://www.informatik.uni-trier.de/~ley/db/
The resulting p-values make these redescriptions statistically significant at the highest level (99%); we did not encounter any redescription with a p-value higher than 0.0003 for any of the selected parameters on the Bio data set.
Table 5.2: Redescriptions mined by Algorithm 1 from the Bio data set (with Gini impurity measure and min_bucket = 20). LHS is the left-hand side query of the redescription; RHS is the right-hand side query; J is the Jaccard similarity; E1,1 is the support; t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.
LHS: (Polar.bear ≥ 0.5)
RHS: (t^max_Mar < −7.05)
J = 0.947, E1,1 = 36

LHS: (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5)
RHS: (t^max_Aug < 14.65 ∧ t^min_Jul ≥ 6.35) ∨ (t^max_Aug ≥ 14.65)
J = 0.979, E1,1 = 2379

LHS: (Moose ≥ 0.5 ∧ Wood.mouse < 0.5)
RHS: (t^max_Feb ≥ −1.15 ∧ t^max_Apr < 7.55 ∧ t^max_Jul ≥ 14.05) ∨ (t^max_Feb < −1.15 ∧ p^avg_Aug ≥ 58.85)
J = 0.801, E1,1 = 449
Thus, we have a set of rules (redescriptions) which can be interpreted by analysts to find environmental envelopes for species. Looking at Table 5.2, we can see that the rule:
(Polar.bear ≥ 0.5) ←→ (t^max_Mar < −7.05)
or equivalently:
Polar.bear ←→ (t^max_Mar < −7.05)
implies that the Polar Bear lives in areas where the maximal temperature in March is lower than −7.05 degrees Celsius, with a support of 36 rows and a high Jaccard coefficient (above 0.9). This redescription outlines logical conditions for the Polar Bear to live in (a cold climate).
The decision tree pair which resulted in this redescription is depicted in Figure 5.2. The redescription and the trees in this particular case look simple and interpretable. Yet, very often the user may encounter more complex cases (exemplified further below), so visualization becomes more essential.
Figure 5.2: A pair of decision trees returned by the Algorithm
The second redescription from Table 5.2:
(¬Arctic.Fox ∧ ¬Polar.bear) ←→ (t^max_Aug < 14.65 ∧ t^min_Jul ≥ 6.35) ∨ (t^max_Aug ≥ 14.65)
can be formulated as follows:
In places where neither the Arctic Fox nor the Polar Bear lives, the maximal temperature in August is below 14.65 and the minimal temperature in July is greater than or equal to 6.35, or the maximal temperature in August is greater than or equal to 14.65 degrees Celsius.
This redescription rather describes living conditions which are not suitable for the Polar Bear and the Arctic Fox, since these mammals are negated in the left query. The corresponding pair of decision trees is depicted in Figure 5.3.
Figure 5.3: A pair of decision trees returned by the Algorithm
The final redescription from Table 5.2 is longer and has a more complex structure:
(Moose ∧ ¬Wood.mouse) ←→ (t^max_Feb ≥ −1.15 ∧ t^max_Apr < 7.55 ∧ t^max_Jul ≥ 14.05) ∨ (t^max_Feb < −1.15 ∧ p^avg_Aug ≥ 58.85)
can be expressed as follows:
The Moose lives in places without the Wood Mouse where the maximal temperature in February is above −1.15, in April below 7.55 and in July above 14.05 degrees, or in places where the maximal temperature in February is below −1.15 degrees and the average rainfall in August is greater than 58.85 millimeters.
The two decision trees from which this redescription was formed are depicted in Figure 5.4.
Figure 5.4: A pair of decision trees returned by the Algorithm
In a similar manner all results can be interpreted. With the parameters indicated in Table 5.2, Algorithm 1 found 91 unique redescriptions, 55 of which have a Jaccard coefficient above 0.8. The redescriptions vary in support size: some of them cover only a small part of the data (below 200 rows), while others cover almost the whole data (above 2000 rows out of 2575). Yet, all of them have high accuracy and are statistically significant.
We limited the maximal depth of the trees to 3, because longer redescriptions have a more nested structure and are harder to interpret. However, in many instances Algorithm 1 terminated earlier, since either there were no changes compared to the previous depth or the resulting leaf nodes were pure, consequently resulting in shorter redescriptions.
Table 5.3 presents one more run of Algorithm 1 on the Bio data set; here we use Information Gain as the impurity measure and set min_bucket = 100, meaning we force the underlying decision tree induction algorithm to perform splits in such a way that there are at least 100 entries in each node.
Table 5.3: Redescriptions mined by Algorithm 1 from the Bio data set (with IG impurity measure and min_bucket = 100). LHS is the left-hand side query of the redescription; RHS is the right-hand side query; J is the Jaccard similarity; E1,1 is the support; t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.
LHS: (Arctic.Fox < 0.5)
RHS: (t^avg_Jun < 10.25 ∧ t^max_Sep ≥ 10.75) ∨ (t^avg_Jun ≥ 10.25)
J = 0.965, E1,1 = 2347

LHS: (Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5)
RHS: (t^max_Oct < 10.85 ∧ t^max_Feb < −1.45 ∧ t^avg_Jul ≥ 10.65)
J = 0.701, E1,1 = 353

LHS: (European.Hamster ≥ 0.5)
RHS: (p^avg_Oct < 45.15 ∧ p^avg_Jun ≥ 61.85 ∧ p^avg_Apr < 48.25)
J = 0.483, E1,1 = 151
Let us discuss redescriptions from this experimental run as well. The first redescription here is:
(¬Arctic.Fox) ←→ (t^avg_Jun < 10.25 ∧ t^max_Sep ≥ 10.75) ∨ (t^avg_Jun ≥ 10.25)
can be expressed as follows:
The Arctic Fox does not live in places where the average June temperature is below 10.25 degrees and the maximal temperature in September is greater than 10.75, or where the average temperature in June is greater than 10.25 degrees Celsius.
This rule also describes living conditions which are not suitable for a mammal. The information about conditions which do not allow some species to survive, combined with other redescriptions that involve the same animal, can put all aspects of its preferences together and describe both the suitable and the inappropriate living conditions for a particular animal. Yet, in this particular case, the redescription covers almost the whole Bio data set, which diminishes its value.
The decision trees which were built using Algorithm 1 and formed this redescription are shown in Figure 5.5:
Figure 5.5: A pair of decision trees returned by the Algorithm
Let us consider the second redescription from Table 5.3:
(Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5) ←→ (t^max_Oct < 10.85 ∧ t^max_Feb < −1.45 ∧ t^avg_Jul ≥ 10.65)
or equivalently:
(¬Wood.mouse ∧ Mountain.Hare) ←→ (t^max_Oct < 10.85 ∧ t^max_Feb < −1.45 ∧ t^avg_Jul ≥ 10.65)
This redescription is formed from a pair of decision trees grown by Algorithm 1. They are depicted in Figure 5.6, and one can assess them even without the text representation of the redescription.
Figure 5.6: A pair of decision trees returned by the Algorithm
This redescription can be expressed in natural language as follows:
The Mountain Hare lives in places without the Wood Mouse, where the maximal temperature in October is below 10.85, the maximal temperature in February is lower than −1.45 and the average temperature in July is greater than or equal to 10.65 degrees Celsius.
The final redescription from Table 5.3:
(European.Hamster) ←→ (p^avg_Oct < 45.15 ∧ p^avg_Jun ≥ 61.85 ∧ p^avg_Apr < 48.25)
describes conditions for the European Hamster and can be formulated as:
The European Hamster dwells in territories in Europe where the rainfall in October is lower than 45.15, in June greater than 61.85 and in April lower than 48.25 millimeters.
A pair of decision trees for this particular case can be found in Figure 5.7.
Figure 5.7: A pair of decision trees returned by the Algorithm
This rule shows that for the European Hamster precipitation is more essential than temperature conditions, because at each new depth of the decision tree CART selected rainfall as the splitting rule which maximizes the nodes' purity. Yet, only an expert can confirm the importance of rainfall measurements for the hamster.
Moreover, in some instances the trees in a pair have different depths. This is not surprising, because Algorithm 1 builds each of them by increasing the depth with each iteration and comparing the result with the previous one. Termination happens when there is no improvement compared to the split at the previous depth or the resulting nodes are pure. Differently looking trees are assessed in the same manner as described in detail in Section 4.5.
Using the parameters indicated in Table 5.3, Algorithm 1 found 44 unique redescriptions, 25 of them with Jaccard coefficients above 0.8. Similarly to the experiment with the parameters from Table 5.2, they received different supports (from a few rows to almost the whole data) and have high accuracy. All redescriptions involve different parameters, yet they are easy to interpret, since they do not include very complex and nested structures. Full results can be found in Appendix A.
Plotting on a map. The biological data set provides one more flexibility option: the spatial coordinates associated with each location in Europe assist in visualizing the derived results. Plotting on the map makes it easier to evaluate and interpret the outcomes. So, whenever the user encounters difficulties in reading mined redescriptions, plots of the resulting trees solve this issue.
Let us exemplify all the redescriptions discussed above from Tables 5.2 and 5.3. Figure 5.8 represents the three aforementioned redescriptions from Table 5.2 on a map: (a) shows the first redescription, (b) the second and (c) the third. For all plots, red color indicates places where only the left-hand side query holds, blue where only the right-hand side query holds, and purple areas depict places where both queries hold.
Figure 5.8: Example redescriptions from Table 5.2. (a) first; (b) second; (c) third.
Moreover, plots on a map of the redescriptions from Table 5.3 can be found in Figure 5.9:
Figure 5.9: Example redescriptions from Table 5.3. (a) first; (b) second; (c) third.
Algorithm 2. We tested Algorithm 2 on the same data set (Bio) using the same parameters. Some example redescriptions for one of the runs of Algorithm 2, using IG as the impurity measure and min_bucket = 50, are presented in Table 5.4. The full version of the results can be found in Appendix A.
Table 5.4: Redescriptions mined by Algorithm 2 from the Bio data set (with IG impurity measure and min_bucket = 50). LHS is the left-hand side query of the redescription; RHS is the right-hand side query; J is the Jaccard similarity; E1,1 is the support; t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.
LHS: (Mediterranean.Water.Shrew ≥ 0.5 ∧ Alpine.Shrew ≥ 0.5) ∨ (Mediterranean.Water.Shrew < 0.5 ∧ Moose < 0.5 ∧ Arctic.Fox < 0.5)
RHS: (p^avg_May ≥ 58.65 ∧ p^avg_Jun ≥ 86.85) ∨ (p^avg_May < 58.65 ∧ t^max_Nov ≥ 6.85 ∧ t^max_Sep ≥ 10.75)
J = 0.912, E1,1 = 1406

LHS: (Kuhl.s.Pipistrelle ≥ 0.5 ∧ Alpine.marmot < 0.5) ∨ (Kuhl.s.Pipistrelle < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5)
RHS: (t^max_Mar ≥ 11.05 ∧ t^min_Feb ≥ −3.95) ∨ (t^max_Mar < 11.05 ∧ t^avg_Mar ≥ 6.375 ∧ t^max_Jan ≥ 3.55)
J = 0.802, E1,1 = 759

LHS: (Brown.Bear ≥ 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (Brown.Bear < 0.5 ∧ Wood.mouse < 0.5 ∧ Moose ≥ 0.5)
RHS: (t^min_Jan < −8.25 ∧ t^max_Sep < 17.35) ∨ (t^min_Jan ≥ −8.25 ∧ t^max_Feb < −1.15 ∧ t^max_Mar < 5.55)
J = 0.762, E1,1 = 492
These redescriptions constitute a good example of where visualization options become crucial. Let us consider the first redescription:
(Mediterranean Water Shrew ∧ Alpine Shrew) ∨ (¬Mediterranean Water Shrew ∧ ¬Moose ∧ ¬Arctic Fox) ←→ (p^avg_May ≥ 58.65 ∧ p^avg_Jun ≥ 86.85) ∨ (p^avg_May < 58.65 ∧ t^max_Nov ≥ 6.85 ∧ t^max_Sep ≥ 10.75)
This rule implies that either both the Mediterranean Water Shrew and the Alpine Shrew, or neither the Mediterranean Water Shrew, nor the Moose, nor the Arctic Fox, live in areas where either it rains more than 58.65 millimeters in May and more than 86.85 millimeters in June, or it rains less than 58.65 millimeters in May and the maximal temperature is above 6.85 degrees in November and above 10.75 degrees Celsius in September. In a similar manner the remaining redescriptions can be expressed. The pair of resulting decision trees for this redescription from Table 5.4 is shown in Figure 5.10.
Figure 5.10: A pair of decision trees returned by the Algorithm
The second redescription from Table 5.4:
(Kuhl's Pipistrelle ≥ 0.5 ∧ Alpine.marmot < 0.5) ∨ (Kuhl's Pipistrelle < 0.5 ∧ Common Shrew < 0.5 ∧ House.mouse ≥ 0.5) ←→ (t^max_Mar ≥ 11.05 ∧ t^min_Feb ≥ −3.95) ∨ (t^max_Mar < 11.05 ∧ t^avg_Mar ≥ 6.375 ∧ t^max_Jan ≥ 3.55)
was formed from a pair of trees depicted in Figure 5.11.
Figure 5.11: A pair of decision trees returned by the Algorithm
The final redescription from Table 5.4:
(Brown Bear ∧ Mountain Hare) ∨ (¬Brown Bear ∧ ¬Wood mouse ∧ Moose) ←→ (t^min_Jan < −8.25 ∧ t^max_Sep < 17.35) ∨ (t^min_Jan ≥ −8.25 ∧ t^max_Feb < −1.15 ∧ t^max_Mar < 5.55)
was formed from a pair of decision trees which are depicted in Figure 5.12:
Figure 5.12: A pair of decision trees returned by Algorithm 2.
With the parameters indicated in Table 5.4, Algorithm 2 returned 17 unique redescriptions, all of them statistically significant. The support of each lies in the range [300; 1600] rows, which is acceptable for the Bio data set. Half of the redescriptions have high accuracy (above 0.8), and even the others have a Jaccard coefficient above 0.5.
Plotting on a map. Plotting is available for the second algorithm as well. Figure 5.13 illustrates the redescriptions from Table 5.4 on a map of Europe. Representation on a map makes it easier to evaluate the quality of a redescription. As before, red marks the left query, blue the right query, and purple the overlap of both:
Figure 5.13: Support of redescriptions from Bio data set. (a) First; (b) second; (c) third redescription from Table 5.4.
Maps also help to consider the areas where the animals actually live. The overlap indicates places in Europe where the whole redescription is true, i.e. the animals from the left query live there (or do not live there, if they are negated in the left query) and the climatic conditions from the right query hold. The size of the overlap is the support of the redescription (i.e. E1,1), which becomes a defining feature when assessing the quality of results on real-world data sets.
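A rough sketch of the colouring used in such maps, assuming the coordinates of each grid cell are available as lon and lat columns of a data frame bio (hypothetical names) and lhs and rhs are the logical support vectors of the two queries:

    col <- ifelse(lhs & rhs, "purple",
           ifelse(lhs, "red",
           ifelse(rhs, "blue", "grey80")))
    plot(bio$lon, bio$lat, col = col, pch = 15, cex = 0.4,
         xlab = "Longitude", ylab = "Latitude")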
5.3.1 Discussion
While running experiments with the Bio data set, we used two impurity measures: the Gini index and Information Gain (their default implementations from R's package rpart [52]). Yet, any other impurity measure can be plugged into both algorithms.
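For illustration, a minimal sketch of the tree-induction step as it could be invoked with rpart; the variable names are placeholders and the concrete parameter values are only examples, not the exact calls of our implementation:

    library(rpart)

    # `attrs` holds the attributes of one side of the data set, `target` is the
    # current binary target vector (e.g. a thresholded species column).
    df  <- data.frame(target = factor(target), attrs)
    fit <- rpart(target ~ ., data = df, method = "class",
                 parms   = list(split = "gini"),          # or split = "information" for IG
                 control = rpart.control(minbucket = 50,  # minimal number of entities per leaf
                                         maxdepth  = 3))  # depth limit d = 3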
In addition to this, increasing the min bucket parameter (the minimal number of entities in a leaf node) results in a smaller number of final redescriptions. Yet, those have slightly higher Jaccard similarity and a shorter structure. Note that the decision tree induction routine will not be able to split the data if the min bucket parameter is set erroneously, i.e. if min bucket is greater than the total number of places in Europe where a particular animal lives divided by two (e.g. min bucket > Li/2), since we split each parent node into two children.
We used the minimal size of min bucket to cover at least 1% of the rows of the data set in our experiments. In such a setting, the algorithms run on the Bio data set returned quite big trees for the animals which live in many areas of Europe (more than 1000 areas out of the 2575 available). Nevertheless, animals such as the Polar Bear or the Moose, which have quite specific living conditions (cold climate), yielded nice, easy-to-analyze trees.
When the minimal number of entries in a node is set to 100 or 50, trees naturally tend to be smaller. This is more suitable for the animals which live in more than 500 locations in Europe (i.e. Shrews, Foxes, Mice, etc.). Since all of them live all over Europe, this limitation helps in finding meaningful redescriptions for them, with more specific climatic features. Animals such as the Moose, Polar Bear, or Seal are not very widespread and live in very specific climatic conditions (e.g. cold areas and, for the Seal, the coastline), so setting min bucket too high is not suitable for them. Setting the minimal node size to half of the number of places where a particular animal lives can be considered a 50% threshold. This implies that we expect at least half of the population of a particular animal to share the same conditions, which makes sense: in such a case we know that the given redescription describes a niche which is shared by at least half of the occurrences of the considered animal.
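These settings can be written down per target column; the sketch below assumes the species matrix is called species and that presence is coded as values of at least 0.5 (illustrative names, not the thesis code):

    L <- colSums(species >= 0.5)          # Li: number of places where animal i lives
    mb_half    <- floor(L / 2)            # the 50% threshold; larger values prevent any split
    mb_percent <- pmax(1, floor(L / 100)) # the Li/100 setting used in some runs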
If we compare both algorithms based on their results on the Bio data set, the following can be seen:
• Both algorithms returned statistically significant redescriptions (with p-value < 0.01 in all runs).
• If we compare the accuracy of the results (e.g. Jaccard's coefficient), it can be concluded that in every run Algorithm 1 returns up to 67% of redescriptions with accuracy above 0.8, while Algorithm 2 only up to 50%. Note that only Algorithm 1 managed to mine redescriptions with perfect accuracy (Jaccard exactly 1) in some runs (for instance, with Gini and IG and min bucket = Li/100), but they have low support size (below 30 rows), making them less informative. Algorithm 2 did not return redescriptions with a Jaccard coefficient of exactly 1.
• The size of the support of a redescription is an important parameter which determines the extent of its interestingness. If we apply a threshold of 100 rows to regard redescriptions as interesting and compare only those which have E1,1 > 100, it can be seen that Algorithm 1 mines up to 75% of redescriptions with accuracy above 0.8 and support size of at least 100, while for Algorithm 2 this share is not greater than 48% in any run.
• If we look at the Top-20 (per Jaccard) redescriptions that have supp > 100 and p-value < 0.01, we can conclude that redescriptions from Algorithm 1 in most cases cover above 1700 rows, while support sizes in Algorithm 2 are more diverse (from 150 up to 2000). Based on this, Algorithm 1 returns rules which hold for the majority (or almost all) of the rows in the data set, i.e. redescriptions describing conditions which are true all over Europe. Also, its queries are shorter, involving fewer attributes, because decision tree induction very often terminates before reaching the maximal allowed depth of the trees (we were using d = 3). In Algorithm 2 more parameters are involved and the structure of the queries is more nested (its trees mainly terminate when reaching the maximal depth). These redescriptions reveal more specific details concerning fauna and climate peculiarities in Europe.
• One more aspect to be compared is the overlap of the redescriptions (i.e. queries involving the same attributes, making some redescriptions similar to each other). Algorithm 1 tends to produce more overlapping redescriptions (approx. 65%), because CART selects the same splitting rules from the whole data set over and over again, regardless of the initialization point used. For Algorithm 2 overlapping redescriptions happen less frequently (approx. 50%), because we build every depth and branch of the tree independently, using the corresponding part of the data set. These percentages vary slightly depending on the parameters used within each run, but the global tendency holds for all experimental runs. Hence, if we sort redescriptions by Jaccard (from highest to lowest) and discard redescriptions which involve identical animals, we can see that the accuracy of the remaining redescriptions for both algorithms is similarly high (around 0.8 on average) and that the support of the redescriptions for Algorithm 1 is about 10% greater (depending on the parameters used) than for Algorithm 2.
• Usage of the Gini impurity measure on the Bio data set (with otherwise equal conditions) in Algorithm 1 resulted in slightly deeper trees and consequently in longer redescriptions. For example, in Algorithm 1 with min bucket = 20, Gini returned 91 redescriptions (average query length 5.51 variables) and IG returned 71 unique redescriptions (average query length 4.87 variables). However, for Algorithm 2, Information Gain returned slightly longer redescriptions than the ones mined with the Gini index. For example, Algorithm 2 with min bucket = 20 mined 67 redescriptions with Gini (average query length 7.06 variables), while with IG it mined 37 unique redescriptions (average query length 7.75 variables). For both algorithms, usage of Information Gain tends to produce a higher percentage of repeating redescriptions in each experimental run and fewer unique redescriptions in total compared to the Gini index.
All in all, both algorithms return reasonable redescriptions with high accuracy for the Bio data set. When testing them with equal parameters, it can be seen that Algorithm 1 found a greater number of redescriptions, which are shorter than the ones mined by Algorithm 2 and more similar to each other. For example, the Moose, the House Mouse and the Stoat participate in many of them. This is caused by the fact that the CART algorithm tends to pick splitting rules which maximize the purity of the resulting nodes, and very often these splitting rules coincide for different initialization points, since they provide the greatest contribution to the nodes' purity.
Redescriptions mined by Algorithm 2 involve a wider variety of variables in both (left and right) queries. This is due to the fact that we build each layer of the decision tree independently, using the corresponding part of the target vector. Moreover, each branch of a decision tree in Algorithm 2 also grows independently until the stopping criterion is met. Basically, we induce several decision trees of depth 1 using the corresponding part of either the left or the right matrix. This fact explains the inclination of Algorithm 2 to produce deeper trees: whenever a node contains entities of both classes (class '1' and class '0'), CART is able to split them into two leaf nodes.
Both elaborated algorithms found interesting rules when applied to the Bio data set. All of them were statistically significant and had varying support (from a few rows to almost the whole data set). Thus, they can be used in the problem of finding bio-climatic envelopes for species. Nevertheless, some resulting redescriptions with high supports, despite being accurate, might pose little interest to biologists, as they combine complex climatic queries with several, possibly unrelated, species. Usage of the p-value to check the significance of the results mitigates this issue but does not resolve it completely.
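Purely to illustrate the general idea behind such a significance check (the actual p-value computation follows the procedure described earlier in this thesis), a randomization-style estimate could look as follows; shuffling one query's support vector is an assumed null model here, not necessarily the one we use:

    perm_pvalue <- function(lhs, rhs, n_perm = 1000) {
      obs  <- sum(lhs & rhs) / sum(lhs | rhs)   # observed Jaccard similarity
      hits <- replicate(n_perm, {
        r <- sample(rhs)                        # random re-pairing of rows
        sum(lhs & r) / sum(lhs | r) >= obs
      })
      (sum(hits) + 1) / (n_perm + 1)            # estimated p-value
    }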
5.4 Experiments With Algorithms on Conference Data Set
Many real-world data sets cannot be represented as a data set consisting of two matrices, one of which is binary. That is why the binarization routine is essential to enable the application of our algorithms in such cases. For the left-hand side of the DBLP data set we tried three different clustering techniques plugged into the discretization procedure described in Subsection 4.6.1. In total, we used three clustering approaches, but this set can be extended with other ones if needed. Generally, when analyzing bibliography data, mined redescriptions shed light on the communities of researchers that contribute the most to a field.
Density-based spatial clustering (DBSCAN). This clustering technique does not require specification of the number of clusters; it automatically detects the necessary number of clusters based on the notion of density reachability [15]. A cluster, which is a subset of the points of the database, should satisfy two properties: all points within it are mutually density-connected, and if a point is density-reachable from any point of the cluster, it belongs to the cluster too. The algorithm itself requires two parameters from the user: the minimal number of points in a cluster and a distance [4].
We applied DBSCAN to our DBLP data set, where initially we had 19 conferences in the left matrix, and used clustering to split each of them into intervals and transform the data into a binary matrix. These intervals represent the number of papers each author published within a particular conference. The columns of the discretized matrix are used as initial target vectors for both algorithms. In this particular case we should take into account the characteristics of the data set. Namely, most of the authors submitted somewhere below 7-10 papers to a conference, while there are rare instances when an author submits more than 15 papers. Thus, the first clusters are quite dense and the last ones are mainly sparse. We picked the distance and number of points for DBSCAN so as to obtain a segregation of each conference into 5-10 clusters. Every new data set may require a different set of initial parameters to perform well; very often in data mining the results are highly dependent on parameter selection.
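A sketch of this discretization step for a single conference column; the CRAN dbscan package, the column name, and the eps/minPts values are assumptions chosen for illustration only:

    library(dbscan)

    counts <- dblp_conf[["ICDE"]]   # papers per author in one conference (hypothetical name)
    cl <- dbscan(matrix(counts, ncol = 1), eps = 1, minPts = 10)$cluster
    # one binary indicator column per detected interval of paper counts
    bin <- sapply(sort(unique(cl)), function(k) as.integer(cl == k))
    colnames(bin) <- paste0("ICDE_cluster_", sort(unique(cl)))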
DBSCAN with Algorithm 1. Table 5.5 illustrates several redescriptions mined by Algorithm 1 after the aforementioned discretization of the left-hand side matrix.
Table 5.5: Redescriptions mined by Algorithm 1 from DBLP data set (with DBSCAN binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support.

LHS: ECML ≥ 2.5 ∧ UAI ≥ 1.5
RHS: Peter Grunwald ≥ 0.5
J = 0.571, E1,1 = 4

LHS: ICDE ≥ 12.5 ∧ EDBT < 3.5
RHS: Anthony K. H. Tung ≥ 1.5 ∧ Jeffrey Xu Yu ≥ 0.5
J = 0.5, E1,1 = 5

LHS: STOC ≥ 8.5 ∧ SODA < 5.5
RHS: Avi Wigderson ≥ 0.5 ∧ Silvio Micali ≥ 0.5
J = 0.133, E1,1 = 10
With these parameters Algorithm 1 mined 15 unique redescriptions with high accuracy. The majority of them cover either fewer than 10 rows or almost the whole data set, which makes them either quite obvious or expected, or supported by an insufficient number of rows. The complete list of results can be found in Appendix B. The first redescription from Table 5.5,

ECML ≥ 2.5 ∧ UAI ≥ 1.5 ←→ Peter Grunwald ≥ 0.5,

implies that if some author has published at least 3 papers within ECML and at least 2 papers within UAI, he/she has likely co-authored with Peter Grunwald at least once. This redescription holds only for 4 rows in the DBLP data set, which makes it less informative. Formally, there are no strict bounds on the minimal or maximal support which make a redescription interesting.
The following redescription from Table 5.5,

ICDE ≥ 12.5 ∧ EDBT < 3.5 ←→ Anthony K. H. Tung ≥ 1.5 ∧ Jeffrey Xu Yu ≥ 0.5,

claims that if you published 13 or more papers at ICDE and from 0 to 3 papers at EDBT, you have probably co-authored twice (or more) with Anthony K. H. Tung and at least once with Jeffrey Xu Yu. This redescription has support 5, which can be considered low as well. But there are not many people who submit that many papers to a single conference, so the size of the support in this case can be considered acceptable to regard this redescription as informative.
Let us consider the last redescription from this table:

STOC ≥ 8.5 ∧ SODA < 5.5 ←→ Avi Wigderson ≥ 0.5 ∧ Silvio Micali ≥ 0.5

It can be formulated in natural language as follows: if you have 9 or more papers accepted at STOC and 5 or fewer papers at SODA, you have co-authored at least once with Avi Wigderson and at least once with Silvio Micali.
All other rules can be interpreted in an analogous way. The decision tree pair for this redescription is depicted in Figure 5.14. We provide a decision tree exemplification only for this redescription among the ones from the DBLP data set; the others can be plotted analogously.
DBSCAN with Algorithm 2. Similarly, we ran experiments on the same DBLP data set using Algorithm 2. When using the Information Gain impurity measure and min bucket = Li/100, Algorithm 2 found 110 unique redescriptions with diverse support sizes (from a few rows to almost the whole data set), 15 of which have p-value > 0.01, which makes them statistically insignificant. 31 of the redescriptions have a Jaccard coefficient higher than 0.8. Unlike with the first algorithm, here we obtained lower Jaccard similarity but greater support for each redescription. Some of the results are listed in Table 5.6, while the full report, including p-values for each redescription, can be found in Appendix B.
Figure 5.14: A pair of decision trees returned by Algorithm 1.
Table 5.6: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG impurity measure; min bucket = Li/100). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support.

LHS: STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5 ∨ STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5
RHS: Tomas Feder < 0.5 ∧ Avi Wigderson ≥ 0.5 ∧ Catriel Beeri < 0.5 ∨ Tomas Feder ≥ 0.5 ∧ Amos Fiat ≥ 0.5 ∧ Serge A. Plotkin < 1.5
J = 0.919, E1,1 = 711

LHS: VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5
RHS: Rakesh Agrawal ≥ 0.5 ∨ Rakesh Agrawal < 0.5 ∧ Hamid Pirahesh ≥ 0.5 ∧ Jiawei Han < 0.5
J = 0.809, E1,1 = 689

LHS: COLT ≥ 3.5
RHS: Manfred K. Warmuth ≥ 0.5
J = 0.226, E1,1 = 19
The first redescription from Table 5.6,

STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5 ∨ STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5 ←→ Tomas Feder < 0.5 ∧ Avi Wigderson ≥ 0.5 ∧ Catriel Beeri < 0.5 ∨ Tomas Feder ≥ 0.5 ∧ Amos Fiat ≥ 0.5 ∧ Serge A. Plotkin < 1.5,

states that if you have not published any paper at either STOC or the SIGMOD Conference but have at least one publication at FOCS, or you have at least one paper at STOC and SODA but no papers at ICDE, then you have likely co-authored neither with Tomas Feder nor with Catriel Beeri but have collaborated with Avi Wigderson at least once; or you have collaborated with Tomas Feder and Amos Fiat at least once, and have worked with Serge A. Plotkin from 0 to 1 times. The support of this redescription is in an acceptable range to claim it is informative and accurate enough. The p-value of this redescription is zero, which makes the result statistically significant as well.
The second redescription from Table 5.6,

VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5 ←→ Rakesh Agrawal ≥ 0.5 ∨ Rakesh Agrawal < 0.5 ∧ Hamid Pirahesh ≥ 0.5 ∧ Jiawei Han < 0.5,

claims that if you have published at least 2 papers at VLDB, or from 0 to 1 papers at VLDB, at least 1 paper at the SIGMOD Conference and no papers at ICDM, then you have probably co-authored with Rakesh Agrawal at least once; or you have co-authored neither with him nor with Jiawei Han, but you have at the same time at least one publication with Hamid Pirahesh.
The final redescription from Table 5.6,

COLT ≥ 3.5 ←→ Manfred K. Warmuth ≥ 0.5,

states that if you have published 4 or more papers within COLT, then you have co-authored with Manfred K. Warmuth one or more times. Note that this redescription has quite low accuracy (0.226), which makes it less interesting despite the acceptable level of support.
Thus, Algorithm 2 mined more interesting and diverse redescriptions which are still statistically significant. They have longer structures and involve a greater number of variables compared to Algorithm 1. Whenever the length of the resulting redescriptions becomes larger than desired, the user may limit the maximal depth of the trees (we were using max depth = 3 so far).
Usage of DBSCAN is advantageous due to its ability to detect the necessary number of clusters automatically. Thus, the user has no need to specify it, which makes the whole process easier.
k-means. As one more option to test our algorithms and compare the results, we adopted the k-means clustering technique in the discretization of the data set. Unlike DBSCAN, k-means clustering [37] requires the user to indicate the desired number of clusters. This poses an issue of its own, which is vigorously discussed in the scientific literature [30, 53, 22]. The correct choice of the number of clusters is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. Increasing the number of clusters without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error when each data point is considered its own cluster (i.e. when there are as many clusters as data points). Intuitively, the optimal number of clusters is a balance between these extreme cases.
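The same discretization step as above, sketched with k-means and five clusters per conference (the column name and nstart value are illustrative assumptions):

    counts <- dblp_conf[["VLDB"]]
    km  <- kmeans(counts, centers = 5, nstart = 10)
    bin <- sapply(1:5, function(k) as.integer(km$cluster == k))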
Nevertheless, when working with the DBLP data set, we experimented with partitioning each conference into 5 clusters. This choice is motivated by prior knowledge about the data: partitioning into a smaller number of clusters results in highly dense clusters which represent from 0 to 7 submitted papers within one conference, while partitioning into more than 10 clusters would treat some data points as separate clusters and impose an unwanted computational burden. Some redescriptions returned by Algorithm 1 are listed in Table 5.7.
Table 5.7: Redescriptions mined by Algorithm 1 from DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support.

LHS: UAI ≥ 2.5 ∧ KDD ≥ 2.5
RHS: Tomi Silander ≥ 0.5
J = 0.500, E1,1 = 4

LHS: VLDB ≥ 18.5 ∧ SIGMODConference < 26.5
RHS: Shaul Dar ≥ 0.5
J = 0.357, E1,1 = 5

LHS: STOC ≥ 8.5 ∧ SODA < 5.5
RHS: Avi Wigderson ≥ 0.5 ∧ Silvio Micali ≥ 0.5
J = 0.113, E1,1 = 10
With these parameters Algorithm 1 mined 8 unique redescriptions (2 of them have p-value > 0.1). The other, statistically significant, redescriptions have support around 10 rows, leading to the conclusion that with k-means clustering used within the discretization routine, Algorithm 1 returns more intuitively expected rules (the full set of outcomes can be found in Appendix B). Algorithm 2 was also applied to this data set; some resulting redescriptions are listed in Table 5.8, while the full set is presented in Appendix B.
Table 5.8: Redescriptions mined by Algorithm 2 from DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = Li/100). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support.

LHS: ICDE ≥ 12.5 ∧ EDBT ≥ 2 ∨ ICDE < 12.5 ∧ SIGMODConference ≥ 19.5 ∧ WWW ≥ 0.5
RHS: Flip Korn ≥ 0.5 ∧ Krithi Ramamritham < 4.5 ∨ Flip Korn < 0.5 ∧ Sudarshan S. Chawathe ≥ 3 ∧ Mayank Bawa ≥ 0.5
J = 0.833, E1,1 = 15

LHS: SODA ≥ 17.5 ∨ SODA < 17.5 ∧ FOCS ≥ 10.5 ∧ STOC ≥ 9.5
RHS: Richard Cole ≥ 0.5 ∨ Richard Cole < 0.5 ∧ Laszlo Lovasz ≥ 0.5 ∧ Juris Hartmanis < 0.5
J = 0.534, E1,1 = 43

LHS: STOC ≥ 8.5 ∨ STOC < 8.5 ∧ FOCS ≥ 8.5 ∧ SODA < 1.5
RHS: Avi Wigderson ≥ 0.5 ∨ Avi Wigderson < 0.5 ∧ Salil P. Vadhan ≥ 4.5 ∧ Shafi Goldwasser ≥ 0.5
J = 0.337, E1,1 = 33
With these parameters Algorithm 2 returned 70 unique redescriptions, 30 of which have a Jaccard coefficient above 0.8, with support ranging from several rows up to the whole data set. When comparing the performance of both algorithms on this data set, it can be seen that, as before, Algorithm 1 yields simpler, more intuitive rules with lower support (around 10 rows) and high Jaccard similarity between the queries, while Algorithm 2 returns longer, more detailed redescriptions with higher support but lower Jaccard similarity.
Hierarchical clustering. To exploit diversity, we tested one more clustering technique to discretize the DBLP data set. Hierarchical clustering [29, 25], similarly to k-means, does not detect the number of clusters automatically. We tried splitting each conference into 5 clusters and ran the presented redescription mining algorithms to see the performance.
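For completeness, the hierarchical variant of the same discretization step, cutting the dendrogram into five clusters per conference (linkage and names are illustrative assumptions):

    counts <- dblp_conf[["KDD"]]
    cl  <- cutree(hclust(dist(counts)), k = 5)   # default complete linkage
    bin <- sapply(1:5, function(k) as.integer(cl == k))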
Some resulting redescriptions for Algorithm 1 are presented in Table 5.9. Full results can be found in Appendix B.
Table 5.9: Redescriptions mined by Algorithm 1 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG impurity measure). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support.

LHS: ICDT ≥ 4.5 ∧ VLDB ≥ 0.5 ∧ ICDE ≥ 2.5
RHS: Gosta Grahne < 2.5 ∧ Kotagiri Ramamohanarao ≥ 13 ∨ Gosta Grahne ≥ 2.5 ∧ Jignesh M. Patel < 0.5
J = 1, E1,1 = 3

LHS: WWW ≥ 4.5 ∧ ICDM ≥ 3.5
RHS: Benyu Zhang ≥ 12
J = 1, E1,1 = 2

LHS: ECML ≥ 2.5 ∧ UAI ≥ 1.5
RHS: Peter Grunwald < 1.5 ∧ Stephen D. Bay ≥ 3.5 ∨ Peter Grunwald ≥ 1.5
J = 1, E1,1 = 5
With the parameters indicated in Table 5.9, Algorithm 1 was able to return 15 unique redescriptions with high Jaccard values and low supports (below 10 rows). Algorithm 2 returned 76 unique redescriptions. They share analogous features with the ones returned by Algorithm 2 before (i.e. using DBSCAN and k-means). Table 5.10 contains some examples, while the full report can be found in Appendix B.
Table 5.10: Redescriptions mined by Algorithm 2 from DBLP data set (with hierarchical (5 clusters) binarization routine; IG impurity measure). LHS is the left-hand side part of the redescription; RHS is the right-hand side part; J is the Jaccard similarity; E1,1 is the support.

LHS: SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5
RHS: Philip S. Yu ≥ 4.5 ∨ Philip S. Yu < 4.5 ∧ Vipin Kumar ≥ 0.5 ∧ Sunil Prabhakar < 1.5
J = 0.621, E1,1 = 125

LHS: PODS ≥ 2.5 ∨ PODS < 2.5 ∧ ICDT ≥ 0.5 ∧ STACS ≥ 0.5
RHS: Catriel Beeri ≥ 0.5 ∨ Catriel Beeri < 0.5 ∧ Leonid Libkin ≥ 0.5 ∧ Thomas Schwentick ≥ 0.5
J = 0.342, E1,1 = 13

LHS: SODA ≥ 5.5 ∨ SODA < 5.5 ∧ FOCS ≥ 0.5 ∧ STOC ≥ 8.5
RHS: Moses Charikar ≥ 0.5 ∨ Moses Charikar < 0.5 ∧ Avi Wigderson ≥ 0.5 ∧ Moni Naor ≥ 0.5
J = 0.245, E1,1 = 39
5.4.1 Discussion
When running the two algorithms on the DBLP data set, regardless of the binarization procedure, Algorithm 2 tends to find a considerably greater number of redescriptions, which are at the same time more complex in structure and longer than the ones found by Algorithm 1. The first algorithm, however, finds more intuitive, or obvious, redescriptions which are shorter and have a Jaccard coefficient of 1 or very close to it. The majority of them are supported either by fewer than 10 or by more than 2000 rows. This leads to the conclusion that on DBLP data Algorithm 1 tends to select obvious rules which hold only for a few rows of the data set. This issue can be adjusted to some extent by the min bucket parameter: whenever we increase it, we tend to mine redescriptions with higher supports. This effect can be exploited on other data sets as well.
On this data set Algorithm 2 tends to find more interesting results, supported by a number of rows which is greater than 10 but small enough not to cover the whole data. This makes its results more informative. However, redescriptions which carry almost no useful information appear here as well; as before, increasing the min bucket parameter can fix this. If we investigate the performance of both algorithms on the DBLP data set in detail, we can see the following:
• When using DBSCAN, Algorithm 1 tends to mine considerably fewer redescriptions than Algorithm 2. The accuracy of the results varies as well. Namely, Algorithm 1 returns redescriptions with perfect Jaccard (i.e. exactly 1) in most cases, but the support of these redescriptions is below 10 rows; still, all of them have p-value = 0, making them significant at the highest level. Algorithm 2 returns less uniform outcomes, which means we observed a variety of supports (from a few rows up to almost all) and accuracies: here the Jaccard coefficients range from 0.99 down to 0.06. Algorithm 2 returned up to 20% statistically insignificant results (i.e. p-value > 0.01); these occur in redescriptions which have high support, E1,1 > 1500.
• The structure of the resulting redescriptions is similar to the results received on the Bio data set, i.e. Algorithm 1 returns more compact structures involving fewer attributes, with the decision tree induction routine terminating before reaching the maximal allowed depth. Respectively, Algorithm 2 returned deeper trees, which resulted in longer redescriptions involving a greater number of parameters.
• When using k-means for discretization of the left-hand side of the data set, both algorithms returned a greater number of statistically insignificant results: Algorithm 1 up to 30% and Algorithm 2 up to 10%. If we look at the Top-5 redescriptions (per Jaccard), it can be noted that Algorithm 1 returns rules with support around 5 rows, but these redescriptions involve quite extreme cases, for example an author who published above 10 papers within one conference; the low support here is not surprising, because there are not many researchers in Computer Science who publish that many articles. Algorithm 2 in its Top-5 redescriptions (per Jaccard) returned rules with E1,1 > 1700 rows, and the parameters inside reflect a more common number of papers that a researcher submits to one conference (from 0 to 7 papers). Thus, these high supports are not surprising either.
• Having applied hierarchical clustering to turn the left-hand side of the data set into a binary matrix, both algorithms behave as before. Yet, Algorithm 1 returned only statistically significant redescriptions, while for Algorithm 2 up to 15% of them did not pass the p-value < 0.01 threshold. The accuracy of the results of Algorithm 1 is perfect (Jaccard exactly 1), but the redescriptions again describe cases when an author submits an unusually large number of papers to a conference (above 10); hence the support sizes of these results are low. For Algorithm 2 we observed diverse supports, the majority of which lie between 20 and 700 rows, giving the redescriptions the desirable interestingness.
There is no strict formal limitation on the support of the mined redescriptions; this criterion is rather determined by the data set we work with. Thus, for the DBLP data we adopt the idea that support between 10 and 1800 rows would pose an interest. However, this choice is influenced only by the nature of this particular data set.
All in all, the selection of the clustering method within the binarization routine (DBSCAN, k-means, hierarchical clustering) on the DBLP data set significantly affects neither the number of mined redescriptions nor their quality. The only noticeable difference is that when using k-means both algorithms return more statistically insignificant redescriptions. Both algorithms tend to return results typical for them with all clustering methods used. This is caused by the fact that the discretized data participates only in the inception of an algorithm's run; afterwards the algorithm works through the data in the fully non-binary setting.
Hence, in cases when the user has no previous knowledge of the data, we suggest using DBSCAN, because it determines the number of clusters automatically and can prevent clustering of the data set into too many clusters, which would lead to an unwanted computational burden. On the other hand, when the user wants to segregate the data into a certain number of clusters, k-means, hierarchical clustering, or any other available clustering algorithm can be used for this purpose.
5.5 Experiments against ReReMi algorithm
To evaluate Algorithms 1 and 2 we compared them to the ReReMi algorithm presented in [20] and extended with on-the-fly bucketing in [18]. ReReMi reported meaningful results for redescription mining both on real-world and synthetic data. Hence, it is a logical choice to compare Algorithms 1 and 2 against it using the same data sets.
Comparing algorithms for redescription mining on real-world data sets is an intricate task, since they might produce different types of redescriptions. A parameter such as 'interestingness' is hard to measure, yet it is important when analyzing a set of mined redescriptions. We ran the ReReMi algorithm with analogous parameters on both the Bio and the DBLP data set. We used a depth limit of 3 when running Algorithms 1 and 2, which corresponds to a maximum of 7 variables involved in each query of the ReReMi algorithm. We allowed the usage of conjunction and disjunction operators on both sides of a redescription in Algorithms 1, 2 and ReReMi. However, when running Algorithms 1 and 2 we vary the impurity measure, which does not have an identical equivalent in the ReReMi algorithm. The min bucket parameter can be related to the minimal contribution in ReReMi (we used 0.05; details in [19]). In addition, we allow as many initial pairs as there are runs of our Algorithms 1 and 2 in each particular case.
Bio. Using the same Bio data set, ReReMi returned 209 unique statistically significant redescriptions, 201 of which have a Jaccard higher than 0.8. However, the sizes of the supports tend to be large, i.e. E1,1 > 1300, meaning that most redescriptions cover a high percentage of the rows of the data set. Algorithm 1 returned 140 unique statistically significant redescriptions. They also have diverse supports. Yet, unlike for ReReMi, we observed redescriptions which have a Jaccard of exactly 1 and low support (around 10). This means that Algorithm 1 tends to mine more obvious and less informative rules than ReReMi. Algorithm 2 returned 156 unique statistically significant redescriptions, with support no lower than 30 rows, which makes them informative. In general its results are closer to ReReMi's. Many of the mined redescriptions (by either ReReMi or Algorithms 1 and 2) overlap, describing similar parts of the Bio data set.
DBLP. We used the same DBLP data set to compare the ReReMi algorithm and Algorithms 1 and 2. ReReMi returned 102 redescriptions with support mainly around 10 rows, yet many of them have higher support (up to 68 rows), making them quite interesting. The Jaccard coefficients of 37 of them are above 0.5.
Algorithm 1 (with the Gini index) mined only 32 redescriptions, the majority of which have support below 10 rows. These redescriptions have a shorter structure compared to ReReMi and, despite the fact that we allowed a maximal depth of 3 (i.e. involving at most 7 parameters), include fewer parameters. Thus, these results show obvious rules which carry little interesting information. At the same time, Algorithm 2 returned 81 statistically significant redescriptions whose support confirms their interestingness (above 15 rows); 30 of them have a Jaccard above 0.5. They are more complex in structure than the ones returned by Algorithm 1, yet similar to the ones returned by ReReMi.
General remark. Algorithms 1 and 2 are different from ReReMi, since they use distinct approaches to mine and assess redescriptions (decision tree induction versus greedy atomic updates). The CART approach underlying Algorithms 1 and 2 involves the usage of impurity measures, which have no counterpart in ReReMi. In addition, ReReMi allows direct indication of the minimal support size of a resulting redescription, while the min bucket parameter we use only adjusts the minimal number of entities in each node of the decision tree, which does not guarantee a minimal support size. Moreover, our query language and the way we extract redescriptions from a pair of decision trees allow the same variable to participate in a query several times, which is not the case in ReReMi. This makes the results difficult to compare with each other. Finally, Algorithms 1 and 2, when processing a fully numerical data set, use clustering as a pre-processing step, since they require binary targets at their inception. This brings up one more aspect of incompatibility of the results, because ReReMi uses an on-the-fly bucketing approach [18] when processing fully numerical data sets. Despite this, the rules mined by Algorithm 2 over the DBLP data set resemble the ones mined by ReReMi. Note that there are no strict limits on the minimal or maximal support that makes a redescription acceptable or not; usually, this indicator is set based on the particular data set we work with. In the experiments with the computer science bibliography we consider redescriptions interesting when their support is higher than 10 rows. Logically, redescriptions which are supported by nearly the whole data set carry no useful information either, because they simply describe a rule which is true for nearly all entities of the data.
Chapter 6
Conclusions and Future Work
This Thesis is dedicated to the data analysis task called redescription mining, which aims to discover objects which have multiple common descriptions or, vice versa, to reveal shared characteristics of a set of objects. Redescription mining gives insight into the data with the help of queries that relate different views of its objects, and provides a domain-neutral way to cast complex data mining scenarios in terms of simpler primitives. In this Thesis we extended the alternating algorithm for redescription mining beyond propositional Boolean queries to real-valued attributes and presented two algorithms based on decision tree induction to mine redescriptions. The peculiarities of the parameters used were discussed in detail and their influence on real-world data sets was explored. We ran our algorithms on two distinct real-world data sets and received results that can be used for the problems discussed in these domains.
Numerous runs of the algorithms showed that they are able to find reasonable, statistically significant redescriptions in the studied domains. The actual value of the outcomes can only be evaluated by putting them to use in collaboration with experts of the corresponding fields. The underlying principle of redescription mining seems easy and intuitive, yet it forms a powerful tool for data exploration that can find practical application in numerous domains. Existing algorithms for redescription mining, augmented by our contributions, empower scientists to create their own descriptors and reason with them for a better understanding of scientific data sets.
There is a big field for future work on redescription mining. In particular, effective methods with profound theoretical foundations are needed to model the information content of redescriptions in the subjective interestingness framework. Cooperation of the elaborated algorithms with existing methods of filtering and post-processing of redescriptions poses an interest as well. Since uncertainties are inherent to most real-world scenarios, redescription mining should be tailored to take them into consideration, potentially by using other data analysis developments [5].
Bibliography
[1] http://www.salford-systems.com/products/cart.
[2] http://www.informatik.uni-trier.de/∼ ley/db/.
[3] https://www.salford-systems.com/resources/whitepapers/115-technical-note-forstatisticians.
[4] http://en.wikipedia.org/wiki/DBSCAN.
[5] C. C. Aggarwal. Managing and Mining Uncertain Data: 3, A., volume 35. Springer, 2010.
[6] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th
int. conf. very large data bases, VLDB, volume 1215, pages 487–499, 1994.
[7] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In
Proceedings of the eleventh annual conference on Computational learning theory, pages 92–
100. ACM, 1998.
[8] E. Boros, P. L. Hammer, T. Ibaraki, A. Kogan, E. Mayoraz, and I. Muchnik. An implementation of logical analysis of data. Knowledge and Data Engineering, IEEE Transactions on,
12(2):292–306, 2000.
[9] P. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large
databases. Communications of the ACM, 45(8):38–43, 2002.
[10] L. Breiman. Technical note: Some properties of splitting criteria. Machine Learning,
24(1):41–47, 1996.
[11] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Principles of
Data Mining and Knowledge Discovery, pages 74–86. Springer, 2002.
[12] J. Crawford and F. Crawford. Data mining in a scientific environment. In In Proceedings
of AUUG 96 and Asia Pacific World Wide Web, 1996.
[13] P. S. E. Hunt, J. Marin. Experiments in induction. Academic Press, New York, 1966.
[14] E. Edgington and P. Onghena. Randomization tests. CRC Press, 2007.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.
[16] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1.
Springer Series in Statistics New York, 2001.
[17] E. Galbrun et al. Methods for redescription mining. 2013.
[18] E. Galbrun and P. Miettinen. From black and white to full color: extending redescription
mining outside the boolean world. Statistical Analysis and Data Mining, pages 284–303,
2012.
[19] E. Galbrun and P. Miettinen. Siren demo at sigmod 2014. 2014.
[20] A. Gallo, P. Miettinen, and H. Mannila. Finding subgroups having several descriptions:
Algorithms for redescription mining. In SDM, pages 334–345. SIAM, 2008.
[21] G. Gigerenzer and Z. Swijtink. The empire of chance: How probability changed science and
everyday life, volume 12. Cambridge University Press, 1989.
[22] C. Goutte, P. Toft, E. Rostrup, F. Å. Nielsen, and L. K. Hansen. On clustering fmri time
series. NeuroImage, 9(3):298–310, 1999.
[23] J. Grinnell. The niche-relationships of the california thrasher. The Auk, pages 427–433,
1917.
[24] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future
directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
[25] T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, J. Friedman, and R. Tibshirani. The
elements of statistical learning, volume 2. Springer, 2009.
[26] R. J. Hijmans, S. E. Cameron, J. L. Parra, P. G. Jones, and A. Jarvis. Very high resolution
interpolated climate surfaces for global land areas. International journal of climatology,
25(15):1965–1978, 2005.
[27] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions
voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37.
[28] C. Kamath, E. Cantú-Paz, I. K. Fodor, and N. A. Tang. Classifying of bent-double galaxies.
Computing in Science & Engineering, 4(4):52–60, 2002.
[29] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction to cluster
analysis, volume 344. John Wiley & Sons, 2009.
[30] D. J. Ketchen and C. L. Shook. The application of cluster analysis in strategic management
research: an analysis and critique. Strategic management journal, 17(6):441–458, 1996.
[31] J. P. Kleijnen. Cross-validation using the t statistic. European Journal of Operational
Research, 13(2):133–141, 1983.
[32] M. Krzywinski and N. Altman. Points of significance: Significance, p values and t-tests.
Nature methods, 10(11):1041–1042, 2013.
[33] D. Kumar. Redescription mining: Algorithms and applications in bioinformatics. PhD
thesis, Virginia Polytechnic Institute and State University, 2007.
[34] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.
[35] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. springer, 2006.
[36] S. C. Lemon, J. Roy, M. A. Clark, P. D. Friedmann, and W. Rakowski. Classification
and regression tree analysis in public health: methodological review and comparison with
logistic regression. Annals of behavioral medicine, 26(3):172–181, 2003.
[37] J. MacQueen. Some methods for classification and analysis of multivariate observations,
1967.
[38] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association
rules. In KDD-94: AAAI workshop on Knowledge Discovery in Databases, pages 181–192,
1994.
[39] K. Meier, J. Brudney, and J. Bohte. Applied statistics for public and nonprofit administration. Cengage Learning, 2011.
[40] A. J. Mitchell-Jones. The atlas of european mammals,. Academic Press, London,, 1999.
[41] P. K. Novak, N. Lavrač, and G. I. Webb. Supervised descriptive rule discovery: A unifying
survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine
Learning Research, 10:377–403, 2009.
[42] V. K. Pang-Ning Tan, Michael Steinbach. Introduction to Data Mining. Addison Wesley,
2006.
[43] L. Parida and N. Ramakrishnan. Redescription mining: Structure theory and algorithms.
In AAAI, volume 5, pages 837–844, 2005.
[44] F. Questier, R. Put, D. Coomans, B. Walczak, and Y. V. Heyden. The use of cart and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics
and Intelligent Laboratory Systems, 76(1):45–54, 2005.
[45] J. R. Quevedo, A. Bahamonde, and O. Luaces. A simple and efficient method for variable
ranking according to their usefulness for learning. Computational Statistics & Data Analysis,
52(1):578–595, 2007.
[46] J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
[47] J. R. Quinlan. C4. 5: programs for machine learning, volume 1. Morgan kaufmann, 1993.
[48] N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts, and R. F. Helm. Turning cartwheels: an
alternating algorithm for mining redescriptions. In Proceedings of the tenth ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 266–275. ACM,
2004.
[49] J. Soberón and M. Nakamura. Niches and distributional areas: concepts, methods, and
assumptions. Proceedings of the National Academy of Sciences, 106(Supplement 2):19644–
19650, 2009.
[50] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables.
In ACM SIGMOD Record, volume 25, pages 1–12. ACM, 1996.
[51] D. Steinberg and P. Colla. Cart: classification and regression trees. The Top Ten Algorithms
in Data Mining, 9:179, 2009.
[52] T. M. Therneau, B. Atkinson, and M. B. Ripley. The rpart package, 2010.
[53] R. L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267–276, 1953.
[54] A. Tripathi, A. Klami, M. Orešič, and S. Kaski. Matching samples of multiple views. Data
Mining and Knowledge Discovery, 23(2):300–321, 2011.
[55] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data mining and
knowledge discovery handbook, pages 667–685. Springer, 2010.
[56] L. Umek, B. Zupan, M. Toplak, A. Morin, J.-H. Chauchat, G. Makovec, and D. Smrke.
Subgroup Discovery in Data Sets with Multi–dimensional Responses: A Method and a Case
Study in Traumatology. Springer, 2009.
[57] V. N. Vapnik. The nature of statistical learning theory. statistics for engineering and information science. Springer-Verlag, New York, 2000.
[58] G. P. Wadsworth and J. G. Bryan. Introduction to probability and random variables, volume 7. McGraw-Hill New York:, 1960.
[59] G. J. Williams. Rattle: a data mining gui for r. The R Journal, 1(2):45–55, 2009.
[60] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In
Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pages
189–196. Association for Computational Linguistics, 1995.
[61] M. J. Zaki. Generating non-redundant association rules. In Proceedings of the sixth ACM
SIGKDD international conference on Knowledge discovery and data mining, pages 34–43.
ACM, 2000.
[62] M. J. Zaki and C.-J. Hsiao. Charm: An efficient algorithm for closed itemset mining. In
SDM, volume 2, pages 457–473. SIAM, 2002.
[63] M. J. Zaki and N. Ramakrishnan. Reasoning about sets using redescription mining. In
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery
in data mining, pages 364–373. ACM, 2005.
Appendix A
Redescription Sets from Experiments with Bio Data Set
Table A.1: Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min bucket = 20). J - Jaccard similarity; E1,1 - support; tn-max, tn-min and tn-avg stand for the maximum, minimum, and average temperature of month n in degrees Celsius, and pn stands for the average precipitation of month n in millimeters.

J = 0.995, E1,1 = 2454, p-value = 0:
(Arctic.Fox ≥ 0.5 ∧ Red.Fox ≥ 0.5) ∨ (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t7-max ≥ 13.45)

J = 0.988, E1,1 = 2434, p-value = 0:
(Arctic.Fox ≥ 0.5 ∧ Roe.Deer ≥ 0.5) ∨ (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t8-max < 13.35 ∧ t7-avg ≥ 9.72) ∨ (t8-max ≥ 13.35)

J = 0.979, E1,1 = 2379, p-value = 0:
(Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t8-max < 14.65 ∧ t7-min ≥ 6.35) ∨ (t8-max ≥ 14.65)

J = 0.972, E1,1 = 2276, p-value = 0:
(Laxmann.s.Shrew ≥ 0.5 ∧ Yellow.necked.Mouse ≥ 0.5) ∨ (Laxmann.s.Shrew < 0.5 ∧ Grey.Red.Backed.Vole ≥ 0.5 ∧ Brown.long.eared.bat ≥ 0.5) ∨ (Laxmann.s.Shrew < 0.5 ∧ Grey.Red.Backed.Vole < 0.5 ∧ North.American.Beaver < 0.5) ←→ (t12-avg < -5.72 ∧ p8 ≥ 53.2 ∧ t4-max ≥ 5.95) ∨ (t12-avg < -5.72 ∧ p8 < 53.2) ∨ (t12-avg ≥ -5.72)

J = 0.97, E1,1 = 2326, p-value = 0:
(Arctic.Fox < 0.5 ∧ Norway.lemming ≥ 0.5 ∧ Wolverine < 0.5) ∨ (Arctic.Fox < 0.5 ∧ Norway.lemming < 0.5) ←→ (t9-max < 10.85 ∧ p8 ≥ 36.85 ∧ t6-avg ≥ 10.35) ∨ (t9-max < 10.85 ∧ p8 < 36.85) ∨ (t9-max ≥ 10.85)

J = 0.967, E1,1 = 2236, p-value = 0:
(Arctic.Fox < 0.5 ∧ Grey.Red.Backed.Vole ≥ 0.5 ∧ European.Hedgehog ≥ 0.5) ∨ (Arctic.Fox < 0.5 ∧ Grey.Red.Backed.Vole < 0.5 ∧ Polar.bear < 0.5) ←→ (t9-avg < 7.835 ∧ t9-max < 10.95 ∧ t7-max ≥ 18.95) ∨ (t9-avg < 7.835 ∧ t9-max ≥ 10.95 ∧ p8 ≥ 102.9) ∨ (t9-avg ≥ 7.835)

J = 0.956, E1,1 = 1975, p-value = 0:
(Moose ≥ 0.5 ∧ Wood.mouse < 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Wood.mouse ≥ 0.5 ∧ European.Hare ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink < 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t10-max < 9.75 ∧ p8 ≥ 58.85 ∧ t12-min ≥ -3.35) ∨ (t10-max < 9.75 ∧ p8 < 58.85) ∨ (t10-max ≥ 9.75 ∧ t3-max < 3.45 ∧ p10 < 55.1) ∨ (t10-max ≥ 9.75 ∧ t3-max ≥ 3.45)

J = 0.956, E1,1 = 1831, p-value = 0:
(Moose ≥ 0.5 ∧ Mountain.Hare < 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t3-max < 4.65 ∧ t10-max < 11.55 ∧ p6 ≥ 114.5) ∨ (t3-max < 4.65 ∧ t10-max ≥ 11.55) ∨ (t3-max ≥ 4.65)

J = 0.955, E1,1 = 1876, p-value = 0:
(Moose ≥ 0.5 ∧ Mountain.Hare < 0.5 ∧ Brown.long.eared.bat ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink < 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t3-max < 4.55 ∧ p8 ≥ 56.75 ∧ p5 ≥ 96.5) ∨ (t3-max < 4.55 ∧ p8 < 56.75) ∨ (t3-max ≥ 4.55)

J = 0.953, E1,1 = 1993, p-value = 0:
(Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5 ∧ Southern.White.breasted.Hedgehog ≥ 0.5) ∨ (Wood.mouse < 0.5 ∧ Mountain.Hare < 0.5 ∧ Polar.bear < 0.5) ∨ (Wood.mouse ≥ 0.5 ∧ Arctic.Fox < 0.5) ←→ (t4-max < 7.75 ∧ t10-max ≥ 8.95) ∨ (t4-max ≥ 7.75 ∧ t4-max < 8.35 ∧ t8-max < 19.85) ∨ (t4-max ≥ 7.75 ∧ t4-max ≥ 8.35)
Table A.1 (continued).

J = 0.95, E1,1 = 1923, p-value = 0:
(Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew < 0.5 ∧ Harbor.Seal ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew ≥ 0.5) ∨ (Moose < 0.5) ←→ (t2-max < 0.45 ∧ t6-max ≥ 11.45 ∧ p5 ≥ 63.15) ∨ (t2-max < 0.45 ∧ t6-max < 11.45) ∨ (t2-max ≥ 0.45 ∧ t3-max < 4.85 ∧ t6-max < 19.05) ∨ (t2-max ≥ 0.45 ∧ t3-max ≥ 4.85)

J = 0.947, E1,1 = 36, p-value = 0:
(Polar.bear ≥ 0.5) ←→ (t3-max < -7.05)

J = 0.945, E1,1 = 1803, p-value = 0:
(Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink < 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t3-max < 5.15 ∧ p6 ≥ 34.35 ∧ t12-min ≥ -0.95) ∨ (t3-max < 5.15 ∧ p6 < 34.35) ∨ (t3-max ≥ 5.15)

J = 0.944, E1,1 = 1996, p-value = 0:
(Moose ≥ 0.5 ∧ Wood.mouse ≥ 0.5) ∨ (Moose < 0.5 ∧ Polar.bear < 0.5) ←→ (t2-max < -1.15 ∧ t2-max ≥ -2.05 ∧ t12-min < -5.25) ∨ (t2-max ≥ -1.15 ∧ t4-max < 7.55 ∧ t7-max < 14.05) ∨ (t2-max ≥ -1.15 ∧ t4-max ≥ 7.55)

J = 0.94, E1,1 = 2232, p-value = 0:
(European.Hamster ≥ 0.5 ∧ House.mouse.1 < 0.5) ∨ (European.Hamster < 0.5 ∧ European.ground.squirrel ≥ 0.5 ∧ Southern.Vole ≥ 0.5) ∨ (European.Hamster < 0.5 ∧ European.ground.squirrel < 0.5) ←→ (p10 < 45.15 ∧ p6 ≥ 61.85 ∧ p6 ≥ 92.15) ∨ (p10 < 45.15 ∧ p6 < 61.85) ∨ (p10 ≥ 45.15)

J = 0.934, E1,1 = 1926, p-value = 0:
(Wild.boar < 0.5 ∧ Mountain.Hare ≥ 0.5 ∧ Yellow.necked.Mouse ≥ 0.5) ∨ (Wild.boar < 0.5 ∧ Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5) ∨ (Wild.boar ≥ 0.5) ←→ (t8-max < 18.85 ∧ t10-max < 7.15 ∧ p8 < 36.85) ∨ (t8-max < 18.85 ∧ t10-max ≥ 7.15 ∧ p5 ≥ 101.5) ∨ (t8-max ≥ 18.85 ∧ t8-max < 19.35 ∧ t6-avg ≥ 13.65) ∨ (t8-max ≥ 18.85 ∧ t8-max ≥ 19.35)

J = 0.923, E1,1 = 1722, p-value = 0:
(Mountain.Hare ≥ 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t9-max < 15.95 ∧ p8 < 36.85) ∨ (t9-max ≥ 15.95 ∧ t7-max ≥ 19.15)

J = 0.91, E1,1 = 1450, p-value = 0:
(American.Mink ≥ 0.5 ∧ Grey.long.eared.bat < 0.5 ∧ Greater.White.toothed.Shrew ≥ 0.5) ∨ (American.Mink ≥ 0.5 ∧ Grey.long.eared.bat ≥ 0.5) ∨ (American.Mink < 0.5 ∧ Gray.Seal < 0.5 ∧ Mountain.Hare < 0.5) ←→ (t9-max < 17.95 ∧ t5-max < 4.5) ∨ (t9-max ≥ 17.95 ∧ t4-max < 13.05 ∧ p7 ≥ 56.9) ∨ (t9-max ≥ 17.95 ∧ t4-max ≥ 13.05)

J = 0.91, E1,1 = 1906, p-value = 0:
(Roe.Deer < 0.5 ∧ Eurasian.Pygmy.Shrew < 0.5 ∧ Eurasian.Lynx ≥ 0.5) ∨ (Roe.Deer < 0.5 ∧ Eurasian.Pygmy.Shrew ≥ 0.5) ∨ (Roe.Deer ≥ 0.5 ∧ Granada.Hare ≥ 0.5 ∧ Mediterranean.Water.Shrew ≥ 0.5) ∨ (Roe.Deer ≥ 0.5 ∧ Granada.Hare < 0.5) ←→ (p6 < 39.65 ∧ p7 ≥ 50.1) ∨ (p6 ≥ 39.65 ∧ t7-max < 14.65 ∧ t1-max < -1.9) ∨ (p6 ≥ 39.65 ∧ t7-max ≥ 14.65 ∧ t9-avg < 19.75)
Table A.1 (continued).

J = 0.908, E1,1 = 1798, p-value = 0:
(Mountain.Hare ≥ 0.5 ∧ Wild.boar ≥ 0.5 ∧ European.Snow.Vole < 0.5) ∨ (Mountain.Hare < 0.5) ←→ (t5-avg < 10.35 ∧ t7-max < 13.45) ∨ (t5-avg ≥ 10.35 ∧ t6-avg ≥ 13.55)

J = 0.906, E1,1 = 1597, p-value = 0:
(Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew < 0.5) ←→ (t3-max ≥ 11.15 ∧ t1-avg < 1.045 ∧ t10-max < 18.4) ∨ (t3-max < 11.15 ∧ t8-max < 11.95 ∧ t9-avg ≥ 4.705) ∨ (t3-max < 11.15 ∧ t8-max ≥ 11.95)

J = 0.905, E1,1 = 1942, p-value = 0:
(Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5 ∧ Common.Vole ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse < 0.5 ∧ Mediterranean.Monk.Seal < 0.5) ∨ (Common.Shrew ≥ 0.5) ←→ (t12-avg < 5.69 ∧ t1-avg ≥ 2.715 ∧ t6-avg ≥ 13.35) ∨ (t12-avg < 5.69 ∧ t1-avg < 2.715)

J = 0.9, E1,1 = 1525, p-value = 0:
(Mountain.Hare ≥ 0.5 ∧ Chamois ≥ 0.5 ∧ Alpine.Field.Mouse < 0.5) ∨ (Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5 ∧ Gray.Seal < 0.5) ←→ (t9-max < 17.55 ∧ t9-max < 15.65 ∧ p8 < 36.85) ∨ (t9-max ≥ 17.55 ∧ t5-max ≥ 15.85)

J = 0.899, E1,1 = 1644, p-value = 0:
(Mountain.Hare ≥ 0.5 ∧ Beech.Marten ≥ 0.5 ∧ Alpine.Field.Mouse < 0.5) ∨ (Mountain.Hare < 0.5 ∧ Gray.Seal ≥ 0.5 ∧ Serotine.bat ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ Gray.Seal < 0.5) ←→ (t9-max < 17.15 ∧ t8-max < 12.35) ∨ (t9-max ≥ 17.15 ∧ t8-max < 20.85 ∧ t3-min < 1.9) ∨ (t9-max ≥ 17.15 ∧ t8-max ≥ 20.85)

J = 0.898, E1,1 = 1929, p-value = 0:
(Stoat < 0.5 ∧ House.mouse ≥ 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Stoat < 0.5 ∧ House.mouse < 0.5) ∨ (Stoat ≥ 0.5 ∧ Algerian.Mouse < 0.5) ←→ (t11-max ≥ 11.25 ∧ t9-max ≥ 21.9 ∧ p10 < 54.05) ∨ (t11-max ≥ 11.25 ∧ t9-max < 21.9) ∨ (t11-max < 11.25 ∧ t11-max ≥ 10.45 ∧ t11-min ≥ 0.95) ∨ (t11-max < 11.25 ∧ t11-max < 10.45)

J = 0.897, E1,1 = 1679, p-value = 0:
(Mountain.Hare ≥ 0.5 ∧ Wild.boar ≥ 0.5 ∧ Gray.Seal < 0.5) ∨ (Mountain.Hare < 0.5 ∧ Gray.Seal < 0.5) ←→ (t5-max < 15.85 ∧ t8-max < 12.35) ∨ (t5-max ≥ 15.85)

J = 0.896, E1,1 = 1399, p-value = 0:
(Mountain.Hare ≥ 0.5 ∧ Chamois ≥ 0.5 ∧ Alpine.Field.Mouse < 0.5) ∨ (Mountain.Hare < 0.5 ∧ American.Mink ≥ 0.5 ∧ Greater.White.toothed.Shrew ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ American.Mink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t9-max < 17.95 ∧ t5-max < 4.5) ∨ (t9-max ≥ 17.95 ∧ t4-max < 13.75 ∧ p5 ≥ 56.8) ∨ (t9-max ≥ 17.95 ∧ t4-max ≥ 13.75)

J = 0.894, E1,1 = 693, p-value = 0:
(Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5 ∧ Polar.bear ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ Arctic.Fox ≥ 0.5) ∨ (Mountain.Hare ≥ 0.5 ∧ Moose < 0.5 ∧ European.Badger < 0.5) ∨ (Mountain.Hare ≥ 0.5 ∧ Moose ≥ 0.5) ←→ (t10-max < 11.55 ∧ p5 < 109.5 ∧ t3-max < 6.05)
J
0.894
E1,1
1735
p-val.
0
0.892
1450
0
0.89
1354
0
0.887
1450
0
0.877
1761
0
0.874
1694
0
0.873
982
0
0.869
1540
0
0.868
1607
0
0.866
1653
0
79
Redescription
(Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5 ∧ M ountain.Hare ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse <
0.5 ∧ M editerranean.M onk.Seal < 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Savi.s.P ine.V ole < 0.5) ←→ (t3 − max ≥
11.25 ∧ t1 − avg < 0.2825) ∨ (t3 − max < 11.25)
(House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ N orthern.Bat ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5 ∧
Y ellow.necked.M ouse ≥ 0.5) ∨ (House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ M editerranean.M onk.Seal < 0.5) ←→
(t1 − max ≥ 4.35 ∧ t1 − min < −1.95 ∧ t11 − min ≥ 2.45) ∨ (t1 − max < 4.35)
(American.M ink ≥ 0.5 ∧ Grey.long.eared.bat ≥ 0.5 ∧ Common.V ole ≥ 0.5) ∨ (American.M ink < 0.5 ∧ Gray.Seal <
0.5∧M ountain.Hare < 0.5) ←→ (t8−max < 22.05∧t5−max < 4.5)∨(t8−max ≥ 22.05∧t10−max ≥ 12.05∧p6 < 125)
(House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ N orthern.Bat ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5) ∨
(House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ M editerranean.M onk.Seal < 0.5) ←→ (t1 − max < 4.45)
(Stoat < 0.5 ∧ Bank.V ole < 0.5 ∧ Common.Shrew ≥ 0.5) ∨ (Stoat < 0.5 ∧ Bank.V ole ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧
Granada.Hare < 0.5) ←→ (p7 ≥ 39.6 ∧ t7 − max ≥ 13.45)
(M uskrat < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (M uskrat < 0.5 ∧ Common.Shrew ≥ 0.5) ∨
(M uskrat ≥ 0.5) ←→ (p7 ≥ 42.35 ∧ p1 ≥ 100.5 ∧ t2 − min < −0.25) ∨ (p7 ≥ 42.35 ∧ p1 < 100.5)
(American.M ink < 0.5 ∧ Gray.Seal < 0.5 ∧ M ountain.Hare ≥ 0.5) ∨ (American.M ink < 0.5 ∧ Gray.Seal ≥
0.5) ∨ (American.M ink ≥ 0.5 ∧ Grey.long.eared.bat < 0.5 ∧ Greater.W hite.toothed.Shrew < 0.5) ←→ (t9 − max ≥
17.95 ∧ t4 − max < 13.05 ∧ p7 < 56.9) ∨ (t9 − max < 17.95 ∧ t5 − max ≥ 4.5)
(Stoat < 0.5 ∧ American.M ink < 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.M ink ≥ 0.5 ∧ Beech.M arten <
0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5 ∧ Algerian.M ouse < 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Genet < 0.5 ∧
Steppe.M ouse < 0.5) ←→ (t8 − max ≥ 24.75 ∧ p8 ≥ 65.45 ∧ t9 − avg < 17.35) ∨ (t8 − max < 24.75 ∧ t5 − max ≥ 4.5)
(Common.Shrew < 0.5 ∧ Beech.M arten < 0.5 ∧ Black.rat < 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Stoat < 0.5 ∧
European.ground.squirrel < 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Stoat ≥ 0.5) ←→ (t3 − max ≥ 10.55 ∧ p7 ≥ 64.95) ∨
(t3 − max < 10.55 ∧ t9 − max ≥ 22.35 ∧ p11 < 45.9) ∨ (t3 − max < 10.55 ∧ t9 − max < 22.35)
(House.mouse ≥ 0.5 ∧ Common.Shrew ≥ 0.5) ∨ (House.mouse < 0.5 ∧ European.F ree.tailed.Bat < 0.5) ←→
(t2 − max ≥ 6.75 ∧ t3 − avg < 6.375 ∧ t6 − min ≥ 9.65) ∨ (t2 − max < 6.75 ∧ t2 − avg ≥ 1.985 ∧ t9 − max ≥
13.95) ∨ (t2 − max < 6.75 ∧ t2 − avg < 1.985)
J = 0.866, E1,1 = 1536, p-val. = 0
J = 0.864, E1,1 = 667, p-val. = 0
J = 0.845, E1,1 = 551, p-val. = 0
J = 0.844, E1,1 = 949, p-val. = 0
J = 0.842, E1,1 = 1195, p-val. = 0
J = 0.838, E1,1 = 335, p-val. = 0
J = 0.837, E1,1 = 589, p-val. = 0
J = 0.836, E1,1 = 940, p-val. = 0
J = 0.831, E1,1 = 813, p-val. = 0
Redescription
(Bank.V ole < 0.5 ∧ House.mouse.1 ≥ 0.5 ∧ Arctic.F ox < 0.5) ∨ (Bank.V ole ≥ 0.5 ∧ N orway.lemming ≥ 0.5 ∧
Raccoon.Dog ≥ 0.5) ∨ (Bank.V ole ≥ 0.5 ∧ N orway.lemming < 0.5 ∧ Roman.M ole < 0.5) ←→ (p7 < 36.65 ∧ p8 ≥
47.5) ∨ (p7 ≥ 36.65 ∧ t7 − max ≥ 17.95 ∧ t10 − avg < 13.95)
(M oose < 0.5∧Arctic.F ox ≥ 0.5∧American.M ink ≥ 0.5)∨(M oose ≥ 0.5∧Lesser.W hite.toothed.Shrew < 0.5) ←→
(t3 − max < 5.15 ∧ p6 ≥ 34.35 ∧ t12 − min < −0.95)
(M oose ≥ 0.5 ∧ Lesser.W hite.toothed.Shrew < 0.5 ∧ Harbor.Seal < 0.5) ←→ (t2 − max ≥ 0.45 ∧ t3 − max <
4.85 ∧ t6 − max ≥ 19.05) ∨ (t2 − max < 0.45 ∧ t6 − max ≥ 11.45 ∧ p5 < 63.15)
(House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ M editerranean.M onk.Seal ≥ 0.5) ∨ (House.mouse < 0.5 ∧
Granada.Hare ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5 ∧ Y ellow.necked.M ouse < 0.5) ∨ (House.mouse ≥
0.5∧Raccoon < 0.5∧N orthern.Bat < 0.5) ←→ (t1−max ≥ 4.35∧t1−min < −1.95∧t11−min < 2.45)∨(t1−max ≥
4.35 ∧ t1 − min ≥ −1.95)
(House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5 ∧ Y ellow.necked.M ouse ≥ 0.5) ∨ (House.mouse < 0.5 ∧ House.mouse.1 <
0.5∧M oose ≥ 0.5)∨(House.mouse < 0.5∧House.mouse.1 ≥ 0.5) ←→ (t1−max < 3.85∧t6−max < 10.7∧t10−avg ≥
1.35) ∨ (t1 − max < 3.85 ∧ t6 − max ≥ 10.7)
(Grey.Red.Backed.V ole < 0.5 ∧ Siberian.F lying.Squirrel < 0.5 ∧ W olverine ≥ 0.5) ∨ (Grey.Red.Backed.V ole <
0.5 ∧ Siberian.F lying.Squirrel ≥ 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5) ←→ (t2 − avg ≥ −6.82 ∧ t2 − avg <
−5.515 ∧ p5 < 41.35) ∨ (t2 − avg < −6.82 ∧ p8 ≥ 49.8)
(M oose < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (M oose ≥ 0.5 ∧ Lesser.W hite.toothed.Shrew < 0.5 ∧ Harbor.Seal < 0.5) ←→
(t2 − max ≥ 0.45 ∧ t3 − max < 4.85 ∧ t6 − max ≥ 19.05) ∨ (t2 − max < 0.45 ∧ p5 < 63.15)
(House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ M editerranean.M onk.Seal ≥ 0.5) ∨ (House.mouse < 0.5 ∧
Granada.Hare ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ N orthern.Bat < 0.5) ←→ (t1 − max ≥ 4.45)
(Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ M ountain.Hare < 0.5 ∧
House.mouse.1 < 0.5) ←→ (t3 − max < 11.15 ∧ t8 − max < 11.95 ∧ t9 − avg < 4.705) ∨ (t3 − max ≥ 11.15 ∧ t1 − avg <
1.045 ∧ t10 − max ≥ 18.4) ∨ (t3 − max ≥ 11.15 ∧ t1 − avg ≥ 1.045)
J = 0.831, E1,1 = 813, p-val. = 0
J = 0.828, E1,1 = 1080, p-val. = 0
J = 0.813, E1,1 = 1206, p-val. = 0
J = 0.811, E1,1 = 877, p-val. = 0
J = 0.803, E1,1 = 293, p-val. = 0
J = 0.802, E1,1 = 449, p-val. = 0
J = 0.802, E1,1 = 937, p-val. = 0
J = 0.797, E1,1 = 839, p-val. = 0
Redescription
(Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5 ∧
M ountain.Hare < 0.5) ←→ (t3 − max < 11.15 ∧ t8 − max < 11.95 ∧ t9 − avg < 4.705) ∨ (t3 − max ≥ 11.15 ∧ t1 − avg <
1.045 ∧ t10 − max ≥ 18.4) ∨ (t3 − max ≥ 11.15 ∧ t1 − avg ≥ 1.045)
(House.mouse < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5) ∨ (House.mouse ≥ 0.5 ∧ N orthern.Bat <
0.5 ∧ Striped.F ield.M ouse < 0.5) ←→ (t1 − avg < 0.684 ∧ t8 − max < 11.95 ∧ t9 − avg < 4.705) ∨ (t1 − avg ≥
0.684 ∧ p6 < 98.65)
(Common.V ole < 0.5 ∧ European.M ole < 0.5 ∧ European.P ine.V ole ≥ 0.5) ∨ (Common.V ole < 0.5 ∧
European.M ole ≥ 0.5 ∧ M oose < 0.5) ∨ (Common.V ole ≥ 0.5 ∧ M oose ≥ 0.5 ∧ M ountain.Hare < 0.5) ∨
(Common.V ole ≥ 0.5 ∧ M oose < 0.5) ←→ (t3 − max ≥ 3.95 ∧ t12 − max ≥ 10.25 ∧ p6 ≥ 63.4) ∨ (t3 − max ≥
3.95 ∧ t12 − max < 10.25)
(Common.Shrew ≥ 0.5 ∧ M arbled.P olecat < 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew ≥ 0.5 ∧
M arbled.P olecat ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ American.M ink ≥ 0.5 ∧ Greater.W hite.toothed.Shrew ≥
0.5) ∨ (Common.Shrew < 0.5 ∧ American.M ink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t10 − max < 15.25 ∧ t5 − max <
4.5) ∨ (t10 − max ≥ 15.25 ∧ t1 − avg < 0.6415 ∧ p8 < 49.85) ∨ (t10 − max ≥ 15.25 ∧ t1 − avg ≥ 0.6415)
(Grey.Red.Backed.V ole < 0.5 ∧ N orth.American.Beaver
< 0.5 ∧ Laxmann.s.Shrew
≥ 0.5) ∨
(Grey.Red.Backed.V ole < 0.5 ∧ N orth.American.Beaver ≥ 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5) ←→ (t2 − avg ≥
−6.82 ∧ t2 − avg < −5.865 ∧ p5 < 42.5) ∨ (t2 − avg < −6.82 ∧ p8 ≥ 53.2 ∧ t11 − max < 1.65)
(M oose ≥ 0.5 ∧ W ood.mouse < 0.5) ←→ (t2 − max ≥ −1.15 ∧ t4 − max < 7.55 ∧ t7 − max ≥ 14.05) ∨ (t2 − max <
−1.15 ∧ p8 ≥ 58.85)
(Stoat ≥ 0.5 ∧ Common.Bent.wing.Bat < 0.5 ∧ Steppe.M ouse ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Bent.wing.Bat ≥
0.5) ∨ (Stoat < 0.5 ∧ Arctic.F ox < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8 − max < 24.75 ∧ t8 − max < 23.85 ∧ p7 <
29.4) ∨ (t8 − max < 24.75 ∧ t8 − max ≥ 23.85 ∧ p4 ≥ 53.5) ∨ (t8 − max ≥ 24.75)
(Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5 ∧ M oose <
0.5) ←→ (t2 − max < 7.25 ∧ t7 − max < 12.35 ∧ t9 − avg < 4.705) ∨ (t2 − max ≥ 7.25 ∧ t3 − avg < 6.37 ∧ t5 − min <
6.35) ∨ (t2 − max ≥ 7.25 ∧ t3 − avg ≥ 6.37)
J = 0.794, E1,1 = 460, p-val. = 0
J = 0.792, E1,1 = 880, p-val. = 0
J = 0.787, E1,1 = 111, p-val. = 0
J = 0.786, E1,1 = 704, p-val. = 0
J = 0.775, E1,1 = 802, p-val. = 0
J = 0.772, E1,1 = 922, p-val. = 0
J = 0.758, E1,1 = 681, p-val. = 0
J = 0.754, E1,1 = 898, p-val. = 0
J = 0.735, E1,1 = 613, p-val. = 0
J = 0.733, E1,1 = 55, p-val. = 0
J = 0.722, E1,1 = 666, p-val. = 0
Redescription
(M oose < 0.5∧P olar.bear ≥ 0.5)∨(M oose ≥ 0.5∧W ood.mouse < 0.5) ←→ (t2−max ≥ −1.15∧t4−max < 7.55∧t7−
max ≥ 14.05) ∨ (t2 − max < −1.15 ∧ t2 − max ≥ −2.05 ∧ t12 − min ≥ −5.25) ∨ (t2 − max < −1.15 ∧ t2 − max < −2.05)
(Stoat ≥ 0.5 ∧ Algerian.M ouse < 0.5 ∧ Steppe.M ouse ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Algerian.M ouse ≥ 0.5) ∨ (Stoat <
0.5 ∧ American.M ink ≥ 0.5 ∧ Beech.M arten ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.M ink < 0.5 ∧ Gray.Seal < 0.5) ←→
(t8 − max < 24.75 ∧ t5 − max < 4.5) ∨ (t8 − max ≥ 24.75)
(Arctic.F ox < 0.5∧P olar.bear ≥ 0.5)∨(Arctic.F ox ≥ 0.5∧Roe.Deer < 0.5) ←→ (t8−max < 13.35∧t7−avg < 9.72)
(M ountain.Hare < 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (M ountain.Hare ≥ 0.5 ∧ W ild.boar ≥ 0.5 ∧ Gray.Seal ≥ 0.5) ∨
(M ountain.Hare ≥ 0.5 ∧ W ild.boar < 0.5) ←→ (t5 − max < 15.85 ∧ t8 − max ≥ 12.35)
(Stoat ≥ 0.5 ∧ Common.Genet < 0.5 ∧ Steppe.M ouse ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5 ∧
Algerian.M ouse ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.M ink ≥ 0.5 ∧ Beech.M arten ≥ 0.5) ∨ (Stoat < 0.5 ∧
American.M ink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8 − max < 24.75 ∧ t5 − max < 4.5) ∨ (t8 − max ≥ 24.75 ∧ p8 ≥
65.45 ∧ t9 − avg ≥ 17.35) ∨ (t8 − max ≥ 24.75 ∧ p8 < 65.45)
(W ild.boar ≥ 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (W ild.boar < 0.5 ∧ M ountain.Hare < 0.5 ∧ Black.rat < 0.5) ∨ (W ild.boar <
0.5 ∧ M ountain.Hare ≥ 0.5) ←→ (t5 − max ≥ 15.95 ∧ t6 − max < 20.15 ∧ p9 < 69.6) ∨ (t5 − max < 15.95 ∧ p5 < 112)
(Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5 ∧ European.P ine.V ole < 0.5) ∨ (Stoat < 0.5 ∧ House.mouse.1 < 0.5 ∧
Gray.Seal < 0.5) ←→ (t11 − max < 11.85 ∧ p7 < 41.9) ∨ (t11 − max ≥ 11.85 ∧ t10 − max ≥ 17.55)
(Common.V ole < 0.5 ∧ European.ground.squirrel ≥ 0.5) ∨ (Common.V ole ≥ 0.5 ∧ Alpine.P ine.V ole < 0.5) ←→
(t8 − max < 20.45 ∧ t6 − min < 9.85 ∧ p5 ≥ 112) ∨ (t8 − max < 20.45 ∧ t6 − min ≥ 9.85 ∧ t4 − avg < 4.22) ∨ (t8 − max ≥
20.45 ∧ t12 − max < 10.25 ∧ p10 < 86.05)
(Stoat ≥ 0.5 ∧ Spanish.M ole ≥ 0.5) ∨ (Stoat < 0.5 ∧ Common.Shrew ≥ 0.5 ∧ European.F ree.tailed.Bat ≥ 0.5) ∨
(Stoat < 0.5 ∧ Common.Shrew < 0.5 ∧ European.M ole < 0.5) ←→ (p7 ≥ 42.85 ∧ t7 − max < 13.45) ∨ (p7 < 42.85)
(Egyptian.M ongoose ≥ 0.5) ←→ (p8 < 7.47 ∧ p5 ≥ 27.5)
(House.mouse < 0.5 ∧ European.F ree.tailed.Bat ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Common.Shrew < 0.5) ←→
(t2 − max < 6.75 ∧ t2 − avg ≥ 1.985 ∧ t9 − max < 13.95) ∨ (t2 − max ≥ 6.75 ∧ t3 − avg < 6.375 ∧ t6 − min <
9.65) ∨ (t2 − max ≥ 6.75 ∧ t3 − avg ≥ 6.375)
J = 0.722, E1,1 = 636, p-val. = 0
J = 0.721, E1,1 = 536, p-val. = 0
J = 0.718, E1,1 = 140, p-val. = 0
J = 0.715, E1,1 = 178, p-val. = 0
J = 0.708, E1,1 = 600, p-val. = 0
J = 0.707, E1,1 = 188, p-val. = 0
J = 0.684, E1,1 = 117, p-val. = 0
J = 0.681, E1,1 = 552, p-val. = 0
J = 0.657, E1,1 = 142, p-val. = 0
J = 0.642, E1,1 = 70, p-val. = 0
J = 0.638, E1,1 = 347, p-val. = 0
J = 0.605, E1,1 = 130, p-val. = 0
Redescription
(M uskrat < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5) ←→ (p7 ≥ 42.35 ∧ p1 ≥ 100.5 ∧ t2 − min ≥
−0.25) ∨ (p7 < 42.35)
(Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5) ∨ (Stoat < 0.5 ∧ Black.rat < 0.5 ∧ Kuhl.s.P ipistrelle ≥ 0.5) ∨ (Stoat < 0.5 ∧
Black.rat ≥ 0.5∧House.mouse.1 < 0.5) ←→ (t11−max < 10.95∧t3−max ≥ 11.35∧t2−max ≥ 7.65)∨(t11−max ≥
10.95 ∧ t8 − max ≥ 22.7 ∧ p11 ≥ 49.95)
(Grey.Red.Backed.V ole < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5 ∧ Laxmann.s.Shrew < 0.5 ∧
Eurasian.W ater.Shrew < 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5 ∧ Laxmann.s.Shrew ≥ 0.5) ←→ (t11 − min <
−8.45 ∧ t2 − max ≥ −6.85 ∧ p4 < 42.6) ∨ (t11 − min < −8.45 ∧ t2 − max < −6.85)
(Arctic.F ox < 0.5 ∧ N orway.lemming ≥ 0.5 ∧ W olverine ≥ 0.5) ∨ (Arctic.F ox ≥ 0.5) ←→ (t9 − max < 10.85 ∧ p8 ≥
36.85 ∧ t6 − avg < 10.35)
(House.mouse < 0.5 ∧ M editerranean.M onk.Seal ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Y ellow.necked.M ouse ≥ 0.5 ∧
Common.V ole < 0.5) ∨ (House.mouse ≥ 0.5 ∧ Y ellow.necked.M ouse < 0.5 ∧ Common.Shrew < 0.5) ←→ (t1 − avg ≥
2.575 ∧ t2 − min < −0.05 ∧ t7 − min ≥ 12.95) ∨ (t1 − avg ≥ 2.575 ∧ t2 − min ≥ −0.05)
(Grey.Red.Backed.V ole < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5 ∧ European.Hedgehog < 0.5) ←→
(t11 − min ≥ −8.3 ∧ t11 − avg < −2.175 ∧ p7 ≥ 80.95) ∨ (t11 − min < −8.3 ∧ t11 − max < −0.65)
(P olar.bear < 0.5∧Laxmann.s.Shrew ≥ 0.5∧European.P olecat < 0.5)∨(P olar.bear ≥ 0.5) ←→ (t3−min < −12.55)
(Kuhl.s.P ipistrelle < 0.5 ∧ Southwestern.W ater.V ole < 0.5 ∧ Savi.s.P ine.V ole ≥ 0.5) ∨ (Kuhl.s.P ipistrelle <
0.5 ∧ Southwestern.W ater.V ole ≥ 0.5) ∨ (Kuhl.s.P ipistrelle ≥ 0.5 ∧ P arti.coloured.bat < 0.5 ∧ Alpine.marmot <
0.5) ←→ (t3 − max ≥ 11.15 ∧ t1 − avg ≥ 0.483)
(Laxmann.s.Shrew < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (Laxmann.s.Shrew ≥ 0.5 ∧ Y ellow.necked.M ouse < 0.5) ←→
(t2 − max < −5.55 ∧ t3 − min ≥ −12.95 ∧ t7 − max ≥ 19.95) ∨ (t2 − max < −5.55 ∧ t3 − min < −12.95)
(P olar.bear < 0.5∧N orthern.Red.backed.V ole ≥ 0.5∧Laxmann.s.Shrew ≥ 0.5)∨(P olar.bear ≥ 0.5) ←→ (t3−avg <
−8.635)
(Raccoon.Dog ≥ 0.5 ∧ House.mouse.1 < 0.5 ∧ Siberian.F lying.Squirrel ≥ 0.5) ∨ (Raccoon.Dog ≥ 0.5 ∧
House.mouse.1 ≥ 0.5 ∧ W ildcat < 0.5) ←→ (p2 < 34.05 ∧ p8 ≥ 55.15 ∧ t7 − max ≥ 20.15)
(Alpine.marmot < 0.5 ∧ Alpine.Shrew ≥ 0.5 ∧ European.Hamster < 0.5) ∨ (Alpine.marmot ≥ 0.5) ←→ (p5 ≥
89.45 ∧ p1 < 146.5 ∧ t2 − min < −3.35)
J = 0.583, E1,1 = 200, p-val. = 0
J = 0.571, E1,1 = 100, p-val. = 0
J = 0.551, E1,1 = 194, p-val. = 0
J = 0.494, E1,1 = 198, p-val. = 0
J = 0.492, E1,1 = 405, p-val. = 0
J = 0.455, E1,1 = 35, p-val. = 0
J = 0.444, E1,1 = 179, p-val. = 0
J = 0.407, E1,1 = 48, p-val. = 0
J = 0.392, E1,1 = 143, p-val. = 0
J = 0.383, E1,1 = 31, p-val. = 0
J = 0.349, E1,1 = 22, p-val. = 0
J = 0.323, E1,1 = 20, p-val. = 0
J = 0.294, E1,1 = 15, p-val. = 0
Redescription
(European.Hamster < 0.5 ∧ European.ground.squirrel ≥ 0.5 ∧ Southern.V ole < 0.5) ∨ (European.Hamster ≥
0.5 ∧ House.mouse.1 ≥ 0.5) ←→ (p10 < 45.15 ∧ p6 ≥ 61.85 ∧ p6 < 92.15)
(Alpine.Shrew ≥ 0.5 ∧ Alpine.marmot < 0.5 ∧ European.Hamster < 0.5) ∨ (Alpine.Shrew ≥ 0.5 ∧ Alpine.marmot ≥
0.5) ←→ (p5 ≥ 90.15 ∧ t1 − avg < −0.286 ∧ p10 < 152.5)
(European.Hamster < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (European.Hamster ≥ 0.5 ∧ T undra.V ole < 0.5 ∧ House.mouse.1 ≥
0.5) ∨ (European.Hamster ≥ 0.5 ∧ T undra.V ole ≥ 0.5) ←→ (p10 ≥ 42.75 ∧ p10 < 45.15 ∧ p6 ≥ 73.75) ∨ (p10 <
42.75 ∧ t11 − max < 9.55 ∧ p4 < 49.6)
(Etruscan.Shrew < 0.5 ∧ Common.Shrew < 0.5 ∧ Savi.s.P ine.V ole ≥ 0.5) ∨ (Etruscan.Shrew ≥ 0.5 ∧
P yrenean.Desman < 0.5) ←→ (t10 − avg ≥ 12.75 ∧ p10 ≥ 54.25 ∧ p1 < 104.5)
(Edible.dormouse < 0.5 ∧ M editerranean.W ater.Shrew ≥ 0.5 ∧ P yrenean.Desman ≥ 0.5) ∨ (Edible.dormouse ≥
0.5 ∧ M editerranean.W ater.Shrew < 0.5 ∧ American.M ink < 0.5) ∨ (Edible.dormouse ≥ 0.5 ∧
M editerranean.W ater.Shrew ≥ 0.5) ←→ (p5 ≥ 58.55 ∧ t7 − max < 21.15 ∧ p5 ≥ 107.5) ∨ (p5 ≥ 58.55 ∧ t7 − max ≥
21.15 ∧ t5 − min < 7.45)
(Alpine.Ibex ≥ 0.5) ←→ (p5 ≥ 107.5)
(Common.Genet < 0.5 ∧ Kuhl.s.P ipistrelle ≥ 0.5 ∧ Coypu ≥ 0.5) ∨ (Common.Genet ≥ 0.5 ∧
Lesser.W hite.toothed.Shrew
<
0.5 ∧ Egyptian.M ongoose
<
0.5) ∨ (Common.Genet
≥
0.5 ∧
Lesser.W hite.toothed.Shrew ≥ 0.5) ←→ (t3 − max < 11.95 ∧ t3 − max ≥ 11.05 ∧ t5 − min < 5.65) ∨ (t3 − max ≥
11.95 ∧ t10 − max ≥ 20.25 ∧ p12 < 59.45) ∨ (t3 − max ≥ 11.95 ∧ t10 − max < 20.25 ∧ t3 − max ≥ 13.75)
(Alpine.marmot < 0.5 ∧ Alpine.Shrew ≥ 0.5 ∧ Brown.rat < 0.5) ∨ (Alpine.marmot ≥ 0.5 ∧ Common.Shrew ≥
0.5) ←→ (p5 ≥ 103.5 ∧ t2 − min < −6.35)
(Chamois < 0.5 ∧ Gerbe.s.V ole < 0.5 ∧ Alpine.Shrew ≥ 0.5) ∨ (Chamois < 0.5 ∧ Gerbe.s.V ole ≥ 0.5) ∨ (Chamois ≥
0.5) ←→ (p5 ≥ 75.95 ∧ p1 < 132.5 ∧ p5 ≥ 90.9)
(Egyptian.M ongoose ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ←→ (t3 − max ≥ 13.25 ∧ t3 − max ≥ 17.25)
(Stoat < 0.5 ∧ M editerranean.M onk.Seal ≥ 0.5) ←→ (p9 < 15.5)
(Alpine.marmot ≥ 0.5 ∧ Alpine.F ield.M ouse ≥ 0.5) ←→ (p5 ≥ 106.5)
(Granada.Hare ≥ 0.5 ∧ Iberian.Lynx ≥ 0.5) ←→ (t7 − max ≥ 34.25)
Table A.2: Redescriptions mined by Algorithm 1 from the Bio data set (IG impurity measure, min bucket = 100). J denotes the Jaccard similarity and E1,1 the support; tn-{max; min; avg} stand for the maximum, minimum, and average temperature of month n in degrees Celsius, and pn stands for the average precipitation of month n in millimeters.
J = 0.966, E1,1 = 2347, p-val. = 0.000
J = 0.958, E1,1 = 2262, p-val. = 0.000
J = 0.956, E1,1 = 2159, p-val. = 0.000
J = 0.949, E1,1 = 2372, p-val. = 0.000
J = 0.947, E1,1 = 2286, p-val. = 0.000
J = 0.947, E1,1 = 2267, p-val. = 0.000
J = 0.947, E1,1 = 2178, p-val. = 0.000
J = 0.938, E1,1 = 1905, p-val. = 0.000
J = 0.938, E1,1 = 1905, p-val. = 0.000
J = 0.932, E1,1 = 2072, p-val. = 0.000
J = 0.907, E1,1 = 1789, p-val. = 0.000
J = 0.906, E1,1 = 2079, p-val. = 0.000
J = 0.900, E1,1 = 1816, p-val. = 0.000
J = 0.891, E1,1 = 1750, p-val. = 0.000
J = 0.883, E1,1 = 1968, p-val. = 0.000
Redescription
(Arctic.F ox < 0.5) ←→ (t6 − avg < 10.25 ∧ t9 − max ≥ 10.75) ∨ (t6 − avg ≥ 10.25)
(Arctic.F ox < 0.5 ∧ W olverine < 0.5) ←→ (t9 − max < 12.15 ∧ t9 − max ≥ 10.85) ∨ (t9 − max ≥ 12.15)
(W ood.Lemming < 0.5∧M oose ≥ 0.5∧Brown.Bear < 0.5)∨(W ood.Lemming < 0.5∧M oose < 0.5) ←→ (t2−avg <
−4.99 ∧ t7 − avg < 10.85) ∨ (t2 − avg ≥ −4.99)
(Stoat < 0.5 ∧ Granada.Hare < 0.5) ∨ (Stoat ≥ 0.5) ←→ (p8 < 40.15 ∧ p11 ≥ 54) ∨ (p8 ≥ 40.15)
(N orway.lemming < 0.5) ←→ (t8 − avg < 12.55 ∧ t7 − max < 13.95) ∨ (t8 − avg ≥ 12.55)
(Grey.Red.Backed.V ole < 0.5) ←→ (t1 − min < −11.45 ∧ t7 − max ≥ 20.05) ∨ (t1 − min ≥ −11.45)
(W ood.mouse < 0.5∧M ountain.Hare ≥ 0.5∧Striped.F ield.M ouse ≥ 0.5)∨(W ood.mouse < 0.5∧M ountain.Hare <
0.5)∨(W ood.mouse ≥ 0.5) ←→ (t4−max < 7.85∧t7−max ≥ 14.05∧t10−max ≥ 7.15)∨(t4−max < 7.85∧t7−max <
14.05) ∨ (t4 − max ≥ 7.85)
(M oose ≥ 0.5∧M ountain.Hare < 0.5)∨(M oose < 0.5) ←→ (t3−max < 4.65∧t7−max < 13.45)∨(t3−max ≥ 4.65)
(M ountain.Hare ≥ 0.5 ∧ M oose < 0.5) ∨ (M ountain.Hare < 0.5) ←→ (t3 − max < 4.65 ∧ t7 − max < 13.45) ∨ (t3 −
max ≥ 4.65)
(W ood.mouse < 0.5 ∧ M ountain.Hare < 0.5) ∨ (W ood.mouse ≥ 0.5) ←→ (t10 − max < 10.85 ∧ t2 − max <
−1.45 ∧ t7 − avg < 10.65) ∨ (t10 − max < 10.85 ∧ t2 − max ≥ −1.45) ∨ (t10 − max ≥ 10.85)
(M oose < 0.5) ←→ (t2 − max < 1.55 ∧ t6 − max < 12.05) ∨ (t2 − max ≥ 1.55)
(Stoat < 0.5 ∧ House.mouse ≥ 0.5 ∧ Common.V ole ≥ 0.5) ∨ (Stoat < 0.5 ∧ House.mouse < 0.5) ∨ (Stoat ≥ 0.5) ←→
(t10 − max ≥ 18.65 ∧ t11 − avg < 9.73) ∨ (t10 − max < 18.65)
(M ountain.Hare ≥ 0.5 ∧ W ild.boar ≥ 0.5) ∨ (M ountain.Hare < 0.5 ∧ W ild.boar < 0.5 ∧ Gray.Seal < 0.5) ∨
(M ountain.Hare < 0.5∧W ild.boar ≥ 0.5) ←→ (t5−max < 16.05∧t7−max ≥ 13.45∧t8−max ≥ 19.55)∨(t5−max <
16.05 ∧ t7 − max < 13.45) ∨ (t5 − max ≥ 16.05)
(M ountain.Hare < 0.5) ←→ (t9−avg < 13.05∧t7−max ≥ 13.45∧t10−max ≥ 11.45)∨(t9−avg < 13.05∧t7−max <
13.45) ∨ (t9 − avg ≥ 13.05)
(Stoat < 0.5 ∧ House.mouse < 0.5) ∨ (Stoat ≥ 0.5) ←→ (t11 − max ≥ 11.05 ∧ t8 − avg ≥ 19.55 ∧ p10 < 55.45) ∨ (t11 −
max ≥ 11.05 ∧ t8 − avg < 19.55) ∨ (t11 − max < 11.05)
J = 0.877, E1,1 = 1613, p-val. = 0.000
J = 0.870, E1,1 = 1667, p-val. = 0.000
J = 0.849, E1,1 = 1625, p-val. = 0.000
J = 0.842, E1,1 = 1691, p-val. = 0.000
J = 0.835, E1,1 = 1701, p-val. = 0.000
J = 0.829, E1,1 = 1414, p-val. = 0.000
J = 0.823, E1,1 = 1599, p-val. = 0.000
J = 0.808, E1,1 = 1325, p-val. = 0.000
J = 0.804, E1,1 = 1436, p-val. = 0.000
J = 0.802, E1,1 = 1167, p-val. = 0.000
J = 0.781, E1,1 = 1013, p-val. = 0.000
J = 0.767, E1,1 = 603, p-val. = 0.000
J = 0.749, E1,1 = 870, p-val. = 0.000
J = 0.748, E1,1 = 935, p-val. = 0.000
J = 0.745, E1,1 = 825, p-val. = 0.000
J = 0.741, E1,1 = 611, p-val. = 0.000
J = 0.738, E1,1 = 637, p-val. = 0.000
J = 0.714, E1,1 = 282, p-val. = 0.000
J = 0.702, E1,1 = 353, p-val. = 0.000
Redescription
(M ountain.Hare ≥ 0.5 ∧ Beech.M arten ≥ 0.5) ∨ (M ountain.Hare < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8 − max <
21.15 ∧ t7 − max < 13.45) ∨ (t8 − max ≥ 21.15)
(M uskrat < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (M uskrat < 0.5 ∧ Common.Shrew ≥ 0.5) ∨
(M uskrat ≥ 0.5) ←→ (p7 ≥ 39.85 ∧ p10 ≥ 83.35 ∧ t1 − avg < 2.745) ∨ (p7 ≥ 39.85 ∧ p10 < 83.35)
(House.mouse ≥ 0.5 ∧ Y ellow.necked.M ouse ≥ 0.5 ∧ Common.V ole ≥ 0.5) ∨ (House.mouse < 0.5) ←→ (t1 − avg ≥
0.95 ∧ t1 − min < −0.35 ∧ t8 − max ≥ 22.65) ∨ (t1 − avg < 0.95)
(Common.Shrew < 0.5 ∧ House.mouse < 0.5) ∨ (Common.Shrew ≥ 0.5) ←→ (t1 − max ≥ 4.35 ∧ t3 − avg <
6.235) ∨ (t1 − max < 4.35)
(Stoat < 0.5 ∧ American.M ink < 0.5 ∧ Eurasian.W ater.Shrew ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.M ink ≥ 0.5) ∨
(Stoat ≥ 0.5) ←→ (p8 ≥ 49.55)
(House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5) ∨ (House.mouse < 0.5) ←→ (t1 − max < 4.35)
(Stoat < 0.5 ∧ European.M ole ≥ 0.5) ∨ (Stoat ≥ 0.5) ←→ (p7 ≥ 42 ∧ t7 − max ≥ 16.75)
(American.M ink < 0.5 ∧ M ountain.Hare < 0.5) ←→ (t8 − max < 22.05 ∧ p8 < 57.65) ∨ (t8 − max ≥ 22.05)
(Stoat < 0.5 ∧ Black.rat < 0.5 ∧ American.M ink ≥ 0.5) ∨ (Stoat ≥ 0.5) ←→ (t9 − max < 22.15 ∧ p8 ≥ 51.75)
(House.mouse ≥ 0.5∧Raccoon ≥ 0.5)∨(House.mouse < 0.5∧M oose < 0.5∧House.mouse.1 ≥ 0.5)∨(House.mouse <
0.5 ∧ M oose ≥ 0.5) ←→ (t1 − max < 5.05 ∧ p2 ≥ 40.45 ∧ t1 − max < 3.15) ∨ (t1 − max < 5.05 ∧ p2 < 40.45)
(House.mouse.1 < 0.5 ∧ Raccoon.Dog ≥ 0.5) ∨ (House.mouse.1 ≥ 0.5) ←→ (t1 − max < 4.35 ∧ t7 − avg ≥ 12.55)
(M oose ≥ 0.5) ←→ (t2 − max < 1.55 ∧ t6 − max ≥ 12.05)
(House.mouse ≥ 0.5 ∧ Raccoon < 0.5) ←→ (t1 − max ≥ 4.35)
(American.M ink < 0.5 ∧ M ountain.Hare ≥ 0.5) ∨ (American.M ink ≥ 0.5) ←→ (t8 − max < 22.05 ∧ p8 ≥ 57.65)
(American.M ink < 0.5 ∧ M ountain.Hare ≥ 0.5) ∨ (American.M ink ≥ 0.5 ∧ Beech.M arten < 0.5) ←→ (t8 − max <
21.15)
(M ountain.Hare ≥ 0.5) ←→ (t9 − avg < 13.05 ∧ t7 − max ≥ 13.45 ∧ t10 − max < 11.45)
(M oose < 0.5∧Red.F ox < 0.5∧House.mouse < 0.5)∨(M oose ≥ 0.5∧Beech.M arten < 0.5) ←→ (t10−max < 10.45)
(M oose ≥ 0.5 ∧ Brown.Bear ≥ 0.5) ←→ (t2 − max < −1.75 ∧ t7 − avg ≥ 10.85)
(W ood.mouse < 0.5 ∧ M ountain.Hare ≥ 0.5) ←→ (t10 − max < 10.85 ∧ t2 − max < −1.45 ∧ t7 − avg ≥ 10.65)
J = 0.701, E1,1 = 604, p-val. = 0.000
J = 0.693, E1,1 = 789, p-val. = 0.000
J = 0.663, E1,1 = 212, p-val. = 0.000
J = 0.655, E1,1 = 182, p-val. = 0.000
J = 0.641, E1,1 = 567, p-val. = 0.000
J = 0.634, E1,1 = 772, p-val. = 0.000
J = 0.591, E1,1 = 182, p-val. = 0.000
J = 0.484, E1,1 = 151, p-val. = 0.000
J = 0.442, E1,1 = 76, p-val. = 0.000
J = 0.418, E1,1 = 107, p-val. = 0.000
Redescription
(Common.Shrew < 0.5 ∧ Greater.W hite.toothed.Shrew < 0.5 ∧ Black.rat ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧
Greater.W hite.toothed.Shrew ≥ 0.5) ←→ (t3 − max ≥ 10.55 ∧ t1 − avg ≥ 0.657)
(Stoat < 0.5 ∧ Black.rat < 0.5 ∧ American.M ink < 0.5) ∨ (Stoat < 0.5 ∧ Black.rat ≥ 0.5) ←→ (t9 − max <
22.15 ∧ p8 < 51.75) ∨ (t9 − max ≥ 22.15)
(Arctic.F ox < 0.5 ∧ N orway.lemming ≥ 0.5) ∨ (Arctic.F ox ≥ 0.5) ←→ (t8 − avg < 12.55 ∧ t9 − avg < 6.535)
(W ood.Lemming ≥ 0.5) ←→ (t12 − max < −0.65 ∧ t5 − avg ≥ 3.195 ∧ t12 − avg < −6.125)
(Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5) ←→ (t1 − max ≥ 4.35 ∧ t3 − avg ≥ 6.235)
(M oose < 0.5 ∧ Greater.Horseshoe.Bat < 0.5 ∧ Gray.W olf ≥ 0.5) ∨ (M oose < 0.5 ∧ Greater.Horseshoe.Bat ≥
0.5) ←→ (t10 − max ≥ 14.25 ∧ t9 − min < 16.25)
(Grey.Red.Backed.V ole ≥ 0.5) ←→ (t1 − min < −11.45 ∧ t7 − max < 20.05)
(European.Hamster ≥ 0.5) ←→ (p10 < 45.15 ∧ p6 ≥ 61.85 ∧ p4 < 48.25)
(Alpine.marmot ≥ 0.5) ←→ (p4 ≥ 51.75 ∧ p5 ≥ 97.35)
(Alpine.Shrew ≥ 0.5) ←→ (p6 ≥ 86.85 ∧ p5 ≥ 90.15)
Table A.3: Redescriptions mined by Algorithm 2 from the Bio data set (IG impurity measure, min bucket = 50). J denotes the Jaccard similarity and E1,1 the support; tn-{max; min; avg} stand for the maximum, minimum, and average temperature of month n in degrees Celsius, and pn stands for the average precipitation of month n in millimeters.
J = 0.912, E1,1 = 1406, p-val. = 0.000
J = 0.876, E1,1 = 1590, p-val. = 0.000
J = 0.841, E1,1 = 1671, p-val. = 0.000
J = 0.823, E1,1 = 1293, p-val. = 0.000
J = 0.811, E1,1 = 1368, p-val. = 0.000
J = 0.803, E1,1 = 1237, p-val. = 0.000
J = 0.802, E1,1 = 759, p-val. = 0.000
J = 0.801, E1,1 = 1180, p-val. = 0.000
Redescription
(M editerranean.W ater.Shrew ≥ 0.5) ∧ (Alpine.Shrew ≥ 0.5) ∨ (M editerranean.W ater.Shrew < 0.5) ∧ (M oose <
0.5) ∧ (Arctic.F ox < 0.5) ←→ (p5 ≥ 58.65) ∧ (p6 ≥ 86.85) ∨ (p5 < 58.65) ∧ (t11 − max ≥ 6.85) ∧ (t9 − max ≥ 10.75)
(Brown.rat < 0.5) ∧ (Common.Shrew ≥ 0.5) ∨ (Brown.rat ≥ 0.5) ∧ (Eurasian.W ater.Shrew ≥ 0.5) ∨
(Common.Shrew < 0.5) ∧ (American.M ink ≥ 0.5) ∨ (Eurasian.W ater.Shrew < 0.5) ∧ (M ountain.Hare ≥ 0.5) ←→
(p8 < 38.25) ∧ (p7 ≥ 46.1) ∨ (p8 ≥ 38.25) ∧ (t7 − avg < 20.25) ∨ (t7 − avg ≥ 20.25) ∧ (t9 − max < 17.35)
(Eurasian.P ygmy.Shrew < 0.5) ∧ (American.M ink ≥ 0.5) ∨ (American.M ink < 0.5) ∧ (W ood.mouse < 0.5) ∨
(Eurasian.P ygmy.Shrew ≥ 0.5) ∧ (Lusitanian.P ine.V ole < 0.5) ∧ (Etruscan.Shrew < 0.5) ←→ (p8 < 49.55) ∧ (t3 −
max < 5.45) ∨ (t3 − max ≥ 5.45) ∧ (t1 − max < 1.65) ∨ (p8 ≥ 49.55) ∧ (t2 − max < 8.85) ∧ (t11 − max < 10.95)
(European.P ine.M arten ≥ 0.5) ∧ (Common.Shrew ≥ 0.5) ∨ (European.P ine.M arten < 0.5) ∧ (House.mouse <
0.5) ∧ (House.mouse.1 ≥ 0.5) ∨ (Common.Shrew < 0.5) ∧ (W ood.mouse < 0.5) ←→ (t1 − avg < 2.365) ∧ (t2 − max <
6.25) ∨ (t1 − avg ≥ 2.365) ∧ (t1 − max < 4.05) ∧ (p7 ≥ 36.75) ∨ (t2 − max ≥ 6.25) ∧ (t12 − max < 4.45)
(Garden.dormouse ≥ 0.5) ∧ (Common.Shrew < 0.5) ∨ (Garden.dormouse < 0.5) ∧ (American.M ink ≥ 0.5) ∧
(Beech.M arten ≥ 0.5) ∨ (Garden.dormouse < 0.5) ∧ (American.M ink < 0.5) ∧ (Gray.Seal < 0.5) ←→ (t9 − max ≥
18.95) ∧ (p7 < 61.85) ∨ (t9 − max < 18.95) ∧ (t8 − max < 22.25) ∧ (t5 − min ≥ 5.95) ∨ (t9 − max < 18.95) ∧ (t8 − max ≥
22.25) ∧ (t5 − max ≥ 15.9)
(Common.Shrew < 0.5) ∧ (House.mouse < 0.5) ∧ (House.mouse.1 ≥ 0.5) ∨ (Common.Shrew ≥ 0.5) ∧
(Kuhl.s.P ipistrelle < 0.5) ∧ (Chamois < 0.5) ←→ (t2 − max ≥ 7.25) ∧ (t1 − max < 3.55) ∧ (p7 ≥ 36.85) ∨ (t2 − max <
7.25) ∧ (p4 < 59.55) ∧ (p5 < 76.05)
(Kuhl.s.P ipistrelle ≥ 0.5) ∧ (Alpine.marmot < 0.5) ∨ (Kuhl.s.P ipistrelle < 0.5) ∧ (Common.Shrew < 0.5) ∧
(House.mouse ≥ 0.5) ←→ (t3 − max ≥ 11.05) ∧ (t2 − min ≥ −3.95) ∨ (t3 − max < 11.05) ∧ (t3 − avg ≥ 6.375) ∧ (t1 −
max ≥ 3.55)
(European.W ater.V ole ≥ 0.5) ∧ (Common.Shrew ≥ 0.5) ∨ (European.W ater.V ole < 0.5) ∧ (House.mouse < 0.5) ∧
(M oose ≥ 0.5) ←→ (t1−max < 7.25)∧(t3−max < 10.55)∨(t1−max ≥ 7.25)∧(t12−max < 5.65)∧(t2−max < 0.45)
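A note on reading the J, E1,1, and p-val. columns in the tables of this and the following appendix: once the two queries of a redescription have been evaluated to Boolean support vectors over the same entities, the Jaccard similarity is J = |E1,1| / (|E1,1| + |E1,0| + |E0,1|). The following minimal Python sketch (with made-up support vectors, not data from the experiments) computes J and the E1,1 support:

import numpy as np

def jaccard_and_support(lhs, rhs):
    # lhs, rhs: Boolean support vectors of the two queries over the same entities
    lhs = np.asarray(lhs, dtype=bool)
    rhs = np.asarray(rhs, dtype=bool)
    e11 = int(np.sum(lhs & rhs))        # entities satisfying both queries
    union = int(np.sum(lhs | rhs))      # entities satisfying at least one query
    j = e11 / union if union > 0 else 0.0
    return j, e11

# toy example with six entities
print(jaccard_and_support([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1]))   # (0.6, 3)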
Appendix B Redescription Sets from experiments with DBLP data Set
Table B.1: Redescriptions mined by Algorithm 1 from the DBLP data set (DBSCAN binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand side of the redescription and RHS the right-hand side; J denotes the Jaccard similarity and E1,1 the support; CONF [a-b] means that the author submitted from a to b papers to conference CONF.
J = 0.998, E1,1 = 2335, p-val. = 0.119
J = 0.997, E1,1 = 2334, p-val. = 0.127
J = 0.997, E1,1 = 2334, p-val. = 0.261
J = 0.996, E1,1 = 2331, p-val. = 0.151
J = 0.972, E1,1 = 2270, p-val. = 0.155
J = 0.571, E1,1 = 4, p-val. = 0.000
J = 0.500, E1,1 = 4, p-val. = 0.000
J = 0.500, E1,1 = 4, p-val. = 0.000
J = 0.500, E1,1 = 5, p-val. = 0.000
J = 0.429, E1,1 = 3, p-val. = 0.000
J = 0.333, E1,1 = 3, p-val. = 0.000
J = 0.313, E1,1 = 5, p-val. = 0.000
J = 0.308, E1,1 = 4, p-val. = 0.000
J = 0.273, E1,1 = 3, p-val. = 0.000
J = 0.133, E1,1 = 10, p-val. = 0.000
Redescription
COLT ≥ 9.5 ∧ U AI < 0.5 ∨ COLT < 9.5 ←→ Y oavF reund < 1.5
COLT ≥ 10.5 ∧ ICM L < 3.5 ∨ COLT < 10.5 ←→ RobertE.Schapire < 3.5
V LDB ≥ 17.5 ∧ P ODS < 3.5 ∨ V LDB < 17.5 ←→ S.Sudarshan < 6
V LDB ≥ 18.5 ∧ SIGM ODConf erence ≥ 26.5 ∨ V LDB < 18.5 ←→ ShaulDar < 0.5
ST OC ≥ 8.5 ∧ SODA ≥ 5.5 ∨ ST OC < 8.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali < 0.5 ∨ AviW igderson < 0.5
ECM L ≥ 2.5 ∧ U AI ≥ 1.5 ←→ P eterGrunwald ≥ 0.5
W W W ≥ 3.5 ∧ F OCS ≥ 1.5 ←→ RaviKumar ≥ 7.5
U AI ≥ 2.5 ∧ KDD ≥ 2.5 ←→ T omiSilander ≥ 0.5
ICDE ≥ 12.5 ∧ EDBT < 3.5 ←→ AnthonyK.H.T ung ≥ 1.5 ∧ Jef f reyXuY u ≥ 0.5
KDD ≥ 6.5 ∧ ICDM ≥ 6.5 ←→ JiongY ang ≥ 3.5
F OCS ≥ 13.5 ∧ COLT ≥ 0.5 ←→ RichardM.Karp ≥ 1.5 ∧ AviW igderson ≥ 0.5
P ODS ≥ 8.5 ∧ SIGM ODConf erence ≥ 3.5 ←→ CatrielBeeri ≥ 1.5 ∧ HectorGarcia − M olina ≥ 0.5
ICDE ≥ 13.5 ∧ W W W < 1.5 ←→ AnthonyK.H.T ung ≥ 1.5 ∧ RaymondT.N g ≥ 0.5
V LDB ≥ 17.5 ∧ P ODS ≥ 3.5 ←→ S.Sudarshan ≥ 6
ST OC ≥ 8.5 ∧ SODA < 5.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali ≥ 0.5
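As an illustration of how an individual row of the DBLP tables is read, the sketch below evaluates both sides of the first redescription of Table B.1 (COLT ≥ 9.5 ∧ UAI < 0.5 ∨ COLT < 9.5 on the conference side, YoavFreund < 1.5 on the co-author side) on two invented toy authors; the counts are hypothetical and only show how the threshold queries are applied.

# hypothetical per-author counts; not taken from the DBLP data set
authors = {
    "author_1": {"conf": {"COLT": 12, "UAI": 0}, "coauth": {"YoavFreund": 3}},
    "author_2": {"conf": {"COLT": 2,  "UAI": 1}, "coauth": {"YoavFreund": 0}},
}

def lhs(conf):     # COLT >= 9.5 AND UAI < 0.5, OR COLT < 9.5
    return (conf.get("COLT", 0) >= 9.5 and conf.get("UAI", 0) < 0.5) or conf.get("COLT", 0) < 9.5

def rhs(coauth):   # YoavFreund < 1.5
    return coauth.get("YoavFreund", 0) < 1.5

for name, a in authors.items():
    print(name, lhs(a["conf"]), rhs(a["coauth"]))
# author_1: True, False  -> counted in E1,0 (covered by the left query only)
# author_2: True, True   -> counted in E1,1 (covered by both queries)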
Table B.2: Redescriptions mined by Algorithm 1 from the DBLP data set (k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand side of the redescription and RHS the right-hand side; J denotes the Jaccard similarity and E1,1 the support; CONF [a-b] means that the author submitted from a to b papers to conference CONF.
J = 0.996, E1,1 = 2330, p-val. = 0.158
J = 0.972, E1,1 = 2270, p-val. = 0.155
J = 0.667, E1,1 = 4, p-val. = 0.000
J = 0.500, E1,1 = 4, p-val. = 0.000
J = 0.417, E1,1 = 5, p-val. = 0.000
J = 0.357, E1,1 = 5, p-val. = 0.000
J = 0.333, E1,1 = 4, p-val. = 0.000
J = 0.133, E1,1 = 10, p-val. = 0.000
Redescription
KDD ≥ 6.5 ∧ SIGM ODConf erence < 2.5 ∨ KDD < 6.5 ←→ JianyongW ang < 1.5
ST OC ≥ 8.5 ∧ SODA ≥ 5.5 ∨ ST OC < 8.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali < 0.5 ∨ AviW igderson < 0.5
U AI ≥ 23 ←→ DavidM axwellChickering ≥ 0.5
U AI ≥ 2.5 ∧ KDD ≥ 2.5 ←→ T omiSilander ≥ 0.5
P ODS ≥ 11.5 ∧ ST OC < 0.5 ←→ F rankN even ≥ 0.5
V LDB ≥ 18.5 ∧ SIGM ODConf erence < 26.5 ←→ ShaulDar ≥ 0.5
F OCS ≥ 20.5 ←→ SilvioM icali ≥ 1.5 ∧ RonaldL.Rivest ≥ 0.5
ST OC ≥ 8.5 ∧ SODA < 5.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali ≥ 0.5
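Table B.2 above uses a k-means (5 clusters) binarization routine. As a rough sketch of that idea (not the exact routine used in the experiments), a numeric attribute can be turned into threshold-style Boolean features by clustering its values with k-means and cutting at the midpoints between consecutive cluster centres:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_thresholds(values, k=5, seed=0):
    # cluster a 1-D attribute and return candidate cut points between cluster centres
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    centres = np.sort(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(values).cluster_centers_.ravel())
    return [(a + b) / 2.0 for a, b in zip(centres[:-1], centres[1:])]

# each returned threshold t gives one Boolean feature "attribute >= t"
counts = np.random.default_rng(0).poisson(3, size=200)
print(kmeans_thresholds(counts, k=5))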
Table B.3: Redescriptions mined by Algorithm 1 from the DBLP data set (hierarchical clustering binarization routine; IG impurity measure; min bucket = Li/100). LHS is the left-hand side of the redescription and RHS the right-hand side; J denotes the Jaccard similarity and E1,1 the support; CONF [a-b] means that the author submitted from a to b papers to conference CONF.
E1,1
2
3
p-val.
0
0
1
1
1
1
1
1
1
1
1
1
3
1
2
3
1
5
0
0
0
0
0
0
0
0
1
1
1
0.667
1
1
5
2
0
0
0
0
Redescription
ICDT ≥ 2.5 ∧ COLT ≥ 0.5 ←→ F otoN.Af rati ≥ 1.5 ∧ GeorgGottlob ≥ 0.5
ICDT ≥ 4.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 2.5 ←→ GostaGrahne < 2.5 ∧ KotagiriRamamohanarao ≥ 13 ∨
GostaGrahne ≥ 2.5 ∧ JigneshM.P atel < 0.5
ST ACS ≥ 14.5 ←→ LeenT orenvliet ≥ 7.5
P ODS ≥ 11.5 ∧ W W W ≥ 1.5 ←→ ArnaudSahuguet ≥ 6.5
W W W ≥ 3.5 ∧ F OCS ≥ 4 ←→ AndrewT omkins ≥ 8.5
P KDD ≥ 4.5 ∧ ECM L ≥ 2.5 ←→ LucDeRaedt ≥ 6.5
W W W ≥ 4.5 ∧ ICDM ≥ 3.5 ←→ BenyuZhang ≥ 12
ICDM ≥ 8.5 ∧ ICM L < 0.5 ←→ JiongY ang ≥ 2.5 ∧ P hilipS.Y u ≥ 8
ICDM ≥ 15.5 ←→ Kun − LungW u ≥ 24
U AI ≥ 23 ←→ DavidM axwellChickering < 0.5 ∧ BruceD'Ambrosio ≥ 0.5 ∨ DavidM axwellChickering ≥ 0.5 ∧
P eterSpirtes < 1
V LDB ≥ 29.5 ←→ JigneshM.P atel ≥ 4.5
F OCS ≥ 28.5 ←→ BennySudakov ≥ 4.5
ECM L ≥ 2.5 ∧ U AI ≥ 1.5 ←→ P eterGrunwald < 1.5 ∧ StephenD.Bay ≥ 3.5 ∨ P eterGrunwald ≥ 1.5
ICDM ≥ 10 ←→ W eiF an ≥ 11
J = 1
J = 1
Table B.4: Redescriptions mined by Algorithm 2 from the DBLP data set (DBSCAN binarization routine; IG impurity measure; min bucket = ∑ Li/100). LHS is the left-hand side of the redescription and RHS the right-hand side; J denotes the Jaccard similarity and E1,1 the support.
J = 0.985, E1,1 = 2301, p-val. = 0.125
J = 0.978, E1,1 = 2287, p-val. = 0.255
J = 0.971, E1,1 = 2272, p-val. = 0.279
J = 0.959, E1,1 = 745, p-val. = 0.000
J = 0.959, E1,1 = 1557, p-val. = 0.000
J = 0.957, E1,1 = 707, p-val. = 0.000
J = 0.954, E1,1 = 700, p-val. = 0.000
J = 0.949, E1,1 = 636, p-val. = 0.000
J = 0.948, E1,1 = 529, p-val. = 0.000
Redescription
COLT < 0.5 ∧ SIGM ODConf erence < 0.5 ∨ COLT ≥ 0.5 ∧ F OCS ≥ 1 ∨ SIGM ODConf erence ≥ 0.5 ∧ V LDB <
21.5 ∨ F OCS < 1 ∧ ECM L < 1.5 ←→ RobertE.Schapire < 0.5 ∧ HamidP irahesh < 0.5 ∨ RobertE.Schapire ≥
0.5 ∧ M ichaelJ.Kearns ≥ 1.5 ∨ HamidP irahesh ≥ 0.5 ∧ JohannesGehrke < 0.5 ∨ M ichaelJ.Kearns < 1.5 ∧
Stef anKramer < 0.5
COLT ≥ 0.5 ∨ COLT < 0.5 ∧ SIGM ODConf erence < 0.5 ∨ SIGM ODConf erence ≥ 0.5 ∧ V LDB <
21.5 ←→ RonittRubinf eld ≥ 2.5 ∨ RonittRubinf eld < 2.5 ∧ HamidP irahesh < 0.5 ∨ HamidP irahesh ≥
0.5 ∧ JohannesGehrke < 0.5
ST ACS ≥ 1.5 ∨ ST ACS < 1.5 ∧ SIGM ODConf erence < 0.5 ∨ SIGM ODConf erence ≥ 0.5 ∧ V LDB <
21.5 ←→ LeszekGasieniec ≥ 2.5 ∨ LeszekGasieniec < 2.5 ∧ HamidP irahesh < 0.5 ∨ HamidP irahesh ≥
0.5 ∧ JohannesGehrke < 0.5
SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ SODA < 1.5 ∨ SIGM ODConf erence ≥ 1.5 ∧ ICDE ≥ 3.5 ∧ KDD ≥
3 ←→ H.V.Jagadish < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ M osesCharikar < 1.5 ∨ H.V.Jagadish ≥ 0.5 ∧ JiaweiHan ≥
2.5 ∧ BengChinOoi ≥ 0.5
ST OC ≥ 1.5 ∧ F OCS < 0.5 ∨ ST OC < 1.5 ∧ SODA < 0.5 ∨ F OCS ≥ 0.5 ∧ SIGM ODConf erence ≥ 0.5 ∨ SODA ≥
0.5 ∧ P ODS ≥ 1.5 ←→ F rankT homsonLeighton ≥ 0.5 ∧ AviW igderson < 0.5 ∨ F rankT homsonLeighton < 0.5 ∧
SergeA.P lotkin < 0.5 ∨ AviW igderson ≥ 0.5 ∧ CatrielBeeri ≥ 0.5 ∨ SergeA.P lotkin ≥ 0.5 ∧ M ayurDatar ≥ 0.5
ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 4.5 ∧ SIGM ODConf erence ≥
1 ←→ F riedhelmM eyerauf derHeide < 1.5 ∧ AviW igderson ≥ 0.5 ∧ AlbertoO.M endelzon < 0.5 ∨
F riedhelmM eyerauf derHeide ≥ 1.5 ∧ SantoshV empala ≥ 0.5 ∧ M ayurDatar ≥ 2
ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 4.5 ∧ ST ACS < 0.5 ←→ V ijayV.V azirani <
0.5∧AviW igderson ≥ 0.5∧AlbertoO.M endelzon < 0.5∨V ijayV.V azirani ≥ 0.5∧P iotrIndyk ≥ 0.5∧T etsuoAsano <
0.5
ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 1.5 ∧ SODA ≥ 4.5 ∧ COLT ≥ 2.5 ←→ M ichaelE.Saks < 0.5 ∧
AviW igderson ≥ 0.5 ∧ AlbertoO.M endelzon < 0.5 ∨ M ichaelE.Saks ≥ 0.5 ∧ AmosF iat ≥ 0.5 ∧ N aderH.Bshouty ≥
0.5
F OCS < 2.5 ∧ ST OC ≥ 1.5 ∧ V LDB < 1.5 ∨ F OCS ≥ 2.5 ∧ SODA ≥ 6 ∧ ICDT ≥ 0.5 ←→ RobertW.F loyd < 0.5 ∧
AviW igderson ≥ 0.5∧RonaldF agin < 0.5∨RobertW.F loyd ≥ 0.5∧M ichaelA.Bender ≥ 0.5∧SanjeevKhanna ≥ 1.5
J = 0.937, E1,1 = 775, p-val. = 0.000
J = 0.919, E1,1 = 711, p-val. = 0.000
J = 0.917, E1,1 = 2061, p-val. = 0.000
J = 0.904, E1,1 = 2094, p-val. = 0.091
J = 0.901, E1,1 = 182, p-val. = 0.000
J = 0.883, E1,1 = 2032, p-val. = 0.022
J = 0.879, E1,1 = 2033, p-val. = 0.092
J = 0.877, E1,1 = 2028, p-val. = 0.099
J = 0.876, E1,1 = 2024, p-val. = 0.074
Redescription
V LDB < 0.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ SDM < 0.5 ∨ V LDB ≥ 0.5 ∧ ICDE ≥ 7.5 ∧ ICDT < 0.5 ←→
ChristosF aloutsos < 0.5∧HamidP irahesh ≥ 0.5∧P hilipS.Y u < 2.5∨ChristosF aloutsos ≥ 0.5∧Kian−LeeT an ≥
0.5 ∧ RaymondT.N g < 4
ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ SIGM ODConf erence < 0.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE <
0.5 ←→ T omasF eder < 0.5 ∧ AviW igderson ≥ 0.5 ∧ CatrielBeeri < 0.5 ∨ T omasF eder ≥ 0.5 ∧ AmosF iat ≥
0.5 ∧ SergeA.P lotkin < 1.5
U AI < 2.5 ∧ ICDE ≥ 0.5 ∧ V LDB < 5.5 ∨ U AI < 2.5 ∧ ICDE < 0.5 ∧ SIGM ODConf erence < 2.5 ∨ U AI ≥
2.5 ∧ ECM L ≥ 2.5 ∧ P KDD < 0.5 ←→ T omiSilander < 0.5 ∧ GioW iederhold ≥ 0.5 ∧ M ichaelJ.Carey < 1.5 ∨
T omiSilander < 0.5 ∧ GioW iederhold < 0.5 ∧ HamidP irahesh < 0.5 ∨ T omiSilander ≥ 0.5 ∧ P eterGrunwald ≥
1 ∧ P etriKontkanen ≥ 7
W W W ≥ 0.5 ∨ W W W < 0.5 ∧ F OCS < 0.5 ∨ F OCS ≥ 0.5 ∧ ST OC < 8.5 ←→ AmelieM arian ≥ 1.5 ∨
AmelieM arian < 1.5 ∧ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ M oniN aor < 0.5
SIGM ODConf erence ≥ 0.5∧ICM L < 0.5∨SIGM ODConf erence < 0.5∧V LDB ≥ 0.5∧ICDE ≥ 13.5∨ICM L ≥
0.5 ∧ W W W < 0.5 ←→ W en − SyanLi ≥ 0.5 ∧ W ei − Y ingM a < 2 ∨ W en − SyanLi < 0.5 ∧ M ichaelJ.Carey ≥
0.5 ∧ LuisGravano ≥ 3.5 ∨ W ei − Y ingM a ≥ 2 ∧ Ji − RongW en < 0.5
COLT ≥ 1.5 ∧ SODA < 9.5 ∨ COLT < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence < 8.5 ∨ COLT < 1.5 ∧ ICDE <
0.5 ∧ V LDB < 0.5 ∨ SODA ≥ 9.5 ∧ F OCS < 3.5 ←→ LeslieG.V aliant ≥ 0.5 ∧ AmosF iat < 0.5 ∨ LeslieG.V aliant <
0.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ LeslieG.V aliant < 0.5 ∧ SurajitChaudhuri < 0.5 ∧
M ichaelJ.Carey < 0.5 ∨ AmosF iat ≥ 0.5 ∧ M artinF urer ≥ 0.5
ST ACS ≥ 2.5∨ST ACS < 2.5∧V LDB ≥ 0.5∧SIGM ODConf erence < 7.5∨ST ACS < 2.5∧V LDB < 0.5∧ICDE <
0.5 ←→ HansL.Bodlaender ≥ 1.5 ∨ HansL.Bodlaender < 1.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava <
0.5 ∨ HansL.Bodlaender < 1.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel < 7.5
ST ACS ≥ 1.5 ∨ ST ACS < 1.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence < 7.5 ∨ ST ACS < 1.5 ∧ V LDB < 0.5 ∧
ICDE < 0.5 ←→ AlanL.Selman ≥ 1.5 ∨ AlanL.Selman < 1.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava <
0.5 ∨ AlanL.Selman < 1.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel < 7.5
U AI ≥ 2.5 ∨ U AI < 2.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence < 8.5 ∨ U AI < 2.5 ∧ ICDE < 0.5 ∧ V LDB < 0.5 ←→
T omiSilander ≥ 1.5 ∨ T omiSilander < 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ T omiSilander <
1.5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey < 0.5
J = 0.872, E1,1 = 2014, p-val. = 0.052
J = 0.865, E1,1 = 198, p-val. = 0.000
J = 0.843, E1,1 = 532, p-val. = 0.000
J = 0.841, E1,1 = 660, p-val. = 0.000
J = 0.839, E1,1 = 1929, p-val. = 0.026
J = 0.822, E1,1 = 1900, p-val. = 0.095
J = 0.821, E1,1 = 630, p-val. = 0.000
J = 0.816, E1,1 = 248, p-val. = 0.000
J = 0.816, E1,1 = 624, p-val. = 0.000
J = 0.815, E1,1 = 1876, p-val. = 0.054
J = 0.809, E1,1 = 689, p-val. = 0.000
J = 0.806, E1,1 = 797, p-val. = 0.000
J = 0.802, E1,1 = 701, p-val. = 0.000
Redescription
ST OC ≥ 1.5 ∧ F OCS < 5.5 ∨ ST OC < 1.5 ∧ SODA < 0.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧
OdedGoldreich < 0.5
SIGM ODConf erence ≥ 0.5 ∧ W W W < 1.5 ∨ SIGM ODConf erence < 0.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥
13.5 ←→ StanleyB.Zdonik ≥ 0.5 ∧ M anolisKoubarakis < 0.5 ∨ StanleyB.Zdonik < 0.5 ∧ M ichaelJ.Carey ≥
0.5 ∧ RaymondT.N g ≥ 0.5
ST OC ≥ 2.5 ∧ ICDE < 0.5 ∨ ST OC < 2.5 ∧ F OCS ≥ 1.5 ∧ SODA < 5.5 ∨ ICDE ≥ 0.5 ∧ KDD < 2 ←→
M oniN aor ≥ 0.5 ∧ SurajitChaudhuri < 0.5 ∨ M oniN aor < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar <
0.5 ∨ SurajitChaudhuri ≥ 0.5 ∧ SanjayAgrawal < 0.5
ST OC ≥ 2.5∧ICDE < 1.5∨ST OC < 2.5∧F OCS ≥ 0.5∧SODA < 5.5 ←→ RonittRubinf eld ≥ 0.5∧M icahAdler <
0.5 ∨ RonittRubinf eld < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
ST OC < 1.5 ∨ ST OC ≥ 1.5 ∧ F OCS < 2.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ OdedGoldreich < 0.5
KDD < 0.5 ←→ JiaweiHan < 0.5
ST OC ≥ 2.5∨ST OC < 2.5∧F OCS ≥ 0.5∧SODA < 5.5 ←→ M oniN aor ≥ 0.5∨M oniN aor < 0.5∧AviW igderson ≥
0.5 ∧ M osesCharikar < 0.5
SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ EDBT < 0.5 ∨ SDM ≥ 1.5 ∧ KDD ≥ 2.5 ∧ SIGM ODConf erence ≥ 4.5 ←→
P hilipS.Y u < 4.5∧V ipinKumar ≥ 0.5∧P eerKroger < 1.5∨P hilipS.Y u ≥ 4.5∧JiaweiHan ≥ 2∧AidongZhang ≥ 2
F OCS ≥ 1.5 ∨ F OCS < 1.5 ∧ ST OC ≥ 0.5 ∧ SODA < 5.5 ←→ M adhuSudan ≥ 0.5 ∨ M adhuSudan < 0.5 ∧
AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
P ODS < 0.5 ←→ AbrahamSilberschatz < 0.5
V LDB ≥ 1.5 ∨ V LDB < 1.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ ICDM < 0.5 ←→ RakeshAgrawal ≥ 0.5 ∨
RakeshAgrawal < 0.5 ∧ HamidP irahesh ≥ 0.5 ∧ JiaweiHan < 0.5
SIGM ODConf erence < 2.5 ∧ V LDB < 0.5 ∧ ICDE ≥ 0.5 ∨ SIGM ODConf erence < 2.5 ∧ V LDB ≥ 0.5 ∧ SODA <
1.5 ∨ SIGM ODConf erence ≥ 2.5 ∧ P ODS ≥ 0.5 ∧ KDD < 0.5 ←→ H.V.Jagadish < 1.5 ∧ M ichaelJ.Carey < 0.5 ∧
Hans−P eterKriegel ≥ 7.5∨H.V.Jagadish < 1.5∧M ichaelJ.Carey ≥ 0.5∧M osesCharikar < 1.5∨H.V.Jagadish ≥
1.5 ∧ CatrielBeeri ≥ 0.5 ∧ JiaweiHan < 2
SIGM ODConf erence ≥ 0.5 ∧ P KDD < 2.5 ∨ SIGM ODConf erence < 0.5 ∧ V LDB ≥ 0.5 ∧ SODA < 0.5 ←→
HectorGarcia − M olina ≥ 2.5 ∨ HectorGarcia − M olina < 2.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ P iotrIndyk < 0.5
J = 0.797, E1,1 = 617, p-val. = 0.000
J = 0.786, E1,1 = 279, p-val. = 0.000
J = 0.734, E1,1 = 80, p-val. = 0.000
J = 0.728, E1,1 = 560, p-val. = 0.000
J = 0.674, E1,1 = 58, p-val. = 0.000
J = 0.663, E1,1 = 690, p-val. = 0.000
J = 0.649, E1,1 = 1488, p-val. = 0.084
J = 0.640, E1,1 = 290, p-val. = 0.000
J = 0.637, E1,1 = 1399, p-val. = 0.000
J = 0.620, E1,1 = 1412, p-val. = 0.125
J = 0.618, E1,1 = 194, p-val. = 0.000
Redescription
ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA < 5.5 ←→ ChristianSohler ≥ 0.5 ∨ ChristianSohler <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
EDBT ≥ 0.5 ∧ F OCS < 4.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ∨ F OCS ≥ 4.5 ∧ ST OC <
11.5 ←→ P eerKroger ≥ 2 ∧ AvrimBlum < 0.5 ∨ P eerKroger < 2 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥
0.5 ∨ AvrimBlum ≥ 0.5 ∧ V enkatesanGuruswami < 0.5
ICM L ≥ 3.5 ∧ ICDE < 0.5 ∨ ICM L < 3.5 ∧ ECM L ≥ 1 ∧ P KDD ≥ 1.5 ∨ ICDE ≥ 0.5 ∧ ICDM < 0.5 ←→
DoinaP recup ≥ 1.5 ∧ JiaweiHan < 0.5 ∨ DoinaP recup < 1.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 ∨
JiaweiHan ≥ 0.5 ∧ W ei − Y ingM a < 0.5
ST OC ≥ 1.5 ∧ V LDB < 2.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA < 5.5 ←→ RanCanetti ≥ 0.5 ∨ RanCanetti <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
ECM L ≥ 2.5 ∧ V LDB < 0.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ∨ V LDB ≥ 0.5 ∧ ICDT < 3 ←→
Stef anKramer ≥ 1.5 ∧ JianP ei < 0.5 ∨ Stef anKramer < 1.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥
0.5 ∨ JianP ei ≥ 0.5 ∧ M ichaelBenedikt < 0.5
SIGM ODConf erence ≥ 1.5∨SIGM ODConf erence < 1.5∧V LDB < 0.5∧ICDE ≥ 0.5∨SIGM ODConf erence <
1.5 ∧ V LDB ≥ 0.5 ∧ P ODS < 1.5 ←→ Jef f reyF.N aughton ≥ 0.5 ∨ Jef f reyF.N aughton < 0.5 ∧ M ichaelJ.Carey <
0.5 ∧ Hans − P eterKriegel ≥ 7.5 ∨ Jef f reyF.N aughton < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ CatrielBeeri < 0.5
ICDE < 0.5 ←→ SurajitChaudhuri < 0.5
ST OC ≥ 3.5 ∨ ST OC < 3.5 ∧ F OCS ≥ 2.5 ∧ SODA < 5.5 ←→ IgorShparlinski ≥ 0.5 ∨ IgorShparlinski <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
F OCS < 10.5 ∧ ST OC < 6.5 ∧ SIGM ODConf erence < 0.5 ∨ F OCS < 10.5 ∧ ST OC ≥ 6.5 ∧ ICDT < 0.5 ∨
F OCS ≥ 10.5 ∧ SODA ≥ 2.5 ∧ V LDB ≥ 3 ←→ RichardM.Karp < 1.5 ∧ AviW igderson < 0.5 ∧ HamidP irahesh <
0.5 ∨ RichardM.Karp < 1.5 ∧ AviW igderson ≥ 0.5 ∧ HectorGarcia − M olina < 0.5 ∨ RichardM.Karp ≥ 1.5 ∧
S.M uthukrishnan ≥ 0.5 ∧ JosephN aor ≥ 3.5
F OCS ≥ 5.5 ∨ F OCS < 5.5 ∧ ST OC < 5.5 ∧ SIGM ODConf erence < 0.5 ∨ F OCS < 5.5 ∧ ST OC ≥ 5.5 ∧
SODA < 5.5 ←→ RyanO'Donnell ≥ 1.5 ∨ RyanO'Donnell < 1.5 ∧ AviW igderson < 0.5 ∧ HamidP irahesh <
0.5 ∨ RyanO'Donnell < 1.5 ∧ AviW igderson ≥ 0.5 ∧ M onikaRauchHenzinger < 0.5
SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ←→ P hilipS.Y u ≥ 4.5 ∨ P hilipS.Y u < 4.5 ∧ V ipinKumar ≥
0.5 ∧ SunilP rabhakar < 1.5
J = 0.616, E1,1 = 1397, p-val. = 0.105
J = 0.606, E1,1 = 43, p-val. = 0.000
J = 0.576, E1,1 = 182, p-val. = 0.000
J = 0.564, E1,1 = 211, p-val. = 0.000
J = 0.559, E1,1 = 170, p-val. = 0.000
J = 0.541, E1,1 = 199, p-val. = 0.000
J = 0.537, E1,1 = 29, p-val. = 0.000
J = 0.451, E1,1 = 32, p-val. = 0.000
J = 0.411, E1,1 = 92, p-val. = 0.000
J = 0.400, E1,1 = 6, p-val. = 0.000
J = 0.389, E1,1 = 7, p-val. = 0.000
Redescription
ST OC ≥ 5.5 ∨ ST OC < 5.5 ∧ F OCS < 4.5 ∧ SIGM ODConf erence < 0.5 ∨ ST OC < 5.5 ∧ F OCS ≥ 4.5 ∧ SODA <
5.5 ←→ JamesR.Lee ≥ 3 ∨ JamesR.Lee < 3 ∧ AviW igderson < 0.5 ∧ HamidP irahesh < 0.5 ∨ JamesR.Lee <
3 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
ECM L ≥ 0.5 ∧ EDBT < 0.5 ∨ ECM L < 0.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ∨ EDBT ≥ 0.5 ∧ SIGM ODConf erence <
3 ←→ M icheleSebag ≥ 0.5 ∧ H.V.Jagadish < 0.5 ∨ M icheleSebag < 0.5 ∧ SatinderP.Singh ≥ 0.5 ∧
DavidHeckerman ≥ 0.5 ∨ H.V.Jagadish ≥ 0.5 ∧ M ong − LiLee < 2
SODA ≥ 1.5 ∧ COLT < 5 ∨ SODA < 1.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ Y airBartal ≥ 0.5 ∧ LeonardP itt <
0.5 ∨ Y airBartal < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
SODA ≥ 0.5 ∧ ST ACS < 2.5 ∨ SODA < 0.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ N icoleImmorlica ≥ 0.5 ∧
HarryBuhrman < 0.5 ∨ N icoleImmorlica < 0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5
SODA ≥ 0.5∧COLT < 5∨SODA < 0.5∧ST OC ≥ 0.5∧F OCS ≥ 6.5 ←→ BorisAronov ≥ 0.5∧RobertE.Schapire <
0.5 ∨ BorisAronov < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
SODA ≥ 0.5 ∧ COLT < 0.5 ∨ SODA < 0.5 ∧ ST OC ≥ 1 ∧ F OCS ≥ 6.5 ←→ M ichielH.M.Smid ≥
1.5 ∧ M anf redK.W armuth < 1.5 ∨ M ichielH.M.Smid < 1.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
ICM L ≥ 0.5 ∧ V LDB < 0.5 ∨ ICM L < 0.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ∨ V LDB ≥ 0.5 ∧ W W W < 3 ←→
P eterA.F lach ≥ 0.5 ∧ KrithiRamamritham < 1 ∨ P eterA.F lach < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥
0.5 ∨ KrithiRamamritham ≥ 1 ∧ ArvindHulgeri < 1.5
ICM L ≥ 0.5 ∧ F OCS < 0.5 ∨ ICM L < 0.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ F ernandoC.N.P ereira ≥
0.5 ∧ EricAllender < 0.5 ∨ F ernandoC.N.P ereira < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5
SDM ≥ 0.5 ∧ SIGM ODConf erence < 4 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ∨ SIGM ODConf erence ≥
4 ∧ V LDB < 2.5 ←→ ArindamBanerjee ≥ 2 ∧ M artinL.Kersten < 0.5 ∨ ArindamBanerjee < 2 ∧ V ipinKumar ≥
0.5 ∧ P hilipS.Y u ≥ 0.5 ∨ M artinL.Kersten ≥ 0.5 ∧ JiaweiHan ≥ 2.5
P ODS < 0.5 ∧ ICDT ≥ 0.5 ∧ V LDB ≥ 20.5 ∨ P ODS ≥ 0.5 ∧ ICDE < 0.5 ∧ W W W ≥ 3.5 ←→ RichardA.DeM illo <
1.5∧Y ehoshuaSagiv ≥ 0.5∧SunitaSarawagi ≥ 0.5∨RichardA.DeM illo ≥ 1.5∧RaviKumar ≥ 0.5∧D.Sivakumar ≥
1.5
KDD ≥ 8.5 ∧ SDM ≥ 0.5 ∧ SIGM ODConf erence ≥ 2.5 ←→ KeW ang ≥ 1.5 ∧ P hilipS.Y u ≥ 2.5 ∧ HaixunW ang ≥
0.5
J = 0.318, E1,1 = 50, p-val. = 0.000
J = 0.318, E1,1 = 7, p-val. = 0.000
J = 0.300, E1,1 = 3, p-val. = 0.000
J = 0.299, E1,1 = 38, p-val. = 0.000
J = 0.261, E1,1 = 6, p-val. = 0.000
J = 0.260, E1,1 = 54, p-val. = 0.000
J = 0.250, E1,1 = 22, p-val. = 0.000
J = 0.226, E1,1 = 19, p-val. = 0.000
J = 0.226, E1,1 = 40, p-val. = 0.000
J = 0.207, E1,1 = 35, p-val. = 0.000
J = 0.207, E1,1 = 25, p-val. = 0.000
J = 0.205, E1,1 = 40, p-val. = 0.000
J = 0.194, E1,1 = 7, p-val. = 0.000
Redescription
ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS < 0.5 ∧ ICDT ≥ 3.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA ≥ 11.5 ←→
AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon < 0.5 ∧ EmmanuelW aller ≥ 0.5 ∨ AviW igderson <
0.5 ∧ N ogaAlon ≥ 0.5 ∧ EstherM.Arkin ≥ 0.5
W W W ≥ 2.5 ∨ W W W < 2.5 ∧ ICDE ≥ 21.5 ∧ P KDD ≥ 1.5 ←→ D.Sivakumar ≥ 1.5 ∨ D.Sivakumar <
1.5 ∧ Xif engY an ≥ 4.5 ∧ W eiW ang ≥ 4
ICDE ≥ 12.5 ∧ EDBT ≥ 4.5 ∧ SIGM ODConf erence ≥ 19.5 ←→ AnthonyK.H.T ung ≥ 1.5 ∧ ShaulDar ≥
1.5 ∧ AlonY.Levy ≥ 0.5
ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS < 0.5 ∧ ICDT ≥ 3.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA ≥ 11.5 ←→
AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon < 0.5 ∧ EmmanuelW aller ≥ 0.5 ∨ AviW igderson <
0.5 ∧ N ogaAlon ≥ 0.5 ∧ M artinF arach ≥ 2.5
ICDT ≥ 3.5 ←→ EmmanuelW aller ≥ 0.5
SODA ≥ 5.5 ∨ SODA < 5.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 7.5 ←→ M osesCharikar ≥ 0.5 ∨ M osesCharikar <
0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5
P ODS ≥ 2.5 ∨ P ODS < 2.5 ∧ ICDT ≥ 0.5 ∧ ST OC ≥ 10.5 ←→ CatrielBeeri ≥ 0.5 ∨ CatrielBeeri < 0.5 ∧
W angChiewT an ≥ 0.5 ∧ DonCoppersmith ≥ 0.5
COLT ≥ 3.5 ←→ M anf redK.W armuth ≥ 0.5
SODA < 3.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ∨ SODA ≥ 3.5 ∧ P ODS ≥ 0.5 ∧ ICDT < 0.5 ←→ StevenSkiena < 0.5 ∧
AviW igderson ≥ 0.5∧OdedGoldreich ≥ 0.5∨StevenSkiena ≥ 0.5∧EstherM.Arkin ≥ 0.5∧RajmohanRajaraman <
1.5
ST ACS ≥ 2.5 ∨ ST ACS < 2.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 8.5 ←→ U weSchoning ≥ 0.5 ∨ U weSchoning < 0.5 ∧
AviW igderson ≥ 0.5 ∧ M oniN aor ≥ 0.5
ECM L ≥ 2.5 ∨ ECM L < 2.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→
Stef anKramer ≥ 0.5 ∨ Stef anKramer < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ Stef anKramer <
0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
SODA ≥ 0.5 ∨ SODA < 0.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ ClaireKenyon ≥ 0.5 ∨ ClaireKenyon < 0.5 ∧
AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
ICDM ≥ 3.5 ∧ ICDE ≥ 4.5 ∨ ICDM < 3.5 ∧ SDM ≥ 0.5 ∧ SIGM ODConf erence ≥ 2.5 ←→ ShengM a ≥
0.5 ∧ RaymondT.N g ≥ 0.5 ∨ ShengM a < 0.5 ∧ P hilipS.Y u ≥ 2.5 ∧ Jef f reyXuY u ≥ 1.5
J = 0.194, E1,1 = 38, p-val. = 0.000
J = 0.191, E1,1 = 43, p-val. = 0.000
J = 0.190, E1,1 = 30, p-val. = 0.000
J = 0.189, E1,1 = 165, p-val. = 0.000
J = 0.183, E1,1 = 11, p-val. = 0.000
J = 0.182, E1,1 = 68, p-val. = 0.000
J = 0.180, E1,1 = 37, p-val. = 0.000
J = 0.162, E1,1 = 23, p-val. = 0.000
J = 0.158, E1,1 = 40, p-val. = 0.000
Redescription
SODA ≥ 4.5 ∨ SODA < 4.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ HaimKaplan ≥ 0.5 ∨ HaimKaplan < 0.5 ∧
AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5
P ODS ≥ 0.5 ∨ P ODS < 0.5 ∧ ICDT < 0.5 ∧ V LDB ≥ 5.5 ∨ P ODS < 0.5 ∧ ICDT ≥ 0.5 ∧ SIGM ODConf erence ≥
6.5 ←→ AbrahamSilberschatz ≥ 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ Y ehoshuaSagiv < 0.5 ∧ M ichaelJ.Carey ≥
0.5 ∨ AbrahamSilberschatz < 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∧ DiveshSrivastava ≥ 0.5
U AI ≥ 0.5 ∨ U AI < 0.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ U AI < 0.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→
CraigBoutilier ≥ 0.5 ∨ CraigBoutilier < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ CraigBoutilier <
0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
SIGM ODConf erence ≥ 4.5∨SIGM ODConf erence < 4.5∧V LDB < 5.5∧ICDE ≥ 0.5∨SIGM ODConf erence <
4.5 ∧ V LDB ≥ 5.5 ∧ W W W < 1.5 ←→ RajeevRastogi ≥ 1.5 ∨ RajeevRastogi < 1.5 ∧ P hilipA.Bernstein <
0.5 ∧ JiaweiHan ≥ 0.5 ∨ RajeevRastogi < 1.5 ∧ P hilipA.Bernstein ≥ 0.5 ∧ KevinChen − ChuanChang < 1
ECM L ≥ 2.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ←→ P eterGrunwald ≥ 1.5 ∨ P eterGrunwald <
1.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥ 0.5
P ODS ≥ 0.5 ∧ SIGM ODConf erence ≥ 1.5 ∨ P ODS < 0.5 ∧ ICDT < 0.5 ∧ V LDB ≥ 2.5 ∨ P ODS <
0.5 ∧ ICDT ≥ 0.5 ∧ F OCS ≥ 5.5 ∨ SIGM ODConf erence < 1.5 ∧ ICDE < 1.5 ←→ AbrahamSilberschatz ≥
0.5 ∧ AlbrechtSchmidt < 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ SergeAbiteboul < 0.5 ∧ RakeshAgrawal ≥
0.5 ∨ AbrahamSilberschatz < 0.5 ∧ SergeAbiteboul ≥ 0.5 ∧ M ihalisY annakakis ≥ 1.5 ∨ AlbrechtSchmidt ≥
0.5 ∧ GioW iederhold < 0.5
SODA ≥ 2.5 ∨ SODA < 2.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ RichardM.Karp ≥ 2.5 ∨ RichardM.Karp <
2.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
ECM L ≥ 1.5 ∨ ECM L < 1.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 1.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→
JuhoRousu ≥ 0.5 ∨ JuhoRousu < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ JuhoRousu < 0.5 ∧
RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
SIGM ODConf erence ≥ 1.5 ∨ SIGM ODConf erence < 1.5 ∧ V LDB < 0.5 ∧ U AI ≥ 2.5 ∨ SIGM ODConf erence <
1.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 8.5 ←→ HectorGarcia − M olina ≥ 0.5 ∨ HectorGarcia − M olina < 0.5 ∧
M ichaelJ.Carey < 0.5 ∧ CraigBoutilier ≥ 0.5 ∨ HectorGarcia − M olina < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧
BengChinOoi ≥ 0.5
J = 0.157, E1,1 = 37, p-val. = 0.000
J = 0.157, E1,1 = 22, p-val. = 0.000
J = 0.152, E1,1 = 20, p-val. = 0.000
J = 0.146, E1,1 = 37, p-val. = 0.000
J = 0.143, E1,1 = 2, p-val. = 0.000
J = 0.142, E1,1 = 26, p-val. = 0.000
J = 0.141, E1,1 = 9, p-val. = 0.000
J = 0.125, E1,1 = 35, p-val. = 0.000
J = 0.124, E1,1 = 11, p-val. = 0.000
J = 0.117, E1,1 = 49, p-val. = 0.000
J = 0.115, E1,1 = 18, p-val. = 0.000
Redescription
SODA ≥ 1.5 ∨ SODA < 1.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ GerthStoltingBrodal ≥ 0.5 ∨ GerthStoltingBrodal <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
U AI ≥ 0.5 ∨ U AI < 0.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ U AI < 0.5 ∧ ICM L ≥ 0.5 ∧ ST OC ≥ 0.5 ←→ W olf −
T iloBalke ≥ 3 ∨ W olf − T iloBalke < 3 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ W olf − T iloBalke <
3 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
ICDE ≥ 3.5 ∨ ICDE < 3.5 ∧ V LDB < 0.5 ∧ ST ACS ≥ 8.5 ∨ ICDE < 3.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence ≥
13.5 ←→ N arainH.Gehani ≥ 0.5 ∨ N arainH.Gehani < 0.5 ∧ DavidJ.DeW itt < 0.5 ∧ StephenA.F enner ≥ 0.5 ∨
N arainH.Gehani < 0.5 ∧ DavidJ.DeW itt ≥ 0.5 ∧ Jef f reyF.N aughton ≥ 0.5
SODA ≥ 2.5 ∨ SODA < 2.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ V ahabS.M irrokni ≥ 0.5 ∨ V ahabS.M irrokni <
0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5
P KDD ≥ 4.5 ←→ ArnoJ.Knobbe ≥ 0.5
W W W ≥ 2.5 ∨ W W W < 2.5 ∧ V LDB < 0.5 ∧ SIGM ODConf erence ≥ 1.5 ∨ W W W < 2.5 ∧ V LDB ≥ 0.5 ∧
ICDE ≥ 8.5 ←→ DanielGruhl ≥ 0.5 ∨ DanielGruhl < 0.5 ∧ M ichaelJ.Carey < 0.5 ∧ HectorGarcia − M olina ≥
0.5 ∨ DanielGruhl < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ BengChinOoi ≥ 0.5
ICM L ≥ 4.5 ∨ ICM L < 4.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ DavidP age ≥ 1.5 ∨ DavidP age < 1.5 ∧
SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5
EDBT < 0.5 ∧ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ∨ EDBT ≥
0.5 ∧ ICDT ≥ 0.5 ∧ F OCS ≥ 0.5 ←→ BernhardSeeger < 2.5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨
BernhardSeeger < 2.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5 ∨ BernhardSeeger ≥ 2.5 ∧ DanSuciu ≥
0.5 ∧ SophieCluet ≥ 0.5
ICM L ≥ 4.5 ∨ ICM L < 4.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ T omM.M itchell ≥ 0.5 ∨ T omM.M itchell <
0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 1.5
KDD ≥ 3.5 ∨ KDD < 3.5 ∧ ICDM < 1.5 ∧ ICDE ≥ 2.5 ∨ KDD < 3.5 ∧ ICDM ≥ 1.5 ∧ V LDB ≥ 2.5 ←→
SugatoBasu ≥ 2.5 ∨ SugatoBasu < 2.5 ∧ HaixunW ang < 0.5 ∧ DiveshSrivastava ≥ 0.5 ∨ SugatoBasu < 2.5 ∧
HaixunW ang ≥ 0.5 ∧ Jef f reyXuY u ≥ 1.5
SIGM ODConf erence ≥ 1.5 ∨ SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 11.5 ←→ HamidP irahesh ≥
0.5 ∨ HamidP irahesh < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ AnthonyK.H.T ung ≥ 1.5
J = 0.114, E1,1 = 39, p-val. = 0.000
J = 0.110, E1,1 = 42, p-val. = 0.000
J = 0.108, E1,1 = 70, p-val. = 0.000
J = 0.108, E1,1 = 93, p-val. = 0.000
J = 0.102, E1,1 = 33, p-val. = 0.000
J = 0.099, E1,1 = 9, p-val. = 0.000
J = 0.095, E1,1 = 20, p-val. = 0.000
J = 0.091, E1,1 = 32, p-val. = 0.000
J = 0.089, E1,1 = 49, p-val. = 0.000
J = 0.088, E1,1 = 77, p-val. = 0.000
Redescription
W W W ≥ 1.5 ∨ W W W < 1.5 ∧ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∨ W W W < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥
8.5 ←→ DanielF.Lieuwen ≥ 5 ∨ DanielF.Lieuwen < 5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨
DanielF.Lieuwen < 5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5
SIGM ODConf erence < 4.5∧V LDB < 5.5∧EDBT ≥ 0.5∨SIGM ODConf erence < 4.5∧V LDB ≥ 5.5∧P ODS ≥
1.5 ∨ SIGM ODConf erence ≥ 4.5 ∧ ICDE ≥ 2.5 ∧ SDM < 1.5 ←→ AristidesGionis < 0.5 ∧ H.V.Jagadish <
0.5 ∧ N ickKoudas ≥ 0.5 ∨ AristidesGionis < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∨ AristidesGionis ≥
0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∧ Kun − LungW u < 1
ICDT ≥ 1.5 ∨ ICDT < 1.5 ∧ P ODS < 3.5 ∧ V LDB ≥ 1.5 ∨ ICDT < 1.5 ∧ P ODS ≥ 3.5 ∧ SIGM ODConf erence ≥
2.5 ←→ GeorgGottlob ≥ 0.5 ∨ GeorgGottlob < 0.5 ∧ CatrielBeeri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ GeorgGottlob <
0.5 ∧ CatrielBeeri ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5
ICDM ≥ 0.5 ∧ ICM L < 2.5 ∨ ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥
0.5 ←→ V ipinKumar ≥ 0.5 ∧ DmitryP avlov < 2.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan < 0.5 ∧ GioW iederhold ≥
0.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5
EDBT ≥ 0.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ←→ RakeshAgrawal ≥ 1.5 ∨
RakeshAgrawal < 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RaghuRamakrishnan ≥ 0.5
ICM L ≥ 1.5 ∨ ICM L < 1.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ V ladimirV apnik ≥ 0.5 ∨ V ladimirV apnik <
0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5
SDM ≥ 0.5 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 1.5 ←→ HuanLiu ≥ 1.5 ∨ HuanLiu < 1.5 ∧ W eiF an ≥
0.5 ∧ HaixunW ang ≥ 0.5
ICDE ≥ 0.5 ∨ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence ≥ 3.5 ←→ GerhardW eikum ≥ 0.5 ∨
GerhardW eikum < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ H.V.Jagadish ≥ 1.5
V LDB ≥ 5.5 ∨ V LDB < 5.5 ∧ SIGM ODConf erence < 4.5 ∧ ICDE ≥ 1.5 ∨ V LDB < 5.5 ∧ SIGM ODConf erence ≥
4.5∧P ODS ≥ 0.5 ←→ LuizT ucherman ≥ 4∨LuizT ucherman < 4∧H.V.Jagadish < 0.5∧AhmedK.Elmagarmid ≥
0.5 ∨ LuizT ucherman < 4 ∧ H.V.Jagadish ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5
ICDM ≥ 1.5 ∨ ICDM < 1.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 1.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ←→
HaixunW ang ≥ 0.5 ∨ HaixunW ang < 0.5 ∧ JiaweiHan < 1.5 ∧ GioW iederhold ≥ 0.5 ∨ HaixunW ang < 0.5 ∧
JiaweiHan ≥ 1.5 ∧ SurajitChaudhuri ≥ 0.5
J = 0.086, E1,1 = 8, p-val. = 0.000
J = 0.085, E1,1 = 17, p-val. = 0.000
J = 0.084, E1,1 = 55, p-val. = 0.000
J = 0.084, E1,1 = 40, p-val. = 0.000
J = 0.082, E1,1 = 36, p-val. = 0.000
J = 0.081, E1,1 = 66, p-val. = 0.000
J = 0.080, E1,1 = 69, p-val. = 0.000
J = 0.074, E1,1 = 32, p-val. = 0.000
J = 0.074, E1,1 = 52, p-val. = 0.000
Redescription
ICML ≥ 0.5 ∨ ICML < 0.5 ∧ ECML ≥ 0.5 ∧ UAI ≥ 8 ←→ PeterA.Flach ≥ 0.5 ∨ PeterA.Flach < 0.5 ∧ StephenMuggleton ≥ 0.5 ∧ DaphneKoller ≥ 2.5
PKDD ≥ 0.5 ∨ PKDD < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 2.5 ←→ XingquanZhu ≥ 3 ∨ XingquanZhu < 3 ∧ JiaweiHan ≥ 1.5 ∧ RaymondT.Ng ≥ 0.5
EDBT ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE < 2.5 ∧ VLDB ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE ≥ 2.5 ∧ SIGMODConference ≥ 1.5 ←→ ElenaBaralis ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish < 0.5 ∧ BruceG.Lindsay ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ HectorGarcia-Molina ≥ 0.5
VLDB ≥ 5.5 ∨ VLDB < 5.5 ∧ SIGMODConference < 4.5 ∧ PODS ≥ 0.5 ∨ VLDB < 5.5 ∧ SIGMODConference ≥ 4.5 ∧ EDBT ≥ 0.5 ←→ WeiminDu ≥ 0.5 ∨ WeiminDu < 0.5 ∧ AbrahamSilberschatz < 0.5 ∧ YehoshuaSagiv ≥ 0.5 ∨ WeiminDu < 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∧ RakeshAgrawal ≥ 1.5
SIGMODConference ≥ 5.5 ∨ SIGMODConference < 5.5 ∧ VLDB < 5.5 ∧ PODS ≥ 0.5 ∨ SIGMODConference < 5.5 ∧ VLDB ≥ 5.5 ∧ ICDT ≥ 0.5 ←→ RaymondT.Ng ≥ 0.5 ∨ RaymondT.Ng < 0.5 ∧ RaghuRamakrishnan < 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∨ RaymondT.Ng < 0.5 ∧ RaghuRamakrishnan ≥ 0.5 ∧ JeffreyD.Ullman ≥ 1.5
ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ VLDB ≥ 0.5 ∨ ICDM ≥ 0.5 ∧ SDM ≥ 0.5 ∧ EDBT ≥ 1.5 ←→ VipinKumar < 0.5 ∧ JiaweiHan < 1.5 ∧ GioWiederhold ≥ 0.5 ∨ VipinKumar < 0.5 ∧ JiaweiHan ≥ 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∨ VipinKumar ≥ 0.5 ∧ PhilipS.Yu ≥ 0.5 ∧ XueminLin ≥ 0.5
PKDD ≥ 3.5 ∨ PKDD < 3.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ PKDD < 3.5 ∧ KDD ≥ 0.5 ∧ VLDB ≥ 0.5 ←→ MarcSebban ≥ 0.5 ∨ MarcSebban < 0.5 ∧ JiaweiHan < 0.5 ∧ GioWiederhold ≥ 0.5 ∨ MarcSebban < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5
SIGMODConference < 3.5 ∧ VLDB < 5.5 ∧ ICDE ≥ 1.5 ∨ SIGMODConference < 3.5 ∧ VLDB ≥ 5.5 ∧ KDD ≥ 6.5 ∨ SIGMODConference ≥ 3.5 ∧ PODS ≥ 0.5 ∧ ICDT ≥ 2 ←→ SurajitChaudhuri < 1.5 ∧ H.V.Jagadish < 0.5 ∧ NickKoudas ≥ 0.5 ∨ SurajitChaudhuri < 1.5 ∧ H.V.Jagadish ≥ 0.5 ∧ KeWang ≥ 5 ∨ SurajitChaudhuri ≥ 1.5 ∧ RaghuRamakrishnan ≥ 2 ∧ AlbertoO.Mendelzon ≥ 0.5
EDBT < 3.5 ∧ ICDE < 4.5 ∧ SIGMODConference ≥ 0.5 ∨ EDBT < 3.5 ∧ ICDE ≥ 4.5 ∧ PODS ≥ 11 ∨ EDBT ≥ 3.5 ∧ VLDB ≥ 6.5 ∧ FOCS < 4.5 ←→ GustavoAlonso < 5 ∧ SurajitChaudhuri < 0.5 ∧ HamidPirahesh ≥ 0.5 ∨ GustavoAlonso < 5 ∧ SurajitChaudhuri ≥ 0.5 ∧ S.Sudarshan ≥ 7.5 ∨ GustavoAlonso ≥ 5 ∧ RadekVingralek ≥ 0.5 ∧ MonikaRauchHenzinger < 0.5
J      E1,1  p-val.
0.071  53    0.000
0.068  20    0.000
0.065  20    0.000
0.063  32    0.000
0.063  50    0.000
Redescription
PODS ≥ 1.5 ∨ PODS < 1.5 ∧ ICDT < 0.5 ∧ SIGMODConference ≥ 0.5 ∨ PODS < 1.5 ∧ ICDT ≥ 0.5 ∧ VLDB ≥ 20.5 ←→ MichaelKifer ≥ 2.5 ∨ MichaelKifer < 2.5 ∧ YehoshuaSagiv < 0.5 ∧ HamidPirahesh ≥ 0.5 ∨ MichaelKifer < 2.5 ∧ YehoshuaSagiv ≥ 0.5 ∧ SunitaSarawagi ≥ 0.5
KDD ≥ 0.5 ∨ KDD < 0.5 ∧ ICDM ≥ 0.5 ∧ VLDB ≥ 0.5 ←→ YasuhikoMorimoto ≥ 0.5 ∨ YasuhikoMorimoto < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ JianPei ≥ 0.5
ICDM ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 0.5 ←→ GeoffreyI.Webb ≥ 0.5 ∨ GeoffreyI.Webb < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ PhilipS.Yu ≥ 3.5
SIGMODConference ≥ 4.5 ∨ SIGMODConference < 4.5 ∧ VLDB < 4.5 ∧ ICDE ≥ 1.5 ∨ SIGMODConference < 4.5 ∧ VLDB ≥ 4.5 ∧ PODS ≥ 1.5 ←→ SetragKhoshafian ≥ 2.5 ∨ SetragKhoshafian < 2.5 ∧ DiveshSrivastava < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∨ SetragKhoshafian < 2.5 ∧ DiveshSrivastava ≥ 0.5 ∧ JeffreyD.Ullman ≥ 1.5
VLDB ≥ 11.5 ∧ WWW ≥ 0.5 ∨ VLDB < 11.5 ∧ SIGMODConference < 8.5 ∧ ICDE ≥ 0.5 ∨ VLDB < 11.5 ∧ SIGMODConference ≥ 8.5 ∧ PODS ≥ 3.5 ←→ AlfonsKemper ≥ 7.5 ∧ JunYang ≥ 0.5 ∨ AlfonsKemper < 7.5 ∧ RakeshAgrawal < 0.5 ∧ GioWiederhold ≥ 0.5 ∨ AlfonsKemper < 7.5 ∧ RakeshAgrawal ≥ 0.5 ∧ FlipKorn ≥ 2
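For reference, the quantities reported in these tables can be recomputed from the support sets of the two queries of a redescription. The following minimal Python sketch is illustrative only: the helper name redescription_stats is ours, and the binomial-tail p-value is an assumed convention (at least E1,1 common entities under independence of the two queries with their observed marginal frequencies) that may differ from the exact convention used in the experiments.

    import numpy as np
    from scipy.stats import binom

    def redescription_stats(lhs_support, rhs_support):
        # lhs_support / rhs_support: Boolean vectors over the same entities,
        # marking which entities satisfy the LHS and the RHS query.
        lhs = np.asarray(lhs_support, dtype=bool)
        rhs = np.asarray(rhs_support, dtype=bool)
        n = lhs.size
        e11 = int(np.sum(lhs & rhs))                 # E1,1: entities satisfying both queries
        union = int(np.sum(lhs | rhs))
        jaccard = e11 / union if union > 0 else 0.0  # J = |E1,1| / |E1,1 ∪ E1,0 ∪ E0,1|
        # Assumed p-value convention: binomial tail probability of observing at
        # least E1,1 common entities if the queries held independently, each with
        # its observed marginal frequency.
        p_val = float(binom.sf(e11 - 1, n, lhs.mean() * rhs.mean()))
        return jaccard, e11, p_val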
Table B.5: Redescriptions mined by Algorithm 2 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini-impurity measure; min bucket = Li/100). LHS is the left-hand-side part of the redescription; RHS is the right-hand-side part of the redescription; J - Jaccard similarity; E1,1 - support; p-val. - p-value; CONF[a-b] - author submitted from a to b papers for conference CONF.
J      E1,1  p-val.
1.000  7     0.000
1.000  5     0.000
1.000  2     0.000
1.000  3     0.000
1.000  3     0.000
1.000  2     0.000
1.000  1     0.000
1.000  1     0.000
1.000  1     0.000
1.000  4     0.000
1.000  1     0.000
1.000  1     0.000
1.000  1     0.000
0.969  31    0.000
0.951  2210  0.067
0.949  2217  0.284
0.947  2194  0.021
Redescription
FOCS ≥ 19.5 ∧ STOC ≥ 15.5 ∧ ICDT < 1.5 ←→ JohanHastad ≥ 2.5 ∧ SilvioMicali ≥ 0.5 ∧ KunalTalwar < 1
VLDB < 20.5 ∧ SIGMODConference ≥ 21 ∧ EDBT ≥ 7 ∨ VLDB ≥ 20.5 ∧ PODS ≥ 8.5 ∧ FOCS < 0.5 ←→ S.Sudarshan < 7.5 ∧ NarainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 1.5 ∨ S.Sudarshan ≥ 7.5 ∧ JohannesGehrke ≥ 0.5 ∧ MinosN.Garofalakis < 3
VLDB ≥ 26.5 ∧ KDD ≥ 0.5 ∧ WWW < 3 ←→ NarainH.Gehani ≥ 4.5 ∧ JigneshM.Patel ≥ 2.5 ∧ SaktiP.Ghosh < 1
ICDT < 7.5 ∧ PODS ≥ 21.5 ∧ VLDB ≥ 4 ∨ ICDT ≥ 7.5 ∧ SODA ≥ 0.5 ∧ SDM < 0.5 ←→ IoanaManolescu < 3.5 ∧ YaronKanza ≥ 10.5 ∧ WernerNutt ≥ 4.5 ∨ IoanaManolescu ≥ 3.5 ∧ SophieCluet ≥ 6.5 ∧ NeoklisPolyzotis ≥ 0.5
STACS ≥ 12.5 ∧ SODA ≥ 1.5 ←→ LeenTorenvliet ≥ 1.5 ∧ DietervanMelkebeek ≥ 0.5
WWW ≥ 6.5 ∧ FOCS ≥ 1 ←→ R.Guha ≥ 2.5 ∧ AnirbanDasgupta ≥ 0.5
PKDD ≥ 11 ←→ ShojiHirano ≥ 15
ICDM ≥ 15.5 ←→ Kun-LungWu ≥ 24
EDBT ≥ 9.5 ←→ MichaelGillmann ≥ 8
ICDM < 7.5 ∧ SDM ≥ 4.5 ∧ VLDB ≥ 2.5 ∨ ICDM ≥ 7.5 ∧ EDBT ≥ 1.5 ∧ ICDT < 1.5 ←→ Chang-ShingPerng < 5.5 ∧ Kun-LungWu ≥ 6 ∧ AleksandarLazarevic < 1.5 ∨ Chang-ShingPerng ≥ 5.5 ∧ WeiWang ≥ 3 ∧ KeWang ≥ 0.5
UAI ≥ 36 ←→ BrendanJ.Frey ≥ 1.5
ECML ≥ 7.5 ∧ PKDD ≥ 3 ←→ JamesCussens ≥ 1.5 ∧ NadaLavrac ≥ 3.5
ICML ≥ 16.5 ←→ JeffG.Schneider ≥ 7.5
FOCS ≥ 20.5 ∧ SODA < 1.5 ∨ FOCS < 20.5 ∧ STOC < 13.5 ∧ STACS ≥ 14.5 ∨ FOCS < 20.5 ∧ STOC ≥ 13.5 ∧ COLT < 3 ←→ SilvioMicali ≥ 1.5 ∧ ShafiGoldwasser ≥ 1.5 ∨ SilvioMicali < 1.5 ∧ AviWigderson < 0.5 ∧ LeenTorenvliet ≥ 7.5 ∨ SilvioMicali < 1.5 ∧ AviWigderson ≥ 0.5 ∧ DanRoth < 0.5
STOC < 8.5 ∨ STOC ≥ 8.5 ∧ FOCS < 7.5 ←→ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ RonaldL.Rivest < 0.5
STACS ≥ 2.5 ∨ STACS < 2.5 ∧ SIGMODConference ≥ 0.5 ∧ PODS < 13.5 ∨ STACS < 2.5 ∧ SIGMODConference < 0.5 ∧ ICDT < 0.5 ←→ LeszekGasieniec ≥ 2.5 ∨ LeszekGasieniec < 2.5 ∧ HectorGarcia-Molina ≥ 0.5 ∧ CatrielBeeri < 2.5 ∨ LeszekGasieniec < 2.5 ∧ HectorGarcia-Molina < 0.5 ∧ DilysThomas < 1.5
STOC < 8.5 ∨ STOC ≥ 8.5 ∧ FOCS < 7.5 ←→ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ MoniNaor < 0.5
J      E1,1  p-val.
0.945  121   0.000
0.922  2144  0.170
0.909  110   0.000
0.898  184   0.000
0.880  44    0.000
0.847  116   0.000
0.833  15    0.000
0.826  271   0.000
0.800  4     0.000
Redescription
SODA < 7.5 ∧ ST OC < 6.5 ∧ ST ACS ≥ 14.5 ∨ SODA < 7.5 ∧ ST OC ≥ 6.5 ∧ ICDT < 0.5 ∨ SODA ≥ 7.5 ∧ F OCS ≥
8.5 ∧ V LDB ≥ 0.5 ←→ M osesCharikar < 0.5 ∧ AviW igderson < 0.5 ∧ LeenT orenvliet ≥ 7 ∨ M osesCharikar <
0.5 ∧ AviW igderson ≥ 0.5 ∧ HectorGarcia − M olina < 0.5 ∨ M osesCharikar ≥ 0.5 ∧ BonnieBerger ≥ 0.5 ∧
JohnE.Hopcrof t ≥ 0.5
U AI ≥ 2.5 ∨ U AI < 2.5 ∧ ICDE ≥ 0.5 ∧ V LDB < 6.5 ∨ U AI < 2.5 ∧ ICDE < 0.5 ∧ SIGM ODConf erence < 7.5 ←→
T omiSilander ≥ 0.5 ∨ T omiSilander < 0.5 ∧ GioW iederhold ≥ 0.5 ∧ M ichaelJ.Carey < 1.5 ∨ T omiSilander <
0.5 ∧ GioW iederhold < 0.5 ∧ SophieCluet < 7
P KDD ≥ 2.5∧SIGM ODConf erence < 0.5∨P KDD < 2.5∧ICDM ≥ 3.5∧KDD ≥ 1.5∨SIGM ODConf erence ≥
0.5 ∧ V LDB < 11 ←→ StephaneLallich ≥ 0.5 ∧ RaymondT.N g < 1.5 ∨ StephaneLallich < 0.5 ∧ HaixunW ang ≥
3.5 ∧ P hilipS.Y u ≥ 0.5 ∨ RaymondT.N g ≥ 1.5 ∧ W eiW ang < 3.5
SDM ≥ 0.5 ∧ ST OC < 0.5 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ EDBT ≥ 1.5 ∨ ST OC ≥ 0.5 ∧ F OCS < 7 ←→
V ipinKumar ≥ 0.5 ∧ SridharRajagopalan < 2 ∨ V ipinKumar < 0.5 ∧ M atthiasSchubert ≥ 0.5 ∧ P eterKunath ≥
0.5 ∨ SridharRajagopalan ≥ 2 ∧ V enkatesanGuruswami < 0.5
SIGM ODConf erence < 11.5 ∧ V LDB ≥ 15.5 ∧ ICDE < 19.5 ∨ SIGM ODConf erence ≥ 11.5 ∧ SODA < 1.5 ∧
P ODS < 6.5 ←→ BanuOzden < 1.5 ∧ JigneshM.P atel ≥ 1.5 ∧ U meshwarDayal < 0.5 ∨ BanuOzden ≥ 1.5 ∧
M arioSzegedy < 0.5 ∧ SophieCluet < 0.5
SDM ≥ 0.5 ∧ P KDD < 1.5 ∨ SDM < 0.5 ∧ ICDM ≥ 2.5 ∧ KDD ≥ 3.5 ∨ P KDD ≥ 1.5 ∧ ICDE < 0.5 ←→
ArindamBanerjee ≥ 2 ∧ GiuseppeM anco < 0.5 ∨ ArindamBanerjee < 2 ∧ HaixunW ang ≥ 0.5 ∧ JiaweiHan ≥
6 ∨ GiuseppeM anco ≥ 0.5 ∧ DinoP edreschi < 7
ICDE ≥ 12.5 ∧ EDBT ≥ 2 ∨ ICDE < 12.5 ∧ SIGM ODConf erence ≥ 19.5 ∧ W W W ≥ 0.5 ←→ F lipKorn ≥
0.5 ∧ KrithiRamamritham < 4.5 ∨ F lipKorn < 0.5 ∧ SudarshanS.Chawathe ≥ 3 ∧ M ayankBawa ≥ 0.5
SODA ≥ 10.5 ∨ SODA < 10.5 ∧ F OCS < 3.5 ∧ ST OC ≥ 6.5 ∨ SODA < 10.5 ∧ F OCS ≥ 3.5 ∧ P ODS <
1.5 ←→ M artinP al ≥ 1.5 ∨ M artinP al < 1.5 ∧ AviW igderson < 0.5 ∧ RichardJ.Anderson ≥ 0.5 ∨ M artinP al <
1.5 ∧ AviW igderson ≥ 0.5 ∧ RonaldF agin < 1.5
P ODS < 13.5 ∧ ICDT ≥ 7.5 ∨ P ODS ≥ 13.5 ∧ V LDB ≥ 4.5 ←→ M arianoP.Consens < 1.5 ∧ SophieCluet ≥
7.5 ∨ M arianoP.Consens ≥ 1.5 ∧ DiveshSrivastava ≥ 0.5
J      E1,1  p-val.
0.800  4     0.000
0.800  4     0.000
0.800  4     0.000
0.800  4     0.000
0.671  47    0.000
0.667  2     0.000
0.667  2     0.000
0.615  40    0.000
0.605  52    0.000
0.600  3     0.000
0.571  4     0.000
0.556  5     0.000
0.524  43    0.000
0.500  6     0.000
Redescription
V LDB < 14.5 ∧ SIGM ODConf erence ≥ 19.5 ∧ EDBT ≥ 7 ∨ V LDB ≥ 14.5 ∧ ICDE ≥ 13.5 ∧ P ODS <
0.5 ←→ SungRanCho < 1 ∧ N arainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 2 ∨ SungRanCho ≥ 1 ∧ BengChinOoi ≥
0.5 ∧ ShojiroN ishio ≥ 0.5
V LDB ≥ 26.5 ∧ ICDT < 1 ∨ ICDT ≥ 1 ∧ W W W < 1 ←→ N arainH.Gehani ≥ 4.5 ∧ Y uriBreitbart < 0.5 ∨
Y uriBreitbart ≥ 0.5 ∧ RaymondT.N g ≥ 1
COLT ≥ 22.5 ∧ ICM L ≥ 4.5 ←→ DavidP.Helmbold ≥ 3.5 ∧ SallyA.Goldman ≥ 0.5
KDD ≥ 16.5 ∧ ICDT < 2 ←→ KeW ang ≥ 5.5 ∧ BelaBollobas < 0.5
F OCS ≥ 11.5 ∧ COLT < 19.5 ∨ F OCS < 11.5 ∧ ST OC ≥ 13.5 ∧ SODA < 1.5 ←→ SantoshV empala ≥ 2.5 ∧
W ayneEberly < 0.5 ∨ SantoshV empala < 2.5 ∧ SilvioM icali ≥ 0.5 ∧ Shaf iGoldwasser ≥ 1.5
SODA ≥ 25 ∧ ICDT ≥ 0.5 ∧ ST OC ≥ 4.5 ←→ M arkH.Overmars ≥ 3.5 ∧ RajmohanRajaraman ≥ 1 ∧
HerbertEdelsbrunner < 2.5
SDM ≥ 5.5 ∨ SDM < 5.5 ∧ ICDM ≥ 15.5 ←→ HuiXiong ≥ 6.5 ∨ HuiXiong < 6.5 ∧ Kun − LungW u ≥ 22
SIGM ODConf erence < 7.5 ∧ V LDB ≥ 10.5 ∧ P ODS ≥ 0.5 ∨ SIGM ODConf erence ≥ 7.5 ∧ ICDE ≥ 2.5 ∧
SDM ≥ 1.5 ←→ M ichaelJ.Carey < 0.5 ∧ H.V.Jagadish ≥ 2.5 ∧ ErichJ.N euhold < 1.5 ∨ M ichaelJ.Carey ≥
0.5 ∧ P hilipS.Y u ≥ 1.5 ∧ JiaweiHan ≥ 0.5
ST OC < 8.5 ∧ F OCS ≥ 8 ∧ COLT < 1.5 ∨ ST OC ≥ 8.5 ∧ SODA ≥ 5.5 ∧ V LDB ≥ 2.5 ←→ M oniN aor <
0.5 ∧ AviW igderson ≥ 0.5 ∧ Jef f reyC.Jackson < 0.5 ∨ M oniN aor ≥ 0.5 ∧ M osesCharikar ≥ 0.5 ∧ N inaM ishra ≥ 1
ICDT ≥ 5.5 ∨ ICDT < 5.5 ∧ P ODS ≥ 19.5 ∧ V LDB ≥ 10 ←→ F rankN even ≥ 3.5 ∨ F rankN even < 3.5 ∧
SophieCluet ≥ 8.5 ∧ T irthankarLahiri ≥ 1
V LDB ≥ 18.5 ∧ EDBT ≥ 4.5 ∨ V LDB < 18.5 ∧ SIGM ODConf erence ≥ 22 ∧ P ODS ≥ 9.5 ←→
AbrahamSilberschatz ≥ 0.5 ∧ ShaulDar ≥ 1.5 ∨ AbrahamSilberschatz < 0.5 ∧ P raveenSeshadri ≥ 5.5 ∧
JohannesGehrke ≥ 1.5
COLT ≥ 16.5 ∧ F OCS ≥ 1.5 ∨ COLT < 16.5 ∧ ICM L ≥ 9.5 ∧ U AI ≥ 1.5 ←→ JohnLangf ord ≥ 1.5 ∧
SallyA.Goldman ≥ 0.5 ∨ JohnLangf ord < 1.5 ∧ DavidCohn ≥ 0.5 ∧ JustinA.Boyan ≥ 1
SODA ≥ 17.5 ∨ SODA < 17.5 ∧ F OCS ≥ 10.5 ∧ ST OC ≥ 9.5 ←→ RichardCole ≥ 0.5 ∨ RichardCole < 0.5 ∧
LaszloLovasz ≥ 0.5 ∧ JurisHartmanis < 0.5
ICDM ≥ 6.5 ∨ ICDM < 6.5 ∧ SDM ≥ 10.5 ←→ P hilipS.Y u ≥ 11.5 ∨ P hilipS.Y u < 11.5 ∧ Kun − LungW u ≥ 22
J      E1,1  p-val.
0.500  39    0.000
0.500  1     0.001
0.418  133   0.000
0.409  199   0.000
0.375  18    0.000
0.353  6     0.000
0.337  33    0.000
0.333  4     0.000
0.333  1     0.001
0.333  2     0.000
0.328  19    0.000
0.314  11    0.000
Redescription
U AI ≥ 9.5 ∧ V LDB < 0.5 ∨ U AI < 9.5 ∧ ICM L < 0.5 ∧ COLT ≥ 7.5 ∨ U AI < 9.5 ∧ ICM L ≥ 0.5 ∧ ST OC ≥ 0.5 ←→
M axHenrion ≥ 0.5 ∧ DmitryP avlov < 1 ∨ M axHenrion < 0.5 ∧ RobertE.Schapire < 0.5 ∧ P eterL.Bartlett ≥
1.5 ∨ M axHenrion < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
KDD ≥ 20.5 ∧ P KDD ≥ 3 ←→ AnthonyK.H.T ung ≥ 7 ∧ ShojiroN ishio ≥ 1
U AI < 23 ∧ SDM < 0.5 ∧ KDD ≥ 1.5 ∨ U AI < 23 ∧ SDM ≥ 0.5 ∧ ICDM ≥ 0.5 ∨ U AI ≥ 23 ∧ ICM L <
1.5 ∧ COLT < 0.5 ←→ DavidHeckerman < 2.5 ∧ V ipinKumar < 0.5 ∧ King − IpLin ≥ 0.5 ∨ DavidHeckerman <
2.5 ∧ V ipinKumar ≥ 0.5 ∧ P ascalP oncelet < 1 ∨ DavidHeckerman ≥ 2.5 ∧ M oisesGoldszmidt < 0.5 ∧ BlaiBonet <
0.5
ST ACS ≥ 5.5 ∧ V LDB < 1 ∨ ST ACS < 5.5 ∧ F OCS < 5.5 ∧ ST OC ≥ 2.5 ∨ ST ACS < 5.5 ∧ F OCS ≥ 5.5 ∧
SODA < 5.5 ←→ RiccardoSilvestri ≥ 1.5 ∧ KennethW.Regan < 0.5 ∨ RiccardoSilvestri < 1.5 ∧ AviW igderson <
0.5 ∧ M oniN aor ≥ 0.5 ∨ RiccardoSilvestri < 1.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5
COLT ≥ 7.5 ∨ COLT < 7.5 ∧ ICM L < 3.5 ∧ U AI ≥ 14.5 ∨ COLT < 7.5 ∧ ICM L ≥ 3.5 ∧
F OCS ≥ 0.5 ←→ M anf redK.W armuth ≥ 1.5 ∨ M anf redK.W armuth < 1.5 ∧ DavidA.M cAllester < 0.5 ∧
DavidM axwellChickering ≥ 0.5 ∨ M anf redK.W armuth < 1.5 ∧ DavidA.M cAllester ≥ 0.5 ∧ Y oavF reund ≥ 1.5
P ODS ≥ 11.5 ∨ P ODS < 11.5 ∧ ICDT ≥ 7.5 ←→ F rankN even ≥ 0.5 ∨ F rankN even < 0.5 ∧ SophieCluet ≥ 7.5
ST OC ≥ 8.5 ∨ ST OC < 8.5 ∧ F OCS ≥ 8.5 ∧ SODA < 1.5 ←→ AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧
SalilP.V adhan ≥ 4.5 ∧ Shaf iGoldwasser ≥ 0.5
W W W ≥ 4.5 ∨ W W W < 4.5 ∧ V LDB ≥ 25.5 ∧ EDBT ≥ 4 ←→ RudiStuder ≥ 6.5 ∨ RudiStuder < 6.5 ∧
N arainH.Gehani ≥ 4.5 ∧ ShivakumarV enkataraman < 0.5
P KDD ≥ 8 ←→ EllaBingham ≥ 2.5
V LDB ≥ 28 ←→ JigneshM.P atel ≥ 1.5
ECM L < 2.5 ∧ ICM L < 0.5 ∧ COLT ≥ 7.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ ST OC ≥ 0.5 ∨ ECM L ≥ 2.5 ∧ KDD ≥
2.5 ∧ U AI ≥ 2.5 ←→ SasoDzeroski < 1.5 ∧ RobertE.Schapire < 0.5 ∧ P eterL.Bartlett ≥ 1.5 ∨ SasoDzeroski <
1.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5 ∨ SasoDzeroski ≥ 1.5 ∧ T omiSilander ≥ 3.5 ∧
P eterGrunwald ≥ 1
P ODS ≥ 5.5 ∨ P ODS < 5.5 ∧ ICDT ≥ 2.5 ∧ SIGM ODConf erence ≥ 11 ←→ LeonidLibkin ≥ 0.5 ∨ LeonidLibkin <
0.5 ∧ M arcGyssens ≥ 0.5 ∧ ChrisClif ton ≥ 0.5
J      E1,1  p-val.
0.294  5     0.000
0.258  16    0.000
0.228  18    0.000
0.200  1     0.004
0.192  14    0.000
0.175  11    0.000
0.167  1     0.003
0.167  4     0.000
0.166  28    0.000
0.155  43    0.000
0.139  5     0.000
0.132  48    0.000
Redescription
ICM L ≥ 8.5 ∨ ICM L < 8.5 ∧ KDD ≥ 8.5 ∧ ICDM ≥ 7.5 ←→ T homasHof mann ≥ 0.5 ∨ T homasHof mann <
0.5 ∧ KeW ang ≥ 5.5 ∧ W eiW ang ≥ 6
COLT ≥ 3.5 ∨ COLT < 3.5 ∧ ICM L < 3.5 ∧ U AI ≥ 14.5 ∨ COLT < 3.5 ∧ ICM L ≥ 3.5 ∧ F OCS ≥ 2.5 ←→
N aderH.Bshouty ≥ 1.5 ∨ N aderH.Bshouty < 1.5 ∧ DavidA.M cAllester < 0.5 ∧ DavidM axwellChickering ≥
0.5 ∨ N aderH.Bshouty < 1.5 ∧ DavidA.M cAllester ≥ 0.5 ∧ SallyA.Goldman ≥ 0.5
SIGM ODConf erence ≥ 2.5 ∧ SDM ≥ 1.5 ∨ SIGM ODConf erence < 2.5 ∧ V LDB ≥ 4.5 ∧ P ODS ≥ 1.5 ←→
HectorGarcia − M olina ≥ 0.5 ∧ P hilipS.Y u ≥ 2.5 ∨ HectorGarcia − M olina < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧
CatrielBeeri ≥ 0.5
ST ACS ≥ 12.5 ←→ StephenA.F enner ≥ 3
P KDD ≥ 4.5 ∨ P KDD < 4.5 ∧ KDD < 14.5 ∧ SDM ≥ 1.5 ∨ P KDD < 4.5 ∧ KDD ≥ 14.5 ∧ ICDM ≥ 4.5 ←→
ArnoJ.Knobbe ≥ 0.5∨ArnoJ.Knobbe < 0.5∧KeW ang < 5.5∧P hilipS.Y u ≥ 4.5∨ArnoJ.Knobbe < 0.5∧KeW ang ≥
5.5 ∧ DmitryP avlov < 1
ECM L ≥ 0.5∧P KDD ≥ 0.5∨ECM L < 0.5∧ICM L ≥ 0.5∧U AI ≥ 3.5 ←→ SasoDzeroski ≥ 0.5∧Stef anKramer ≥
0.5 ∨ SasoDzeroski < 0.5 ∧ SatinderP.Singh ≥ 0.5 ∧ CraigBoutilier ≥ 0.5
ICDM ≥ 10 ∨ ICDM < 10 ∧ SDM ≥ 10.5 ←→ ShengM a ≥ 1.5 ∨ ShengM a < 1.5 ∧ Kun − LungW u ≥ 24
KDD ≥ 7.5 ∨ KDD < 7.5 ∧ ICDE ≥ 24.5 ←→ T omF awcett ≥ 0.5 ∨ T omF awcett < 0.5 ∧ KeW ang ≥ 6.5
SIGM ODConf erence ≥ 4.5 ∨ SIGM ODConf erence < 4.5 ∧ V LDB ≥ 8.5 ∧ P ODS ≥ 2.5 ←→ H.V.Jagadish ≥
0.5 ∨ H.V.Jagadish < 0.5 ∧ S.Seshadri ≥ 0.5 ∧ Y uriBreitbart ≥ 0.5
P ODS ≥ 5.5 ∧ F OCS ≥ 13 ∨ P ODS < 5.5 ∧ ICDT < 2.5 ∧ SIGM ODConf erence ≥ 3.5 ∨ P ODS < 5.5 ∧ ICDT ≥
2.5 ∧ V LDB ≥ 9.5 ←→ M osheY.V ardi ≥ 11.5 ∧ M.R.Garey ≥ 0.5 ∨ M osheY.V ardi < 11.5 ∧ EmmanuelW aller <
0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ M osheY.V ardi < 11.5 ∧ EmmanuelW aller ≥ 0.5 ∧ SophieCluet ≥ 4.5
SIGM ODConf erence < 4.5 ∧ V LDB ≥ 5.5 ∧ ICDT ≥ 3.5 ∨ SIGM ODConf erence ≥ 4.5 ∧ P ODS ≥ 0.5 ∧
ICDE ≥ 5.5 ←→ H.V.Jagadish < 0.5 ∧ DiveshSrivastava ≥ 0.5 ∧ AlonY.Levy ≥ 2.5 ∨ H.V.Jagadish ≥ 0.5 ∧
RaghuRamakrishnan ≥ 2.5 ∧ SurajitChaudhuri ≥ 2.5
SODA ≥ 2.5 ∨ SODA < 2.5 ∧ ST OC ≥ 1.5 ∧ F OCS ≥ 6.5 ←→ KurtM ehlhorn ≥ 0.5 ∨ KurtM ehlhorn <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M oniN aor ≥ 0.5
J      E1,1  p-val.
0.130  6     0.000
0.122  5     0.000
0.102  5     0.000
0.065  7     0.000
0.060  4     0.000
0.056  1     0.020
Redescription
SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM < 4.5 ∧ ICDT ≥ 6.5 ∨ SDM < 1.5 ∧ ICDM ≥ 4.5 ∧ ICDE ≥ 20 ←→ SrujanaMerugu ≥ 1.5 ∨ SrujanaMerugu < 1.5 ∧ JianyongWang < 0.5 ∧ SophieCluet ≥ 8.5 ∨ SrujanaMerugu < 1.5 ∧ JianyongWang ≥ 0.5 ∧ WeiWang ≥ 6
WWW ≥ 2.5 ∨ WWW < 2.5 ∧ EDBT ≥ 5.5 ∧ VLDB ≥ 26.5 ←→ LyleH.Ungar ≥ 1.5 ∨ LyleH.Ungar < 1.5 ∧ NarainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 1.5
WWW ≥ 1.5 ∨ WWW < 1.5 ∧ ICDE ≥ 21.5 ∧ PKDD ≥ 1.5 ←→ DanielF.Lieuwen ≥ 5 ∨ DanielF.Lieuwen < 5 ∧ CharuC.Aggarwal ≥ 3.5 ∧ WeiWang ≥ 4
EDBT ≥ 1.5 ∨ EDBT < 1.5 ∧ VLDB ≥ 11.5 ∧ FOCS ≥ 0.5 ←→ ElenaBaralis ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish ≥ 2.5 ∧ YossiMatias ≥ 1.5
ICDE ≥ 6.5 ∨ ICDE < 6.5 ∧ VLDB ≥ 12.5 ∧ PKDD ≥ 3 ←→ HenryF.Korth ≥ 7.5 ∨ HenryF.Korth < 7.5 ∧ LaksV.S.Lakshmanan ≥ 9 ∧ ShojiroNishio ≥ 1.5
FOCS ≥ 27 ←→ FriedhelmMeyeraufderHeide ≥ 1.5
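Tables B.5 and B.6 report redescriptions obtained after binarizing the paper-count attributes with k-means (5 clusters) and hierarchical (5 clusters) clustering, respectively. The following minimal Python sketch is illustrative only: the helper name kmeans_buckets is ours, and the choice of cut points halfway between neighbouring cluster centres is an assumption; it shows one way a one-dimensional k-means routine can turn a count attribute into the five CONF[a-b] buckets mentioned in the caption of Table B.5. A hierarchical variant could derive the cut points from agglomerative clustering of the same values instead.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_buckets(counts, k=5):
        # counts: number of papers each author has in a given conference (one value per author).
        x = np.asarray(counts, dtype=float).reshape(-1, 1)
        centres = np.sort(KMeans(n_clusters=k, n_init=10, random_state=0)
                          .fit(x).cluster_centers_.ravel())
        # cut points halfway between neighbouring cluster centres (assumed convention)
        cuts = (centres[:-1] + centres[1:]) / 2.0
        # bucket index 0..k-1 per author; each bucket can then be encoded as one
        # Boolean attribute CONF[a-b]
        return np.digitize(np.asarray(counts, dtype=float), cuts)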
Table B.6: Redescriptions mined by Algorithm 2 from the DBLP data set (with hierarchical (5 clusters) binarization routine; IG-impurity measure). LHS is the left-hand-side part of the redescription; RHS is the right-hand-side part of the redescription; J - Jaccard similarity; E1,1 - support; p-val. - p-value.
J      E1,1  p-val.
1      1     0
0.965  2239  0.022
0.959  745   0
0.959  1557  0
0.957  707   0
0.949  2197  0.015
0.949  636   0
0.919  711   0
0.904  2094  0.091
0.879  2034  0.091
Redescription
ICDM ≥ 15.5 ←→ Kun − LungW u ≥ 24
U AI < 2.5 ∧ ICDE < 0.5 ∨ U AI ≥ 2.5 ∧ F OCS < 1.5 ∨ ICDE ≥ 0.5 ∧ SIGM ODConf erence < 8.5 ∨ F OCS ≥
1.5 ∧ ST OC < 8.5 ←→ T omiSilander < 0.5 ∧ SurajitChaudhuri < 0.5 ∨ T omiSilander ≥ 0.5 ∧ AviW igderson <
0.5 ∨ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ AviW igderson ≥ 0.5 ∧ SridharRajagopalan < 1.5
SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ SODA < 1.5 ∨ SIGM ODConf erence ≥ 1.5 ∧ ICDE ≥ 3.5 ∧ KDD ≥
3 ←→ H.V.Jagadish < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ M osesCharikar < 1.5 ∨ H.V.Jagadish ≥ 0.5 ∧ JiaweiHan ≥
2.5 ∧ BengChinOoi ≥ 0.5
ST OC ≥ 1.5 ∧ F OCS < 0.5 ∨ ST OC < 1.5 ∧ SODA < 0.5 ∨ F OCS ≥ 0.5 ∧ SIGM ODConf erence ≥ 0.5 ∨ SODA ≥
0.5 ∧ P ODS ≥ 1.5 ←→ F rankT homsonLeighton ≥ 0.5 ∧ AviW igderson < 0.5 ∨ F rankT homsonLeighton < 0.5 ∧
SergeA.P lotkin < 0.5 ∨ AviW igderson ≥ 0.5 ∧ CatrielBeeri ≥ 0.5 ∨ SergeA.P lotkin ≥ 0.5 ∧ M ayurDatar ≥ 0.5
ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 4.5 ∧ SIGM ODConf erence ≥
1 ←→ F riedhelmM eyerauf derHeide < 1.5 ∧ AviW igderson ≥ 0.5 ∧ AlbertoO.M endelzon < 0.5 ∨
F riedhelmM eyerauf derHeide ≥ 1.5 ∧ SantoshV empala ≥ 0.5 ∧ M ayurDatar ≥ 2
W W W ≥ 1.5 ∨ W W W < 1.5 ∧ F OCS < 0.5 ∨ F OCS ≥ 0.5 ∧ ST OC < 8.5 ←→ SridharRajagopalan ≥ 0.5 ∨
SridharRajagopalan < 0.5 ∧ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ M adhuSudan < 0.5
ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 1.5 ∧ SODA ≥ 4.5 ∧ COLT ≥ 2.5 ←→ M ichaelE.Saks < 0.5 ∧
AviW igderson ≥ 0.5 ∧ AlbertoO.M endelzon < 0.5 ∨ M ichaelE.Saks ≥ 0.5 ∧ AmosF iat ≥ 0.5 ∧ N aderH.Bshouty ≥
0.5
ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ SIGM ODConf erence < 0.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE <
0.5 ←→ T omasF eder < 0.5 ∧ AviW igderson ≥ 0.5 ∧ CatrielBeeri < 0.5 ∨ T omasF eder ≥ 0.5 ∧ AmosF iat ≥
0.5 ∧ SergeA.P lotkin < 1.5
W W W ≥ 0.5 ∨ W W W < 0.5 ∧ F OCS < 0.5 ∨ F OCS ≥ 0.5 ∧ ST OC < 8.5 ←→ AmelieM arian ≥ 1.5 ∨
AmelieM arian < 1.5 ∧ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ M oniN aor < 0.5
COLT ≥ 3.5 ∨ COLT < 3.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence < 7.5 ∨ COLT < 3.5 ∧ V LDB < 0.5 ∧ ICDE <
0.5 ←→ N icoloCesa − Bianchi ≥ 0.5 ∨ N icoloCesa − Bianchi < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava <
0.5 ∨ N icoloCesa − Bianchi < 0.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel < 7.5
J      E1,1  p-val.
0.878  361   0
0.872  2014  0.052
0.864  1996  0.101
0.861  1992  0.157
0.842  717   0
0.839  1929  0.026
0.822  1900  0.095
0.815  1876  0.054
0.789  830   0
0.734  80    0
0.733  641   0
0.698  1586  0.02
0.676  606   0
Redescription
ST OC ≥ 2.5 ∧ ICDE < 1.5 ∨ ST OC < 2.5 ∧ F OCS ≥ 1.5 ∧ SODA ≥ 9.5 ∨ ICDE ≥ 1.5 ∧ EDBT < 0.5 ←→
M oniN aor ≥ 0.5 ∧ N ickKoudas < 0.5 ∨ M oniN aor < 0.5 ∧ M adhuSudan ≥ 0.5 ∧ SudiptoGuha ≥ 0.5
ST OC ≥ 1.5 ∧ F OCS < 5.5 ∨ ST OC < 1.5 ∧ SODA < 0.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧
OdedGoldreich < 0.5
COLT ≥ 1.5 ∨ COLT < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence < 8.5 ∨ COLT < 1.5 ∧ ICDE < 0.5 ∧ V LDB <
0.5 ←→ LeonardP itt ≥ 0.5 ∨ LeonardP itt < 0.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ LeonardP itt <
0.5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey < 0.5
COLT ≥ 1.5 ∨ COLT < 1.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence < 7.5 ∨ COLT < 1.5 ∧ V LDB < 0.5 ∧ ICDE <
0.5 ←→ StephenA.F enner ≥ 3 ∨ StephenA.F enner < 3 ∧ M ichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava < 0.5 ∨
StephenA.F enner < 3 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel < 7.5
V LDB ≥ 1.5 ∨ V LDB < 1.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ ICDM < 0.5 ←→ RakeshAgrawal ≥ 0.5 ∨
RakeshAgrawal < 0.5 ∧ HamidP irahesh ≥ 0.5 ∧ JiaweiHan < 0.5
ST OC < 1.5 ∨ ST OC ≥ 1.5 ∧ F OCS < 2.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ OdedGoldreich < 0.5
KDD < 0.5 ←→ JiaweiHan < 0.5
P ODS < 0.5 ←→ AbrahamSilberschatz < 0.5
SIGM ODConf erence ≥ 3.5∨SIGM ODConf erence < 3.5∧V LDB < 0.5∧ICDE ≥ 0.5∨SIGM ODConf erence <
3.5 ∧ V LDB ≥ 0.5 ∧ P KDD < 0.5 ←→ F lipKorn ≥ 0.5 ∨ F lipKorn < 0.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans −
P eterKriegel ≥ 7.5 ∨ F lipKorn < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ JinyanLi < 1
ICM L ≥ 3.5 ∧ ICDE < 0.5 ∨ ICM L < 3.5 ∧ ECM L ≥ 1 ∧ P KDD ≥ 1.5 ∨ ICDE ≥ 0.5 ∧ ICDM < 0.5 ←→
DoinaP recup ≥ 1.5 ∧ JiaweiHan < 0.5 ∨ DoinaP recup < 1.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 ∨
JiaweiHan ≥ 0.5 ∧ W ei − Y ingM a < 0.5
ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS < 0.5 ∧ SODA ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ P ODS < 0.5 ←→
AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧ F rankT homsonLeighton < 0.5 ∧ P iotrIndyk ≥ 0.5 ∨ AviW igderson <
0.5 ∧ F rankT homsonLeighton ≥ 0.5 ∧ P hokionG.Kolaitis < 0.5
ST OC ≥ 1.5 ∧ F OCS < 0.5 ∨ ST OC < 1.5 ∧ SODA < 1.5 ∨ SODA ≥ 1.5 ∧ V LDB < 1.5 ←→ M ichaelE.Saks ≥
0.5 ∧ AviW igderson < 0.5 ∨ M ichaelE.Saks < 0.5 ∧ SudiptoGuha < 0.5
V LDB ≥ 1.5 ∨ V LDB < 1.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ SDM < 0.5 ←→ W olf gangKaf er ≥ 0.5 ∨
W olf gangKaf er < 0.5 ∧ HamidP irahesh ≥ 0.5 ∧ P hilipS.Y u < 3.5
J      E1,1  p-val.
0.674  58    0
0.649  1488  0.084
0.637  1399  0
0.621  195   0
0.571  4     0
0.564  211   0
0.558  373   0
0.417  5     0
0.411  92    0
0.342  13    0
0.333  31    0
0.313  5     0
Redescription
ECM L ≥ 2.5 ∧ V LDB < 0.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ∨ V LDB ≥ 0.5 ∧ ICDT < 3 ←→
Stef anKramer ≥ 1.5 ∧ JianP ei < 0.5 ∨ Stef anKramer < 1.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥
0.5 ∨ JianP ei ≥ 0.5 ∧ M ichaelBenedikt < 0.5
ICDE < 0.5 ←→ SurajitChaudhuri < 0.5
F OCS < 10.5 ∧ ST OC < 6.5 ∧ SIGM ODConf erence < 0.5 ∨ F OCS < 10.5 ∧ ST OC ≥ 6.5 ∧ ICDT < 0.5 ∨
F OCS ≥ 10.5 ∧ SODA ≥ 2.5 ∧ V LDB ≥ 3 ←→ RichardM.Karp < 1.5 ∧ AviW igderson < 0.5 ∧ HamidP irahesh <
0.5 ∨ RichardM.Karp < 1.5 ∧ AviW igderson ≥ 0.5 ∧ HectorGarcia − M olina < 0.5 ∨ RichardM.Karp ≥ 1.5 ∧
S.M uthukrishnan ≥ 0.5 ∧ JosephN aor ≥ 3.5
SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ←→ P hilipS.Y u ≥ 4.5 ∨ P hilipS.Y u < 4.5 ∧ V ipinKumar ≥
0.5 ∧ SunilP rabhakar < 1.5
ICDM ≥ 8.5 ∧ SIGM ODConf erence ≥ 3.5 ∧ ICDT < 1.5 ←→ JiongY ang ≥ 2.5 ∧ JianP ei ≥ 2.5 ∧
ChristianBohm < 11
SODA ≥ 0.5 ∧ ST ACS < 2.5 ∨ SODA < 0.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ N icoleImmorlica ≥ 0.5 ∧
HarryBuhrman < 0.5 ∨ N icoleImmorlica < 0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5
SODA ≥ 5.5 ∨ SODA < 5.5 ∧ F OCS < 2.5 ∧ ST OC ≥ 1.5 ∨ SODA < 5.5 ∧ F OCS ≥ 2.5 ∧ P ODS <
1.5 ←→ M osesCharikar ≥ 0.5 ∨ M osesCharikar < 0.5 ∧ AviW igderson < 0.5 ∧ F rankT homsonLeighton ≥
0.5 ∨ M osesCharikar < 0.5 ∧ AviW igderson ≥ 0.5 ∧ P hokionG.Kolaitis < 0.5
U AI ≥ 17 ←→ DavidM axwellChickering ≥ 0.5
SDM ≥ 0.5 ∧ SIGM ODConf erence < 4 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ∨ SIGM ODConf erence ≥
4 ∧ V LDB < 2.5 ←→ ArindamBanerjee ≥ 2 ∧ M artinL.Kersten < 0.5 ∨ ArindamBanerjee < 2 ∧ V ipinKumar ≥
0.5 ∧ P hilipS.Y u ≥ 0.5 ∨ M artinL.Kersten ≥ 0.5 ∧ JiaweiHan ≥ 2.5
P ODS ≥ 2.5 ∨ P ODS < 2.5 ∧ ICDT ≥ 0.5 ∧ ST ACS ≥ 0.5 ←→ CatrielBeeri ≥ 0.5 ∨ CatrielBeeri < 0.5 ∧
LeonidLibkin ≥ 0.5 ∧ T homasSchwentick ≥ 0.5
ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS < 0.5 ∧ ICDT ≥ 3.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA ≥ 12.5 ←→
AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon < 0.5 ∧ EmmanuelW aller ≥ 0.5 ∨ AviW igderson <
0.5 ∧ N ogaAlon ≥ 0.5 ∧ M artinF arach ≥ 2.5
ICDT ≥ 4.5 ←→ T ovaM ilo ≥ 2.5
J      E1,1  p-val.
0.253  19    0
0.245  39    0
0.224  24    0
0.22   29    0
0.212  25    0
0.203  41    0
0.2    26    0
0.2    2     0
0.194  7     0
0.192  37    0
Redescription
SODA < 5.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 10 ∨ SODA ≥ 5.5 ∧ V LDB ≥ 2 ∧ W W W < 4 ←→ M osesCharikar <
0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5 ∨ M osesCharikar ≥ 0.5 ∧ SureshV enkatasubramanian ≥ 1.5 ∧
T.S.Jayram < 4
SODA ≥ 5.5 ∨ SODA < 5.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 8.5 ←→ M osesCharikar ≥ 0.5 ∨ M osesCharikar <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M oniN aor ≥ 0.5
ECM L ≥ 2.5 ∨ ECM L < 2.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→
Stef anKramer ≥ 0.5 ∨ Stef anKramer < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ Stef anKramer <
0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
SIGM ODConf erence ≥ 2.5 ∨ SIGM ODConf erence < 2.5 ∧ V LDB < 0.5 ∧ ST ACS ≥ 8.5 ∨
SIGM ODConf erence < 2.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 8.5 ←→ DiveshSrivastava ≥ 0.5 ∨ DiveshSrivastava <
0.5 ∧ DavidJ.DeW itt < 0.5 ∧ StephenA.F enner ≥ 0.5 ∨ DiveshSrivastava < 0.5 ∧ DavidJ.DeW itt ≥ 0.5 ∧
JosephM.Hellerstein ≥ 0.5
U AI ≥ 0.5 ∨ U AI < 0.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ U AI < 0.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→ JudeaP earl ≥
0.5 ∨ JudeaP earl < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ JudeaP earl < 0.5 ∧ RobertE.Schapire ≥
0.5 ∧ M anf redK.W armuth ≥ 0.5
ST ACS ≥ 0.5 ∨ ST ACS < 0.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 8.5 ←→ HarryBuhrman ≥ 0.5 ∨ HarryBuhrman <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M oniN aor ≥ 0.5
P ODS ≥ 3.5 ∨ P ODS < 3.5 ∧ ICDT < 0.5 ∧ W W W ≥ 3.5 ∨ P ODS < 3.5 ∧ ICDT ≥ 0.5 ∧ ST OC ≥ 13 ←→
LeonidLibkin ≥ 0.5 ∨ LeonidLibkin < 0.5 ∧ Y ehoshuaSagiv < 0.5 ∧ SridharRajagopalan ≥ 0.5 ∨ LeonidLibkin <
0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∧ M ihalisY annakakis ≥ 1.5
KDD ≥ 8.5 ∧ W W W ≥ 0.5 ∧ V LDB ≥ 2.5 ←→ KeW ang ≥ 1.5 ∧ Kun − LungW u ≥ 1 ∧ JaySethuraman ≥ 0.5
ICDM ≥ 3.5 ∧ ICDE ≥ 4.5 ∨ ICDM < 3.5 ∧ SDM ≥ 0.5 ∧ SIGM ODConf erence ≥ 2.5 ←→ ShengM a ≥
0.5 ∧ RaymondT.N g ≥ 0.5 ∨ ShengM a < 0.5 ∧ P hilipS.Y u ≥ 2.5 ∧ Jef f reyXuY u ≥ 1.5
W W W ≥ 2.5∨W W W < 2.5∧EDBT < 5.5∧P ODS ≥ 2.5∨W W W < 2.5∧EDBT ≥ 5.5∧SIGM ODConf erence <
14.5 ←→ AndrewT omkins ≥ 0.5 ∨ AndrewT omkins < 0.5 ∧ RadekV ingralek < 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∨
AndrewT omkins < 0.5 ∧ RadekV ingralek ≥ 0.5 ∧ AshishGupta < 0.5
J      E1,1  p-val.
0.185  24    0
0.165  44    0
0.162  23    0
0.154  49    0
0.143  2     0
0.143  11    0
0.138  30    0
0.132  23    0
0.125  57    0
0.115  7     0
0.114  39    0
0.114  39    0
Redescription
ECM L ≥ 2.5 ∨ ECM L < 2.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→
P eterGrunwald ≥ 1.5 ∨ P eterGrunwald < 1.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ P eterGrunwald <
1.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
SODA ≥ 0.5 ∨ SODA < 0.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ ErikD.Demaine ≥ 0.5 ∨ ErikD.Demaine <
0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5
ECM L ≥ 1.5 ∨ ECM L < 1.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 1.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→
JuhoRousu ≥ 0.5 ∨ JuhoRousu < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ JuhoRousu < 0.5 ∧
RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5
EDBT ≥ 2.5∨EDBT < 2.5∧ICDE < 12.5∧SIGM ODConf erence ≥ 3.5∨EDBT < 2.5∧ICDE ≥ 12.5∧P ODS ≥
3.5 ←→ Y uriBreitbart ≥ 0.5 ∨ Y uriBreitbart < 0.5 ∧ F lipKorn < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ Y uriBreitbart <
0.5 ∧ F lipKorn ≥ 0.5 ∧ RakeshAgrawal ≥ 1.5
P KDD ≥ 4.5 ←→ ArnoJ.Knobbe ≥ 0.5
ICM L ≥ 4.5 ∨ ICM L < 4.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ T omM.M itchell ≥ 0.5 ∨ T omM.M itchell <
0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 1.5
ICDE ≥ 0.5 ∨ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence ≥ 7.5 ←→ SurajitChaudhuri ≥ 0.5 ∨
SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ RaymondT.N g ≥ 0.5
P ODS ≥ 0.5 ∨ P ODS < 0.5 ∧ ICDT ≥ 0.5 ∧ V LDB ≥ 15.5 ←→ AbrahamSilberschatz ≥ 0.5 ∨
AbrahamSilberschatz < 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∧ JosephM.Hellerstein ≥ 0.5
ICDT ≥ 0.5 ∨ ICDT < 0.5 ∧ P ODS < 0.5 ∧ SIGM ODConf erence ≥ 1.5 ∨ ICDT < 0.5 ∧ P ODS ≥ 0.5 ∧ V LDB ≥
9.5 ←→ DanSuciu ≥ 0.5 ∨ DanSuciu < 0.5 ∧ AbrahamSilberschatz < 0.5 ∧ HamidP irahesh ≥ 0.5 ∨ DanSuciu <
0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∧ H.V.Jagadish ≥ 2.5
ST ACS ≥ 4.5 ←→ RiccardoSilvestri ≥ 0.5
ICDM ≥ 2.5 ∨ ICDM < 2.5 ∧ SDM < 3.5 ∧ ICDE ≥ 3.5 ∨ ICDM < 2.5 ∧ SDM ≥ 3.5 ∧ V LDB ≥
11 ←→ QiangY ang ≥ 1.5 ∨ QiangY ang < 1.5 ∧ JianyongW ang < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∨ QiangY ang <
1.5 ∧ JianyongW ang ≥ 0.5 ∧ W eiW ang ≥ 6
W W W ≥ 1.5 ∨ W W W < 1.5 ∧ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∨ W W W < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥
8.5 ←→ DanielF.Lieuwen ≥ 5 ∨ DanielF.Lieuwen < 5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨
DanielF.Lieuwen < 5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5
J      E1,1  p-val.
0.108  70    0
0.108  93    0
0.102  33    0
0.101  42    0
0.095  20    0
0.089  40    0
0.085  17    0
0.084  55    0
0.084  12    0
0.081  37    0
0.081  9     0
Redescription
ICDT ≥ 1.5 ∨ ICDT < 1.5 ∧ P ODS < 3.5 ∧ V LDB ≥ 1.5 ∨ ICDT < 1.5 ∧ P ODS ≥ 3.5 ∧ SIGM ODConf erence ≥
2.5 ←→ GeorgGottlob ≥ 0.5 ∨ GeorgGottlob < 0.5 ∧ CatrielBeeri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ GeorgGottlob <
0.5 ∧ CatrielBeeri ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5
ICDM ≥ 0.5 ∧ ICM L < 2.5 ∨ ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥
0.5 ←→ V ipinKumar ≥ 0.5 ∧ DmitryP avlov < 2.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan < 0.5 ∧ GioW iederhold ≥
0.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5
EDBT ≥ 0.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ←→ RakeshAgrawal ≥ 1.5 ∨
RakeshAgrawal < 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RaghuRamakrishnan ≥ 0.5
SIGM ODConf erence ≥ 1.5 ∨ SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 4.5 ←→ HamidP irahesh ≥
0.5 ∨ HamidP irahesh < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ RaymondT.N g ≥ 0.5
SDM ≥ 0.5 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 1.5 ←→ HuanLiu ≥ 1.5 ∨ HuanLiu < 1.5 ∧ W eiF an ≥
0.5 ∧ HaixunW ang ≥ 0.5
ICDE ≥ 1.5 ∨ ICDE < 1.5 ∧ V LDB ≥ 2.5 ∧ SIGM ODConf erence ≥ 11.5 ←→ N ickKoudas ≥ 0.5 ∨ N ickKoudas <
0.5 ∧ RakeshAgrawal ≥ 0.5 ∧ Jef f reyF.N aughton ≥ 0.5
P KDD ≥ 0.5 ∨ P KDD < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 2.5 ←→ XingquanZhu ≥ 3 ∨ XingquanZhu < 3 ∧
JiaweiHan ≥ 1.5 ∧ RaymondT.N g ≥ 0.5
EDBT ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE < 2.5 ∧ V LDB ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE ≥ 2.5 ∧ SIGM ODConf erence ≥
1.5 ←→ ElenaBaralis ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish < 0.5 ∧ BruceG.Lindsay ≥ 0.5 ∨ ElenaBaralis <
0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ HectorGarcia − M olina ≥ 0.5
KDD ≥ 2.5 ∨ KDD < 2.5 ∧ SDM ≥ 0.5 ∧ ICDE ≥ 4.5 ←→ XiongW ang ≥ 2 ∨ XiongW ang < 2 ∧ HaixunW ang ≥
0.5 ∧ W eiW ang ≥ 3
SIGM ODConf erence ≥ 5.5∨SIGM ODConf erence < 5.5∧V LDB < 4.5∧ICDE ≥ 1.5∨SIGM ODConf erence <
5.5 ∧ V LDB ≥ 4.5 ∧ P ODS ≥ 7.5 ←→ RaymondT.N g ≥ 0.5 ∨ RaymondT.N g < 0.5 ∧ DiveshSrivastava <
0.5 ∧ N ickKoudas ≥ 0.5 ∨ RaymondT.N g < 0.5 ∧ DiveshSrivastava ≥ 0.5 ∧ InderpalSinghM umick ≥ 0.5
ICM L ≥ 1.5 ∨ ICM L < 1.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ AndrewW.M oore ≥ 0.5 ∨ AndrewW.M oore <
0.5 ∧ N adaLavrac ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5
J      E1,1  p-val.
0.081  66    0
0.079  70    0
0.068  10    0
0.068  48    0
0.068  20    0
0.065  20    0
0.062  52    0
0.058  33    0
Redescription
ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ∨ ICDM ≥ 0.5 ∧ SDM ≥
0.5 ∧ EDBT ≥ 1.5 ←→ V ipinKumar < 0.5 ∧ JiaweiHan < 1.5 ∧ GioW iederhold ≥ 0.5 ∨ V ipinKumar < 0.5 ∧
JiaweiHan ≥ 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∨ V ipinKumar ≥ 0.5 ∧ P hilipS.Y u ≥ 0.5 ∧ XueminLin ≥ 0.5
P KDD ≥ 2.5 ∨ P KDD < 2.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ P KDD < 2.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ←→
StephaneLallich ≥ 0.5 ∨ StephaneLallich < 0.5 ∧ JiaweiHan < 0.5 ∧ GioW iederhold ≥ 0.5 ∨ StephaneLallich <
0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5
ICM L ≥ 0.5 ∨ ICM L < 0.5 ∧ ECM L ≥ 0.5 ∧ U AI ≥ 3.5 ←→ P eterA.F lach ≥ 0.5 ∨ P eterA.F lach < 0.5 ∧
StephenM uggleton ≥ 0.5 ∧ N irF riedman ≥ 0.5
V LDB ≥ 1.5 ∨ V LDB < 1.5 ∧ SIGM ODConf erence < 0.5 ∧ ICDT ≥ 0.5 ∨ V LDB < 1.5 ∧ SIGM ODConf erence ≥
0.5 ∧ ICDE ≥ 0.5 ←→ RakeshAgrawal ≥ 0.5 ∨ RakeshAgrawal < 0.5 ∧ HectorGarcia − M olina < 0.5 ∧
DilysT homas ≥ 1.5 ∨ RakeshAgrawal < 0.5 ∧ HectorGarcia − M olina ≥ 0.5 ∧ JiaweiHan ≥ 0.5
KDD ≥ 0.5 ∨ KDD < 0.5 ∧ ICDM ≥ 0.5 ∧ V LDB ≥ 0.5 ←→ Y asuhikoM orimoto ≥ 0.5 ∨ Y asuhikoM orimoto <
0.5 ∧ JiaweiHan ≥ 0.5 ∧ JianP ei ≥ 0.5
ICDM ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 0.5 ←→ Geof f reyI.W ebb ≥ 0.5 ∨ Geof f reyI.W ebb <
0.5 ∧ JiaweiHan ≥ 0.5 ∧ P hilipS.Y u ≥ 3.5
V LDB ≥ 5.5 ∨ V LDB < 5.5 ∧ SIGM ODConf erence < 7.5 ∧ ICDE ≥ 0.5 ∨ V LDB < 5.5 ∧ SIGM ODConf erence ≥
7.5 ∧ P ODS ≥ 3.5 ←→ P hilipA.Bernstein ≥ 0.5 ∨ P hilipA.Bernstein < 0.5 ∧ RakeshAgrawal < 0.5 ∧
GioW iederhold ≥ 0.5 ∨ P hilipA.Bernstein < 0.5 ∧ RakeshAgrawal ≥ 0.5 ∧ F lipKorn ≥ 2
ST OC < 1.5 ∧ F OCS ≥ 0.5 ∨ ST OC ≥ 1.5 ∧ SODA ≥ 0.5 ∧ W W W ≥ 2.5 ←→ AviW igderson < 0.5 ∧ Y ossiAzar ≥
0.5 ∨ AviW igderson ≥ 0.5 ∧ M osesCharikar ≥ 0.5