Universität des Saarlandes
Max-Planck-Institut für Informatik

Redescription Mining Over non-Binary Data Sets Using Decision Trees

Masterarbeit im Fach Informatik
Master's Thesis in Computer Science

von / by Tetiana Zinchenko

angefertigt unter der Leitung von / supervised by Dr. Pauli Miettinen
begutachtet von / reviewers Dr. Pauli Miettinen, Prof. Dr. Gerhard Weikum

Saarbrücken, November 2014

Eidesstattliche Erklärung
Ich erkläre hiermit an Eides Statt, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

Statement in Lieu of an Oath
I hereby confirm that I have written this thesis on my own and that I have not used any other media or materials than the ones referred to in this thesis.

Einverständniserklärung
Ich bin damit einverstanden, dass meine (bestandene) Arbeit in beiden Versionen in die Bibliothek der Informatik aufgenommen und damit veröffentlicht wird.

Declaration of Consent
I agree to make both versions of my thesis (with a passing grade) accessible to the public by having them added to the library of the Computer Science Department.

Saarbrücken, November 2014
Tetiana Zinchenko

Acknowledgements

First of all, I would like to thank Dr. Pauli Miettinen for the opportunity to write my Master's thesis under his supervision and for his support and encouragement during the work on this thesis. I would like to thank the International Max Planck Research School for Computer Science for giving me the opportunity to study at Saarland University and for their constant support throughout my studies. Special thanks go to my husband for being the most supportive and inspiring person in my life; he was the first trigger for me to start and to finish this degree.

Abstract

Scientific data mining aims to extract useful information from huge data sets with the help of computational effort. Recently, scientists have encountered an overload of data describing domain entities from different sides. Many of these data sets provide alternative means to organize information, and every alternative data set offers a different perspective onto the studied problem. Redescription mining is a tool whose goal is finding various descriptions of the same objects, i.e. giving information on entities from different perspectives. It is a tool for knowledge discovery which helps to reason uniformly across data of diverse origin and integrates numerous forms of characterizing data sets. Redescription mining has important applications, mainly in biology (e.g. to find bio-niches for species), bioinformatics (e.g. dependencies in genes can assist in the analysis of diseases), and sociology (e.g. exploration of statistical and political data). We initiate redescription mining with a data set consisting of two tables with Boolean and/or real-valued attributes. In redescription mining we look for queries which describe nearly the same objects in both given tables. Among existing redescription mining algorithms there are approaches which exploit alternating decision tree induction; so far, only Boolean variables were involved there. In this thesis we extend these approaches to non-Boolean data and develop two methods which allow redescription mining over non-binary data sets.

Contents

1 Introduction
  1.1 Outline of Document
2 Preliminaries
  2.1 The Setting for Redescription Mining
  2.2 Query Languages
  2.3 Propositional Queries, Predicates and Statements
    2.3.1 Predicates
    2.3.2 Statements
  2.4 Exploration Strategies
    2.4.1 Mining and Pairing
    2.4.2 Greedy Atomic Updates
    2.4.3 Alternating Scheme
3 Related Research
  3.1 Rule Discovery
  3.2 Decision Trees and Impurity Measures
  3.3 Redescription Mining Algorithms
4 Contributions
  4.1 Redescription Mining Over non-Binary Data Sets
  4.2 Algorithm 1
  4.3 Algorithm 2
  4.4 Stopping Criterion
  4.5 Extracting Redescriptions
  4.6 Extending to a Fully non-Boolean Setting
    4.6.1 Data Discretization
  4.7 Quality of Redescriptions
    4.7.1 Support and Accuracy
    4.7.2 Assessing Significance
5 Experiments with Algorithms for Redescription Mining
  5.1 Finding Planted Redescriptions
  5.2 The Real-World Data Sets
  5.3 Experiments With Algorithms on the Bio-climatic Data Set
    5.3.1 Discussion
  5.4 Experiments With Algorithms on the Conference Data Set
    5.4.1 Discussion
  5.5 Experiments Against the ReReMi Algorithm
6 Conclusions and Future Work
Bibliography
A Redescription Sets from Experiments with the Bio Data Set
B Redescription Sets from Experiments with the DBLP Data Set
Chapter 1: Introduction

Nowadays we encounter massive amounts of data everywhere, and increased technical capabilities accelerate its generation and acquisition. This data can be of different origin and describe diverse objects, which sets the stage for active data mining in the scientific domain. There are numerous techniques and approaches for finding useful tendencies, dependencies, or underlying patterns in such data. Data derived from scientific domains is usually less homogeneous and more massive than data stemming from the business domain. Although many data mining techniques applied in business return good results for science as well, more sophisticated and tailored methods are needed to meet the needs arising in science. According to Crawford [12], there are two types of analytic tasks for science that can be supported by data mining.
Firstly, discovery-driven mining is used to derive hypotheses. Secondly, verification-driven mining is used to support (or discourage) hypotheses, i.e. experiments. In this setting, hypothesis formation requires more refined approaches and deeper domain-specific knowledge. Facing imposing data volumes, scientists experience an overload of data describing domain entities. The issue that comes along with this is that all these data sets can offer alternative (or sometimes even contradictory) perspectives on the studied data. Thus, a universal tool suitable for such data analysis is a necessary option to have at hand. Moreover, identifying correspondences between interesting aspects of studied data is a natural task in many domains. It is well known that viewing data from different perspectives is useful for a better understanding of the whole concept. Redescription mining is aimed at embodying this. Its ultimate goal is finding different ways of looking at data and extracting alternative characterizations of the same (or nearly the same) objects. As can be concluded from the name, redescription mining aims to learn a model from data in order to describe it and to help with the interpretability of the investigated results. A redescription is a way of finding objects that can be described from at least two different sides. The number of views can be larger than two, but the setting with double-sided data is more common. The following example assists in understanding the concept of redescription mining:

Example 1. We consider a set of nine countries as our objects of interest, namely Canada, Mexico, Mozambique, Chile, China, France, Russia, the United Kingdom and the USA. A simple toy data set [48, 43, 63] consisting of four properties characterizing these countries, represented as a Venn diagram in Figure 1.1, is also included. Consider the pair of statements below:

1. Country outside the Americas with land area of more than 8 million square kilometers.
2. Country is a permanent member of the UN Security Council with a history of state communism.

Figure 1.1: Geographic and geopolitical characteristics of countries represented as a Venn diagram. Adapted from [48]. Blue - located in the Americas; Green - history of state communism; Yellow - land area above 8 million square kilometers; Red - permanent member of the UN Security Council.

Two countries (Russia and China) satisfy both statements. The statements give alternative characterizations of the same subset of countries, using geographical and geopolitical properties respectively. Thus, a redescription is formed. Its strength is given by the symmetric Jaccard coefficient (here 2/2 = 1). The descriptors on either side of a derived redescription can contain more than one attribute. This simple example provides an intuition for the concept of redescription.

Thus, we are given a multi-view data set (in our case consisting of two sub-sets describing the same objects with different features). For example, in the setting of the niche-finding problem for mammals studied in [23, 49], we can be provided with one set recording which species live in particular regions, while another set contains climatic data about the same regions. A redescription mined for such a problem can be a statement that some mammal resides in terrain where the average June temperature lies in a particular range. Extracting such rules manually is very laborious, because it requires picking out particular species and investigating their peculiarities.
The application of redescription mining in bioinformatics can be associated with genes. In such a case, the task of finding such dependencies without a suitable tool seems infeasible, because the amount of data is enormous and very often incomplete. Redescriptions mined with one of the existing methods are more informative and can reveal unexpected, useful information in a domain. Of course, the usage of redescription mining techniques is not limited to these two domains. However, to make use of the obtained redescriptions, knowledge of the domain is highly recommended.

Currently, some redescription mining techniques are able to handle non-Boolean data without pre-processing; this is claimed to be a better option than transforming the data sets beforehand [18]. In a setting where one side of the data set is real-valued or categorical, redescription mining has produced meaningful outcomes. In case both data sets contain real-valued entries, exhaustive search is inevitable, which in turn might impose an unwanted computational burden. Besides this, redescription mining using decision trees, modified such that it can work with numerical entries (at least on one side), might perform well and become a competitive alternative to the aforementioned techniques. However, it had not been implemented so far; this is the starting point for the work conducted within this thesis. A stretch goal for the project is a resulting algorithm which allows both sides of the data set to be non-binary. Finally, the obtained outcomes are to be compared with redescription mining conducted by existing methods. It is also useful to test the new method in a synthetic setting to study the behavior and performance of the algorithms. After this, conclusions about the quality of the method can be made.

1.1 Outline of Document

This thesis is organized as follows:

• Chapter 1 provides an introduction to the topic.
• In Chapter 2 the problem of redescription mining is formalized. Sections 2.2 and 2.4 cover query languages and exploration strategies that can be used within algorithms for redescription mining.
• Chapter 3 is devoted to related research; namely, it covers other algorithms which share some features with redescription mining. Section 3.2 describes in detail decision tree induction methods together with impurity measures. Section 3.3 is dedicated to other existing algorithms for mining redescriptions.
• Chapter 4 describes the contributions made within this thesis. In particular, Sections 4.2 and 4.3 explain the two elaborated algorithms for redescription mining over non-binary data sets using decision trees. In Section 4.7 we outline the way we evaluate our results.
• Chapter 5 covers all experiments. In particular, Section 5.1 covers the synthetic setting, and Sections 5.3 and 5.4 report the results and discussion of experiments on the real-world data sets: biological and bibliographic, respectively. In Section 5.5 we compare the results of our algorithms to the ReReMi algorithm [18].
• Finally, Chapter 6 contains the conclusions of the thesis.

Chapter 2: Preliminaries

2.1 The Setting for Redescription Mining

We denote by O a set of elementary objects and by A a set of attributes which characterize properties of the objects or relations between them. The different sources, domains, or terminologies from which the attributes originate are denoted as a set of views V. A function v maps an attribute to the corresponding view: v : A → V.
The data set can be represented in the form of a triplet (O, A, v). Redescriptions are composed of queries.

Definition 1. An expression formed with logical operators, expressed over attributes in A and evaluated against the data set, is called a query. The set of valid queries is denoted Q and called the query language.

In order to assess a statement against a data set, it is necessary to replace the variables in the statement with objects from the data set and identify the substitutions for which the resulting ground formula holds. The support of a query q is the subset of objects for which the query holds; we denote it supp(q). All feasible substitutions for queries in a query language form the set of entities, denoted E. By att(q) we denote the set of attributes which appear in a query q. The function v is extended to queries as the union of views: v(q) = ∪_{A ∈ att(q)} v(A). To make sure that two queries describe the data from different views, their attribute sets are required to be disjoint. Similarity of supports is captured by a symmetric binary relation ∼ acting as a Boolean indicator. Finally, a set C can denote arbitrary constraints applied to redescriptions; for example, a maximal length of the set-theoretic expressions may be imposed to ensure ease of interpretation, or only conjunctions may be allowed. With this formalism, a redescription can be defined as follows:

Definition 2. Given a data set (O, A, v), a query language Q over A and a binary relation ∼, a redescription is a pair of queries (qA, qB) ∈ Q × Q such that v(qA) ∩ v(qB) = ∅ and supp(qA) ∼ supp(qB).

Redescription mining is the process of discovering such pairs. The problem of redescription mining: given a data set (O, A, v) with query language Q and the binary relation ∼, find all redescriptions that satisfy the constraints from C.

Example 2. (Based on Figure 1.1.) Here nine countries (UK, France, USA, Mexico, Chile, Canada, China, Russia, Mozambique) form the set of objects. The attributes (Blue, Yellow, Red, Green; equivalently B, Y, R, G) are split into two views: G - geography (includes B and Y) and P - geopolitics (includes R and G). Thus, the set of attributes is written as A = {B, Y, R, G}; for example, v(B) = G. A first query over geographical attributes can be written as qG = ¬B ∧ Y. In our data set this query is supported by two countries: supp(qG) = {Russia, China}. The next step is a query over geopolitics: qP = R ∧ G. Again, when evaluated against our data set, it is supported by the same two countries. Hence, supp(qP) ∼ supp(qG). Moreover, v(qP) ∩ v(qG) = {P} ∩ {G} = ∅. Then, based on Definition 2, (qG, qP) is a redescription.
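To make the example concrete, it can be checked mechanically. The following R sketch (R being the language of the implementations later in this thesis) encodes the four Boolean attributes of Figure 1.1 for the nine countries and verifies the supports and the Jaccard coefficient; the variable names are ours and not part of any package.

```r
# Toy data from Figure 1.1: one entry per country, four Boolean attributes.
countries <- c("Canada", "Chile", "China", "France", "GreatBritain",
               "Mexico", "Mozambique", "Russia", "USA")
blue   <- c(1, 1, 0, 0, 0, 1, 0, 0, 1)  # located in the Americas
green  <- c(0, 0, 1, 0, 0, 0, 1, 1, 0)  # history of state communism
yellow <- c(1, 0, 1, 0, 0, 0, 0, 1, 1)  # land area above 8 million km^2
red    <- c(0, 0, 1, 1, 1, 0, 0, 1, 1)  # permanent member of the UN Security Council

supp_qG <- countries[!blue & yellow]    # qG = not(B) and Y  (geography view)
supp_qP <- countries[red & green]       # qP = R and G       (geopolitics view)

jaccard <- length(intersect(supp_qG, supp_qP)) /
  length(union(supp_qG, supp_qP))
supp_qG   # "China" "Russia"
supp_qP   # "China" "Russia"
jaccard   # 1: a perfect redescription
```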
As can be derived from its name, redescription mining is an analysis focused on describing: it is not supposed to predict unknown data but rather to properly describe the available data. In addition, the expressiveness and interpretability of the outcome matter greatly. Expressiveness can be determined through the variety of concepts that a language can represent. Interpretability is more difficult to measure, since it concerns the ease with which the associated meaning can be grasped; nevertheless, simpler queries facilitate the interpretability of an element of the language.

When solving any redescription mining task, a collection O of elementary objects (samples) is considered, and the attributes in A characterize the properties of these objects. The set of views V denotes the various sources, domains or terminologies from which the data originate. Consider, for instance, the biological niche-finding problem: climate data on one side and fauna data on the other create two fully diverse sets of attributes that fit the setting of the given problem. In a medically related problem, these sets can be formed by personal information about a patient's background, elements of diagnosis, and symptoms. Since redescriptions are meant to find characterizations of the same (or nearly the same) objects, we require that the attributes over which the two queries of a redescription are expressed come from disjoint sets of views. As already mentioned, we stick to the two-sided setting. This means there are two data sets, denoted by L (for left) and R (for right), such that AL ∪ AR = A. In the case of multiple views, the correspondence between the elementary objects across the views might not be available. This can be caused by the fact that the sets of objects occurring in distinct views do not coincide completely, or some objects might have many observations in one view and a single one in another. Setting up these correspondences is a non-trivial task, which forms a research question of its own [54].

The purpose of redescription mining is to find alternative characterizations of almost the same objects. This means that the similarity of the supports of the queries determines the quality of a derived redescription: a pair of queries is said to be accurate if they have similar supports. More generally, the similarity relation between support sets is determined by a similarity function f together with a threshold σ such that the following holds:

Ea ∼ Eb ⟺ f(Ea, Eb) ≥ σ

The function f is usually chosen to be the Jaccard coefficient [27]. We use this coefficient as our measure of choice for accuracy, but it can easily be replaced with another set similarity function. We consider the similarity between the supports of the queries to be the main property of a redescription and call it accuracy. Thus, a pair of queries is called accurate if their supports are similar, by which we mean that their similarity passes the given threshold. Moreover, the similarity coefficient is 1 when the two supports are identical; this means we have a perfect redescription. In practice, redescriptions with a similarity coefficient less than 1 are also useful in many domains: a chain of such redescriptions can be used to connect independent entities (applicable, for example, in storytelling), or, in bioinformatics, to find genes responsible for a particular disease. For a pair of queries (qL, qR), we denote several subsets of entities:

1. E1,1 - entities that support both queries (i.e. E1,1 = supp(qL) ∩ supp(qR))
2. E1,0 - entities that support only the first query
3. E0,1 - entities that support only the second query
4. E0,0 - entities that support neither query.
As examples of similarity functions, the following can be applied:

• matching number: |E1,1| + |E0,0|
• matching ratio: (|E1,1| + |E0,0|) / (|E1,1| + |E1,0| + |E0,1| + |E0,0|)
• Russell & Rao coefficient: |E1,1| / (|E1,1| + |E1,0| + |E0,1| + |E0,0|)
• Jaccard coefficient: |E1,1| / (|E1,1| + |E1,0| + |E0,1|)
• Rogers & Tanimoto coefficient: (|E1,1| + |E0,0|) / (|E1,1| + 2|E1,0| + 2|E0,1| + |E0,0|)
• Dice coefficient: 2|E1,1| / (2|E1,1| + |E1,0| + |E0,1|)

The choice of the Jaccard coefficient is the most common when evaluating redescriptions. This is due to its simplicity and its agreement with the symmetric approach adopted in redescription mining: the Jaccard coefficient includes the supports of the two queries equally. Moreover, it is scaled to the unit interval without involving the set of entities that support neither query, E0,0.
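All of these coefficients can be computed from the four counts |E1,1|, |E1,0|, |E0,1| and |E0,0| alone. The following R sketch (ours, simply transcribing the formulas above) illustrates this:

```r
# Similarity coefficients of a query pair from the four entity counts;
# n11, n10, n01, n00 stand for |E_{1,1}|, |E_{1,0}|, |E_{0,1}|, |E_{0,0}|.
similarities <- function(n11, n10, n01, n00) {
  total <- n11 + n10 + n01 + n00
  c(matching_number = n11 + n00,
    matching_ratio  = (n11 + n00) / total,
    russell_rao     = n11 / total,
    jaccard         = n11 / (n11 + n10 + n01),
    rogers_tanimoto = (n11 + n00) / (n11 + 2 * (n10 + n01) + n00),
    dice            = 2 * n11 / (2 * n11 + n10 + n01))
}

# Supports overlapping on 2 entities, each query covering 1 extra entity,
# and 5 entities supporting neither query:
similarities(n11 = 2, n10 = 1, n01 = 1, n00 = 5)   # jaccard = 0.5, dice = 2/3, ...
```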
2.2 Query Languages

The way we represent the results of redescription mining is determined by the query language; query languages are an essential part of the whole redescription mining technique. Queries are logical statements that are evaluated against the given data set. These statements are obtained by combining distinct predicates using Boolean operators. We can replace the predicate variables with objects from the given data set; verifying whether the conditions of the predicates are satisfied returns a truth value. The objects which satisfy a given query constitute the support of this query. In this part we cover different types of query languages. In particular, we determine the query structures which are used for redescription mining; they offer a representation of logical combinations of constraints on a variety of individual attributes. Previous papers covering redescription mining have also discussed diverse formal representations of queries and query languages [48, 20].

2.3 Propositional Queries, Predicates and Statements

Queries are formed by logical statements evaluated against the data set. These statements are built from atomic predicates over individual attributes using Boolean operators. Substituting the predicate variables with objects from the data set and verifying whether the conditions of the predicates are satisfied returns a truth value. The object tuples in substitutions satisfying the statement form the support of the query. We define a query language as the set of acceptable queries, dependent on the supported types of attributes and the principles for building predicates. The syntactic rules for combining predicates into statements also belong to the query language we use.

In this thesis we focus on propositional data sets. They contain attributes characterizing properties of individual objects. Sets of objects are deemed to be homogeneous, i.e. every attribute applies to all objects. A data set is called propositional if it contains attributes which characterize properties of distinct objects. In this setting, the values which the attributes from A take form a matrix D. This matrix contains |O| rows, one row per object, and |A| columns, each corresponding to an attribute. Thus, the value of an attribute Aj ∈ A is defined as D(i, j) = Aj(oi) for objects oi ∈ O.

Let us consider an example from [17] to illustrate query languages. The data set in Table 2.1 contains countries as objects. Each column represents some property of a country (geographical details). This data can be expressed as a matrix G with 7 columns, G = {G1, G2, ..., G7}, where each Gn is a vector corresponding to one property, e.g. maximal elevation or continent.

Table 2.1: Example data set. World countries with their attributes.

Country       | G1 | G2 | G3 | G4 | G5           | G6    | G7
Canada        | 0  | 1  | 0  | 1  | N. America   | 9.98  | 5959
Chile         | 1  | 1  | 0  | 1  | S. America   | 0.76  | 6893
China         | 0  | 0  | 0  | 1  | Asia         | 9.71  | 8850
France        | 0  | 1  | 0  | 0  | Europe       | 0.64  | 4810
Great Britain | 0  | 1  | 0  | 0  | Europe       | 0.24  | 1343
Mexico        | 0  | 1  | 0  | 1  | N. America   | 1.96  | 5636
Mozambique    | 1  | 0  | 1  | 0  | Africa       | 0.79  | 2436
Russia        | 0  | 0  | 0  | 1  | Asia, Europe | 17.1  | 5642
USA           | 0  | 1  | 0  | 1  | N. America   | 9.63  | 6194

The 7 vectors constitute the following features:

1. G1 - location in the Southern Hemisphere
2. G2 - border with the Atlantic Ocean
3. G3 - border with the Indian Ocean
4. G4 - border with the Pacific Ocean
5. G5 - localization on a continent
6. G6 - land area (10^6 km^2)
7. G7 - maximal elevation of the surface in meters

2.3.1 Predicates

Attributes take values which compose a range. We restrict the values to a selected subset of this range and thereby construct a predicate from an attribute. Consider some attribute Aj ∈ A with range R. Having fixed a subset RS ⊆ R, it is possible to transform the associated data column into a truth value assignment, i.e. we turn it into a Boolean vector which indicates which values lie within the fixed range. This predicate is denoted [Aj ∈ RS]. It selects the subset of objects whose attribute Aj takes a value in RS; membership in this subset can be written as s(Aj, RS) = {oi ∈ O : Aj(oi) ∈ RS}. Based on their ranges, attributes can be segregated into types: Boolean, nominal and real-valued.

Boolean predicates. Boolean attributes take only two values, true or false, or equivalently 1 or 0. The interpretation of a Boolean variable naturally creates a predicate. For simplicity, the true value assignment (i.e. [A = true]) is written simply as A; [A = false] is the complementary assignment, which can be written with negation (i.e. ¬A). From the example above, vector G3 is a Boolean attribute corresponding to a predicate with the following truth assignment for this data:

⟨0, 0, 0, 0, 0, 0, 1, 0, 0⟩

Thus, the one country (Mozambique) which has a border with the Indian Ocean is selected.

Nominal predicates. An attribute A is called nominal when its range is a non-ordered set C or its power set. The elements of C are the categories of the attribute A. To obtain a truth value assignment, a subset of the categories CS ⊆ C is chosen; alternatively, a single category c ∈ C is selected. Thus, nominal predicates are written as [A ∈ CS] and [A = c]. In practice, we consider only nominal attributes which take a single value. From the above example, six countries have a border with the Pacific Ocean; the corresponding predicate on G4 is satisfied by the truth assignment ⟨1, 1, 1, 0, 0, 1, 0, 1, 1⟩. If we look at the continent vector (G5), the attribute becomes multi-valued, because Russia falls into two categories, Asia and Europe. In practice, multi-valued attributes are expressed via several Boolean attributes, one per category.

Real-valued predicates. An attribute A is considered real-valued if its range is a set of real numbers R ⊆ ℝ. A truth value assignment is derived by selecting any subset of R.
Nevertheless, for ease of interpretation the truth value assignment is made based on a contiguous subset of R; that is, we use [A ∈ [a, b]] to denote an interval [a, b] ⊆ R. For any given real-valued attribute there are infinitely many possible intervals, several of which result in the same truth value assignment. Thus, the design of the query language must also involve a criterion to select one among such equivalent intervals. For illustration, consider the following:

G7 = ⟨5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194⟩

For a pair (a, b) with a ∈ (2000, 2200) and b ∈ (5000, 5500), the truth value assignment will look like:

[a ≤ G7 ≤ b] = ⟨0, 0, 0, 1, 0, 0, 1, 0, 0⟩

As a result we get several equivalent intervals for this truth value assignment, for example [2200 ≤ G7 ≤ 5000] and [2436 ≤ G7 ≤ 4810]. The decision between them depends on whether rounded bounds are considered to have better interpretability or not, which in turn depends on the task or problem we work with. For instance, rounded bounds can be adopted when we work with big data sets involving many countries, where the range of values is large. For smaller data sets (e.g. the one considered here with 9 countries), exact bounds might be more desirable, because they provide a more precise description of each studied country.

2.3.2 Statements

The predicates discussed previously are used as the pieces from which statements are constructed. Propositional predicates are joined with the help of the Boolean operators:

1. negation ('¬');
2. conjunction ('∧');
3. disjunction ('∨').

The truth assignment for a query is derived by combining the truth assignments of the individual predicates. The resulting subset of objects is the support of the query; namely, the support of query q on D, suppD(q), is the set {o ∈ O : q is true for o}. For example, the query satisfied by countries with an Atlantic Ocean border but without a Pacific Ocean border and with maximal elevation less than 4500 meters looks as follows:

q1 = G2 ∧ ¬G4 ∧ [G7 < 4500]

The size of the support of this query is 1, since only Great Britain in our data set is characterized by these features.

Let us now move to possible query languages which deploy the predicates and statements from above. One of the most limited and restricted query languages is that of monotone conjunctions: all predicates are allowed to be combined only with the conjunction operator. For example, the following query over the running example is a monotone conjunctive query:

q2 = G1 ∧ G4 ∧ [2000 ≤ G7 ≤ 5000]

The first query q1 is not a member of this query language because it is not monotone. Queries of this type (monotone conjunctions) correspond to itemsets in which every predicate represents an item. Itemsets are vigorously studied in the literature, and algorithms to mine frequent itemsets have received increased interest [24, 11]. For example, it is possible to partially order such queries by inclusion and exploit the downward closure property: if some query qi is a subset of some query qj, then the support of qi is a superset of qj's support. Thus, the search space can be explored more efficiently. Monotone conjunctions are easy to find and interpret; at the same time, forbidding disjunctions and negations restricts the expressiveness of the mined queries.
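These predicates and queries can be evaluated directly against Table 2.1. A small R sketch of this evaluation (ours; the vectors transcribe the table's columns):

```r
# Columns of Table 2.1, countries ordered as in the table.
country <- c("Canada", "Chile", "China", "France", "GreatBritain",
             "Mexico", "Mozambique", "Russia", "USA")
G1 <- c(0, 1, 0, 0, 0, 0, 1, 0, 0)  # Southern Hemisphere
G2 <- c(1, 1, 0, 1, 1, 1, 0, 0, 1)  # borders the Atlantic Ocean
G4 <- c(1, 1, 1, 0, 0, 1, 0, 1, 1)  # borders the Pacific Ocean
G7 <- c(5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194)  # max elevation (m)

# q1 = G2 AND NOT G4 AND [G7 < 4500]
q1 <- G2 & !G4 & (G7 < 4500)
country[q1]    # "GreatBritain": support of size 1, as stated above

# q2 = G1 AND G4 AND [2000 <= G7 <= 5000], a monotone conjunctive query
q2 <- G1 & G4 & (G7 >= 2000 & G7 <= 5000)
country[q2]    # empty on this toy data, though the query is syntactically valid
```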
The opposite extreme to monotone conjunctions is unrestricted queries. Here predicates are allowed to be combined using any of the above-mentioned operators without any restrictions; this extreme case provides full expressiveness for the queries. Examples of unrestricted queries can look as follows:

q3 = (G2 ∧ G4 ∧ G1) ∨ ¬G3
q4 = G2 ∧ [G6 < 1.9]
q5 = (¬G1) ∧ ([G5 = Asia] ∧ G3) ∧ [1.9 ≤ G6 ≤ 7.6] ∧ ¬G4
q6 = [2000 ≤ G7] ∧ G1 ∧ [1200 ≤ G7 ≤ 8000]

Both queries mentioned before belong to this query language as well. The expressiveness of unrestricted queries is maximal, but the queries can become more difficult to interpret. For example, they can contain deeply nested structures, i.e. a query involving numerous attributes in a complex structure. Even if the support of such a query matches the support of another query very well (so the redescription formed by these queries is highly accurate), the interpretation of this redescription will be obstructed by the many entangled conditions; as a consequence, the redescription loses its interestingness. Moreover, the space formed by such redescriptions looks disordered and becomes difficult to search. Hence, here we observe a rich structure of queries and full expressiveness, while nested structures make queries hard to interpret; a balance between expressiveness and interpretability is the most desirable feature.

A compromise between these two languages can be a linearly parsable query language, where queries are formed with the help of a simple formal grammar. Moreover, to ensure ease of interpretability it is possible to apply some moderate restrictions, for example allowing every attribute to appear only once.

In theory, the selection of a query language should be performed before adopting the algorithm, but practical constraints very often influence the choice; that is, the adopted algorithm might naturally result in a particular query language. For example, linearly parsable queries are natural for algorithms with iterative atomic extensions which append a new literal to a query at each iteration [20]. In this thesis we exploit decision tree induction to mine redescriptions, which affects the query language we use. We stick to data sets with Boolean predicates on the one side and real-valued ones on the other. We avoid the use of negations by flipping the sign of the comparison: for a Boolean predicate, instead of ¬G1 we would have [G1 < 0.5], meaning '0' (i.e. 'false'), and [G1 ≥ 0.5], meaning 'true' or '1'. If necessary, negations can be used as well. Also, we allow both conjunctions and disjunctions, to provide expressiveness of the resulting queries, and there is no restriction for a predicate to appear only once in a query.
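The sign-flipping convention is easy to see in code: when a Boolean attribute is stored as 0/1, a numeric threshold test at 0.5 reproduces the predicate and its negation, which is how a decision tree over such data can express negations without a negation operator. A minimal R illustration (ours):

```r
G1 <- c(0, 1, 0, 0, 0, 0, 1, 0, 0)   # a Boolean attribute stored as 0/1

# The predicate G1 (i.e. [G1 = true]) becomes the threshold test [G1 >= 0.5],
# and its negation !G1 becomes [G1 < 0.5].
all((G1 >= 0.5) == (G1 == 1))   # TRUE
all((G1 <  0.5) == (G1 == 0))   # TRUE: negation expressed by flipping the sign
```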
2.4 Exploration Strategies

There exist several strategies for redescription mining, i.e. a few approaches for finding redescriptions given a query language and a space of possible queries. Different constraints on the redescriptions might be used as well; the combination of these parameters results in different search spaces. Some properties (such as anti-monotonicity) assist in a more effective redescription mining process. There are three main generic exploration strategies for redescription mining.

2.4.1 Mining and Pairing

This simple strategy consists of two main steps. Firstly, individual queries are mined from the different data sets. Secondly, these queries are combined into pairs based on the similarity of their supports. Thus, a redescription is formed from two similar queries from different data sets. In recent times several authors have devised algorithms to mine queries over a fixed set of propositional predicates [6, 11, 62]. This approach has some traits which make it suitable for data sets with a small number of views, because finding separate queries and pairing them later can be performed very effectively. In contrast, when data sets contain imposing numbers of views, this exploration strategy results in queries over all predicates pooled together. This scheme is advantageous because it allows the adaptation of frequent itemset mining algorithms for mining redescriptions. As an extension of this independent mining with subsequent pairing, the second step can be replaced with a splitting procedure: all predicates are pooled together for the first mining step, and the mined queries are later split according to views. Nevertheless, the fact that a query exists does not guarantee that it can be split into several smaller ones. When we have data coming from two different views, we can mine monotone conjunctive redescriptions in a level-wise fashion, similarly to the algorithm from [6, 38] called Apriori. The supports of queries and their intersections can be used safely for pruning since they are anti-monotonic. Finally, this exploration strategy finds its best applications in the case of exhaustive search, hence when the sets are small enough not to cause an undesirable computational burden.

2.4.2 Greedy Atomic Updates

The next exploration strategy is based on an iterative search for the best atomic update of the current query pair. That is, one tries to apply atomic operations to one of the queries such that the resulting redescription becomes better; this process continues until no further improvement is possible. Atomic updates are operations which include the addition, deletion and editing of predicates: a predicate can be added, removed or changed (for example, negated). In order to prevent the algorithm from forming cycles, it is possible to remember the queries which have already been explored. As a starting point, a pair of perfectly matching queries from distinct views can be selected. This approach was first proposed by Gallo et al. [20] and used only addition operations to update the query. Later it was extended to the non-Boolean setting with the ReReMi algorithm [18]; there the issue of missing entries was also covered, since it is a highly relevant aspect when working with real data.

2.4.3 Alternating Scheme

One more approach to building redescriptions is the alternating scheme. We use it as the main exploration strategy in this thesis, because the algorithms we elaborate are based on decision tree induction. The main idea behind this strategy is to find one query and then find another one which matches it well; then the first query is replaced with a new one which makes a better match. The alternations are continued until no better match can be made or a stopping criterion is met. For example, we start with a query q_L^(0) on the left-hand side and search for a well-matching query q_R^(1) on the right-hand side. Then we proceed again to the left-hand side and try to find another query q_L^(2) that matches the one derived on the right. The algorithm runs in this manner until termination.
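The loop below is a simplified R sketch of this scheme, not the exact procedure of any published algorithm: it plants an approximate redescription in synthetic two-view data and alternates between the sides, each time fitting a shallow rpart tree (the classifier used later in this thesis) to the support produced by the other side.

```r
library(rpart)

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Synthetic two-view data over the same 200 objects: Boolean attributes on the
# left, numeric on the right, with l1 planted to nearly coincide with r1 > 0.6.
set.seed(1)
n <- 200
right <- data.frame(r1 = runif(n), r2 = runif(n))
left  <- data.frame(l1 = as.integer(right$r1 > 0.6 | runif(n) < 0.05),
                    l2 = rbinom(n, 1, 0.5))

target <- left$l1 == 1      # q_L^(0): support of a single left-hand predicate
best_acc <- -1
for (i in 1:5) {
  # Left side fixed: learn a right-hand-side tree predicting the target support.
  treeR <- rpart(factor(target) ~ ., data = right, method = "class",
                 control = rpart.control(maxdepth = 2))
  suppR <- predict(treeR, right, type = "class") == "TRUE"
  # Right side fixed: re-learn the left-hand-side tree to match its support.
  treeL <- rpart(factor(suppR) ~ ., data = left, method = "class",
                 control = rpart.control(maxdepth = 2))
  suppL <- predict(treeL, left, type = "class") == "TRUE"
  acc <- jaccard(which(suppL), which(suppR))
  if (!is.finite(acc) || acc <= best_acc) break  # terminate: no improvement
  best_acc <- acc
  target <- suppL           # alternate: new target for the next round
}
best_acc                    # Jaccard accuracy of the best query pair found
```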
If one side of the redescription is fixed, the task of finding an optimal query for the other side can be defined as a binary classification task: entities that belong to the support of the fixed query are positive examples, while the entities not in the support are negative examples. Hence, the redescription mining task can potentially be solved with the help of any feature-based classification technique compatible with the query language. Finding a proper starting point for the alternating scheme is a question of the quality of the method on its own. The simplest option is to randomly split the data into examples and use this partition for initialization, or to start with queries consisting of only one predicate. Having fixed the number of starting points and the number of allowed alternations, the complexity of such an approach depends mainly on the complexity of the classification algorithm chosen for the alternations. In this thesis we focus on the alternating scheme for the redescription mining task, and as the classification algorithm we use decision tree induction. This idea is not new: it was first adopted by the CARTwheels algorithm [48], which processes binary data sets and mines redescriptions by matching the terminal nodes (leaf nodes) into pairs of queries.

Chapter 3: Related Research

3.1 Rule Discovery

The main feature inherent to redescription mining is 'multi-views', which implies the description of entities with the help of different sets of variables. Nevertheless, this multi-view feature is not unique to redescription mining. One of the most common similar approaches is supervised classification [57], yet it is not always perceived as such: in classification, entities are characterized by the observations on the one hand and by the class label on the other hand. The idea of viewing the same objects from different angles was introduced by Yarowsky [60], who initiated the aforementioned multi-view learning approaches. This was followed by Blum and Mitchell [7] and evoked high interest in the topic. Mining a single query can be treated as a classification task: when we fix one query, we get binary class labels, and we are looking for a good classifier for them. A particular example where we have Boolean attributes and targets is Logical Analysis of Data [8]; its purpose is finding an optimal classifier of a pre-determined form (e.g. DNF, CNF, a Horn clause, etc.). Multi-label classification bears somewhat more resemblance to redescription mining [55]; here classifiers are to be learned for conjunctions of labels. The restriction to conjunctions and the focus on prediction (not description) are the main differences between this approach and redescription mining. Moreover, there are several more approaches that are somewhat similar to redescription mining. Emerging Patterns [41] targets Boolean data and itemsets (monotone conjunctive queries): it tries to detect those itemsets whose presence depends statistically on the negative or positive label assignment of the objects. In the case of a perfect outcome, the itemset resides solely in the positive examples and constitutes a perfect classifier for the given data set. One more approach that can be related to redescription mining is Contrast Set Mining [41]; it can be used to detect monotone conjunctive queries which best discriminate some distinct class from all other objects in the data. Subgroup Discovery [56] can also be mentioned here.
It aims to find a query such that all objects from the determined subgroup possess atypical values of the target attribute compared to the other objects.

Taking everything into account, the main differences between redescription mining and these approaches can be underlined as follows: the goal of redescription mining is to find, simultaneously, multiple descriptions of a subset of entities which was not previously determined, selecting only several relevant variables among a big variety. Moreover, the redescription mining problem is one-dimensional even though there are two sets of describing attributes: queries are constructed over one set of attributes, determining subgroups whose quality is measured by their ability to be described by queries over the second set of attributes.

3.2 Decision Trees and Impurity Measures

Decision trees. Regardless of the domain where decision trees are used, their aim is to use a given set of attributes to classify data into a set of predefined classes. Firstly, a training data set is used to let the tree learn the specifics of the data: the algorithm splits the source set into several subsets based on an attribute value, and this process is repeated on each resulting subset in a recursive manner, called recursive partitioning. The recursion is complete when all objects falling into the same node carry the same class label, or when further splitting no longer adds value to the predictions. Secondly, test data sets are used to evaluate the accuracy of the built tree, i.e. to determine whether it is able to classify data properly. By properly, we mean placing each object into the correct class (i.e. minimizing the instances of misclassification). A decision tree with multiple discrete class labels is called a classification tree. Tree-based models have a variety of uses, from spam filtering [16] to astrophysics [28]. The concept of decision trees is not new: it was introduced in 1966 by Hunt, Marin, and Stone [13]. In this thesis we mainly concentrate on classification trees, because the trees are used to mine redescriptions of the same (or nearly the same) objects. For example, in the biological niche-finding problem we do not focus on predicting the climatic conditions of any species; on the contrary, the idea is to find specific information about mammals which already live in particular surroundings.

Decision trees were one of the earliest methods used to build classifiers [34]. They have several advantages: they are easily interpretable by human experts; they provide effective induction and accuracy; and they are comparatively easy to build. When using decision trees, it is important to determine the algorithm used to actually build the tree. This includes an investigation of the different splitting rules used (for example, Information Gain, Entropy, Gini [34, 10]), because the quality of the result might depend strongly on the choice of these parameters. There exist numerous implementations which are scalable and effective [9]; some of them are more suitable for smaller data, and vice versa. Thus, the mechanism used to build a decision tree is to be studied in detail in order to provide a strong basis for redescription mining built on this approach. In general, given a set of attributes, exponentially many decision trees can be constructed from it, and the resulting trees will differ in their accuracy.
However, finding the optimal tree is usually infeasible, since the search space is of exponential size. Nevertheless, there are numerous efficient algorithms that produce decision trees of reasonable quality within an acceptable time span. They mainly use a greedy strategy that deepens the tree by making a succession of locally optimal decisions. One of the best-known algorithms of this type is Hunt's algorithm [42]. It is used as a base in many common algorithms, e.g. ID3 [46], C4.5 [47], and CART [34]. Hunt's algorithm [13] grows a tree in a recursive fashion by partitioning the training set into successively purer subsets.

Any algorithm used for decision tree induction must deal with two main aspects. The first is how to split the training set: at each recursive step of growing the tree, the algorithm must split the training data into smaller subsets. To do so, the algorithm must provide a method which specifies the test conditions for attributes of diverse types; in addition, a way of measuring the goodness of every test condition should be defined. These 'goodness measures' are commonly called impurity measures and are discussed further below. The second aspect is that a stopping criterion must be determined. The easiest approach is to terminate the tree-growing process whenever all of the entries in the nodes belong to the corresponding classes (i.e. the nodes are pure) or all entries have identical attribute values. These two conditions are enough to terminate any tree-building algorithm; however, early termination has some advantages. In this thesis we focus on the most famous algorithm for decision tree induction, called CART [34].

Classification and Regression Trees (CART). CART was first introduced by Breiman et al. [34]. It was invented independently in the same time span as ID3 [46], and both use a similar approach for learning a decision tree from training tuples. CART is a non-parametric decision tree training technique which returns classification or regression trees. It is among the most popular data mining techniques for classification; it revolutionized the field of advanced analytics and allowed data mining to move to a new level [1]. It is a statistical approach that allows selecting, from a huge number of explanatory variables, those which are most important for determining the response variable to be explained. Decision trees partition (split) the data into mutually exclusive nodes (groups), and the nodes are supposed to be maximally pure. The building process begins with a root node, which contains all objects; these are then split into nodes by recursive binary splitting, where each split is determined by a simple rule based on a single explanatory variable. The steps performed by CART to grow a classifier can be expressed as follows [34] (a sketch of the split search in steps 2-4 is given after the list):

1. All objects are assigned to a root node;
2. All possible splits, i.e. pairs of an explanatory variable and an attribute value (splitting rules), are generated;
3. For each split from the previous stage, the objects are separated from the parent into two child nodes, based on whether their value is lower or higher than the split value;
4. The variable and value from step 2 which yield the highest reduction of impurity are picked (impurity measures are discussed later in this section);
5. The split into two child nodes is conducted according to the selected splitting rule;
6. Steps 2-5 are repeated, applying them to all child nodes as if they were parents, until the tree has maximal size;
7. The tree is pruned with the help of cross-validation [31] to return the tree of optimal size.
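Steps 2-4 amount to an exhaustive search over candidate split values, scored by the reduction in impurity. A minimal R sketch of this search for a single numeric explanatory variable (ours; the Gini impurity used here is defined formally later in this section):

```r
# Gini impurity of a vector of class labels.
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Steps 2-4: try every midpoint between consecutive distinct values of x
# and return the split with the largest impurity reduction.
best_split <- function(x, y) {
  u <- sort(unique(x))
  candidates <- head(u, -1) + diff(u) / 2
  reductions <- sapply(candidates, function(s) {
    l <- y[x <= s]; r <- y[x > s]
    gini(y) - (length(l) * gini(l) + length(r) * gini(r)) / length(y)
  })
  list(value = candidates[which.max(reductions)], reduction = max(reductions))
}

# Example: the elevation column G7 of Table 2.1 against the label [G7 > 5000].
G7    <- c(5959, 6893, 8850, 4810, 1343, 5636, 2436, 5642, 6194)
label <- c(1, 1, 1, 0, 0, 1, 0, 1, 1)
best_split(G7, label)   # splits at 5223, separating the two classes perfectly
```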
The pruning algorithm attempts to balance the optimistic estimate of the empirical risk with the help of an added complexity term which penalizes bigger sub-trees. In cross-validation, some objects are randomly removed from the data and are then used to assess the predictive power of the tree. A common alternative idea, stopping the building of the tree early (early termination), can result in insufficient coverage of the interactions between explanatory variables [51]. That is why in CART the tree is allowed to grow to maximal size. In these maximal trees, all nodes will either be small (containing a single object or a desired, predetermined number of objects) or pure (i.e. no further split is needed). A tree of this type is overfitted: it fits not only the data but also the noise and idiosyncrasies of the training set. Hence, the next steps are dedicated to pruning: the branches whose removal leads to the smallest decrease in accuracy, compared to pruning the other branches, are pruned. For each sub-tree T, the cost-complexity measure Rα(T) is [44]:

Rα(T) = R(T) + α|T|

Here |T| is the number of terminal nodes (the complexity of the sub-tree), α is the complexity parameter, and R(T) is the overall misclassification rate for classification trees, or the total residual sum of squares for regression trees. Every value of α has a corresponding unique smallest tree that minimizes the cost-complexity measure. As the complexity parameter increases from 0, this yields a nested sequence of trees which become smaller in size [44]. Each of them is the best tree of its size, so the selection of the best tree can be transformed into the problem of selecting the best size. Cross-validation defines this optimal size: the data set is randomly divided into N subsets (N is commonly set to 10); one of these subsets is used as a test set, while the other N − 1 sets are grouped together and used as the learning data set. The tree is grown and pruned N times, each time with a different subset playing the role of the test set. A prediction error (the sum of squared differences from the observations) is calculated for every size of decision tree, averaged over all subsets, and matched with the sub-trees of the complete data set using the values of α. The optimally sized tree is the one with the lowest cost-complexity measure [31].

CART assumes that the samples are independent when computing the classification rules [36]. Models produced by CART have positive features: the input data is not required to follow the normal distribution, and the predictor variables do not have to be independent. It is possible to model non-linear relations between the predictor variables and the observed data. CART enables an evaluation of the importance of the diverse explanatory variables for defining the splitting rule and splitting value; the technique used for this is the 'variable ranking method' [45]. Variables which do not show up in the resulting tree can be regarded as less important for the description of the data set. CART has numerous implementations that undergo continuous changes, extensions and improvements, and new components are written to make it more convenient or more specific to distinct domains. In the implementations of our algorithms for redescription mining we use the available packages in R, namely rpart [52] and rattle [59].
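In rpart, the choice of splitting criterion and the cost-complexity pruning described above are exposed directly. A brief usage sketch (on R's built-in iris data, not on the thesis data sets):

```r
library(rpart)

# Grow a classification tree; split = "gini" or "information" selects the
# impurity measure, and cp = 0 lets the tree grow to (near) maximal size.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "gini"),
             control = rpart.control(cp = 0, minsplit = 2))

# The cp table lists, for each value of the complexity parameter (alpha above,
# called cp here), the cross-validated error of the corresponding sub-tree.
printcp(fit)

# Prune back to the sub-tree with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```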
Impurity measures. The main aspect in decision tree building is deciding how to split the data set. The 'goodness' of a split is evaluated by an impurity measure, which is in fact a function assessing how well a particular split separates the data into classes. The impurity measure is an objective to be minimized at each intermediate stage of decision tree building. In general, an impurity measure should satisfy:

1. it should be largest when the data is split evenly over the attribute values;
2. it should be 0 when all data belongs to the same class.

Quinlan's information measure (Entropy). Originally, Quinlan proposed to measure this 'goodness' based on a classic formula from information theory:

−∑_i p_i log(p_i)

with p_i the probability of the i-th message. Thus, the outcome depends entirely on the probabilities of the possible messages. If their probabilities are equal, there is the greatest amount of uncertainty, and the information gained is the greatest; consequently, if they are not uniform, less information is gained. The value of this objective function also depends on the number of messages. The entropy of a pure node is zero, because then the probability becomes 1 and log(1) = 0; conversely, entropy is maximal when all classes have equal probability of appearing.

Information gain. One of the most common impurity measures used while building decision trees in various implementations is Information Gain (IG), which is in fact a difference of entropies (i.e. it also involves the computation of the entropy of the nodes). Information Gain, popularized by Quinlan in [46], is the expected reduction in entropy caused by partitioning the objects according to a given attribute. Let C denote a set which has p objects of one class (P) and n objects of another class (N). If the decision tree is accurate, it is supposed to classify these objects in the same proportion as they are present in C. As the root of the decision tree an attribute A (with values {A1, A2, ..., Av}) is picked, so that it partitions C into {C1, C2, ..., Cv}. If Ci contains pi objects of class P and ni of class N, and the expected information required for the sub-tree of Ci is denoted I(pi, ni), then the expected information needed for the tree with A as root is defined as the weighted average:

E(A) = ∑_{i=1}^{v} ((pi + ni) / (p + n)) I(pi, ni)

The information gained by using A as the root is then defined as follows:

IG(A) = I(p, n) − E(A)

Whenever Information Gain is used as the impurity measure in a decision tree algorithm, all candidate attributes are investigated and the one which maximizes the information gain is chosen. The process then continues on the residual subsets {C1, C2, ..., Cv}.

Classification error. The classification error can also be used as an impurity measure. It too aims to determine the 'goodness' of a split at a node, by considering the entries which go to the child nodes. It measures the misclassification error made by a node t and looks as follows:

Error(t) = 1 − max_i P(i | t)

Thus, the classification error is maximal (1 − 1/N_classes) when all entries are evenly distributed across the classes; this means we gain the least interesting information. The classification error becomes minimal (Error = 0) when all entries represent the same class.

The Gini index (Gini). An impurity measure very similar to Quinlan's, called the Gini index, was presented by Breiman et al. [34].
Gini measures how often a randomly chosen element from the data would be erroneously labeled if it were randomly labeled according to the distribution of labels in the subset. The Gini impurity is computed as the sum, over items, of the probability of each item being chosen multiplied by the probability of a mistake in categorizing this item. Thus, it equals zero when all entries of a node belong to the same class. Formally, Gini looks as follows:

Gini = 1 − ∑_i p_i^2

with p_i the probability of each class appearing. In case we have a pure class in a node, the probability becomes 1 (Gini = 1 − 1^2 = 0). Similarly to entropy, Gini becomes maximal when all classes carry equal probability. Originally, Gini measures the probability of misclassification of a set of objects, rather than the impurity of a split. The Gini index and Information Gain are the most commonly used measures in classifiers built with the help of decision trees. However, the Gini index behaves slightly differently with data: as mentioned before, Information Gain tries to split the data into distinct classes, whereas Gini seeks the largest class and extracts it first; then, in the residual data, it looks for the next attribute which helps in extracting the next largest class. This continues until the final tree is built. If the data is such that the split into classes is quite clear, the tree will end up with pure nodes (i.e. leaf nodes that contain objects of only one class); in practice, pure decision trees are attainable only in very rare circumstances. In our algorithms for redescription mining we use Information Gain and the Gini index as impurity measures; we discuss their effect on different real-world data sets in Subsection 5.3.1.
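The three measures can be transcribed directly from the formulas above; a small R sketch (ours), computing them from a vector of class proportions:

```r
# Impurity measures for a node given as a vector p of class proportions
# (base-2 logarithm used for the entropy).
entropy     <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
gini_index  <- function(p) 1 - sum(p^2)
class_error <- function(p) 1 - max(p)

p_pure <- c(1, 0)       # all objects of one class
p_even <- c(0.5, 0.5)   # evenly split between two classes

entropy(p_pure);     entropy(p_even)      # 0 and 1: maximal when uniform
gini_index(p_pure);  gini_index(p_even)   # 0 and 0.5
class_error(p_pure); class_error(p_even)  # 0 and 0.5 = 1 - 1/N_classes
```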
There, the redescription mining task is viewed as a form of conceptual clustering, where the goal is to identify clusters that afford dual characterizations, i.e. the mined clusters are required to have two meaningful descriptions. The greedy search algorithm used in [20] to mine redescriptions was extended with efficient on-the-fly discretization by the ReReMi algorithm, introduced in [18]. This algorithm defines initial pairs of variables from the data set and updates them until no further improvements can be made; updates include the addition, deletion and editing of predicates.

Since in this thesis we use decision tree induction to elaborate algorithms for redescription mining over non-binary data sets, the existing approach which exploits the same idea needs to be covered: CART was previously incorporated into the CARTwheels algorithm [48] to mine redescriptions.

CARTwheels. The main contribution that made redescription mining a relevant research topic was made by Ramakrishnan et al. in 2004 [48]. There, the CARTwheels algorithm was presented, which derives redescriptions with the help of decision trees grown in opposite directions and then matched at their leaves. Further, the authors of [43] explicitly showed applications of redescriptions and structured the formalism of the topic. Moreover, Kumar in his PhD thesis [33] exploits decision trees as well, to characterize sets of genes; he also extended CARTwheels by presenting a theoretical framework which allows systematic exploration of the redescription space, the involvement of redescriptions across domains, etc.

CARTwheels was the first redescription mining algorithm involving decision tree induction [48]. It uses two binary data sets to grow two decision trees which are matched at their leaves in the end. When they are matched, the paths which lead to similar leaf nodes can be written as queries to form redescriptions. In the original paper [48] the authors use a data set consisting of two matrices, where they assign class labels based on a greedy set covering of the objects using the entities of the left-hand side part. The decision conditions from one tree are combined with the corresponding decision conditions from the second tree; hence, the paths which lead to the same class can be treated as queries that form a redescription. The algorithm returns as many redescriptions as there are matching paths in the pair of resulting trees. This approach selects paths in the grown trees and combines the splitting rules of the corresponding terminal nodes via Boolean operators; negations are involved whenever a path belongs to the 'no' side of the decision tree. As a visual example, a pair of resulting trees used to form redescriptions is depicted in Figure 3.1.

Figure 3.1: Tree growing with alternations by CARTwheels. (left) The tree defines set-theoretic expressions to be matched. (middle) The bottom tree is grown to match the first one. (right) The bottom tree is fixed and the top tree is re-grown to match the leaves. Arrows represent matching paths which form redescriptions. (Following [48])

Fig. 3.1 shows three frames of the tree growing process; the right-most frame depicts the final version of the trees that form redescriptions. The matching paths can be written as the following redescriptions:

(X3 ∩ X1) ∪ (X4 − X3) ←→ Y4
(O − X3 − X4) ←→ (Y3 − Y4)
(X3 − X1) ←→ (O − Y3 − Y4)

These alternations can be continued until the leaves match well enough or until the maximal number of unsuccessful alternations is reached.
However, it is important to notice that in this approach the authors set the depth as a constant (in this example, d = 2) and re-grow trees of the same depth in each iteration. The CARTwheels algorithm uses the duality between path partitions and class partitions; thus, the crucial issue is to combine paths into redescriptions only when they lead to the same class label. Further evaluation of this partition determines the quality of the result. CARTwheels results in a single pair of trees, which are re-grown with the same depth, fixed by the user in advance, and cover the whole data set. In this thesis, inspired by the idea of using decision trees for redescription mining, we elaborate two algorithms which grow decision trees to match at their leaves while gradually increasing the depth. We also enable them to, firstly, work with real-valued data on one side and, secondly, process real-valued attributes on both sides, using the data discretization routine described in Section 4.6.

Chapter 4
Contributions

4.1 Redescription Mining Over non-Binary Data Sets

Redescription mining techniques based on decision tree induction were previously able to handle solely Boolean data and could not handle other cases without data preprocessing. However, techniques based on other redescription mining approaches, for example the one presented by Galbrun et al. [18], can handle numerical and categorical data directly. In this thesis we extend redescription mining techniques which exploit decision tree induction to the non-Boolean setting, apply them to real-world data, test their ability to find planted redescriptions and compare them with existing redescription mining techniques. In particular, we work with two methods which both have decision tree induction as their basis. As a result we expect our algorithms to return interesting and informative redescriptions which are useful in a particular domain or can assist in the solution of an existing problem. Apart from this, these methods can be applied and tested in other domains as well, since redescription mining might be useful there too; bioinformatics, for example, is a good choice. As long as data sets have the required form, our approaches can be exploited in any domain. Very often, domain knowledge is essential to draw conclusions regarding the outcomes. For example, one possible domain is the biological niche-finding problem, where we look for rules which determine in detail the specific living conditions of species. It is comparatively easy for a layman to assess the quality of a redescription mined in such a domain: if we get a rule saying that the Polar Bear lives in places where the average January temperature is below 2 degrees Celsius, this statement is understandable even for a person without profound knowledge of biology. Nevertheless, the user might encounter more specific cases where background knowledge of the domain becomes crucial. Also, the configuration of parameters, which is very often a key to success in data mining, might require some consideration of the data and domain at hand. Redescriptions are aimed at bringing new, interesting insights into data; thus, it is crucial for the method to deliver not only intuitively expected rules, but also to reveal specific traits which assist in the niche-finding problem or any other one. We introduce two algorithms for redescription mining over non-binary data sets. Both of them involve decision tree induction.
In particular, we grow trees in opposite directions, gradually increasing their depth, so that they match in the end. As input, a data set (O, A, v) consisting of two matrices is used: one side contains binary attributes, the other is composed of real-valued attributes. Not all real-world data sets meet this requirement, so in Section 4.6 we discuss a way to overcome this restriction. The target vectors needed at each step are formed starting from the binary data set; afterwards they are formed based on the previous split result, so every iteration is adjusted based on the previous one. In the end we get pairs of decision trees, grown in parallel to match at their leaves, and queries are derived from the resulting trees for further analysis. As the accuracy of a redescription we use the Jaccard coefficient, chosen for its computational simplicity and its ability to provide a reasonable assessment of the similarity of the two queries that form a redescription. The statistical significance of the results is determined with the help of a p-value computation, since we want the results not only to be informative, but also to carry statistically meaningful information.

4.2 Algorithm 1

Algorithm 1 extends redescription mining based on decision tree induction to the non-Boolean world. As already mentioned, the starting point of an algorithm with an alternation scheme is an important aspect to be defined. The algorithm expects two arrays (e.g. matrices) as input data: the left matrix (L) is expected to contain Boolean data, the right matrix (R) contains numerical data. To initialize tree induction, the algorithm needs a data set with a target vector (the vector based on which the tree is built). A target vector consists of entries from the left matrix: namely, each column of the left-hand side is used as the target vector for one run of Algorithm 1. We initiate tree induction (CART) on the right data set and build a tree of depth 1; thus we obtain a short classifier which uses some attribute of the right-hand side matrix as a splitting rule. Further, we form a new target vector based on this first split: after dividing the data we get two child nodes with class labels corresponding to the majority class in each, in our case 0 and 1. Having that, we proceed to grow the second tree to match the first one: the new target vector formed from the right-hand side split is used to run the algorithm on the left side with depth 2. This process of forming new targets and building deeper trees continues until one of the stopping criteria is met.

Algorithm outline. Figure 4.1 represents the steps undertaken by the first algorithm. At the initialization stage, we take a target vector from the left (binary) matrix and perform a split of depth 1 on the right matrix, which possibly contains real-valued data. Figure 4.1 depicts trees (left and right) with maximal depth 2. Nodes are enumerated in the following manner: every parent node is marked n, its left child is enumerated 2n and its right child 2n + 1; this holds for both trees and both algorithms. The first frame (d = 1) depicts the initial split of both data arrays with depth 1, where the split of the right array is made with a target vector from the left (an arrow). Further, the algorithm forms a target vector based on the right split and proceeds to split the initial left matrix (with the newly modified target vector) with depth 2.
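One such alternation can be sketched in R with the rpart package, which implements CART; this is a minimal illustration under our own naming, not the thesis implementation itself:

    library(rpart)

    # One alternation of Algorithm 1 (sketch). L: data.frame of Boolean
    # attributes; R: data.frame of real-valued attributes; rows on both
    # sides are keyed by the same entity ids. `target` is the class
    # vector produced by the previous split.
    alternate <- function(L, R, target, depth, min_bucket) {
      ctrl <- rpart.control(maxdepth = depth, minbucket = min_bucket, cp = 0)
      t_l <- rpart(factor(target) ~ ., data = L, method = "class", control = ctrl)
      # The leaf classes of the left tree become the target for the right tree.
      t_r <- rpart(factor(predict(t_l, L, type = "class")) ~ .,
                   data = R, method = "class", control = ctrl)
      list(left = t_l, right = t_r, target = predict(t_r, R, type = "class"))
    }

    # Initialization: the first right tree of depth 1 is seeded directly by a
    # Boolean column of L, e.g.
    # rpart(factor(L[, 1]) ~ ., data = R, method = "class",
    #       control = rpart.control(maxdepth = 1, cp = 0))

Setting cp = 0 disables rpart's complexity-based pruning, so that only the depth and minbucket limits control the tree size.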
Thus, every time the tree is re-grown from scratch by the CART algorithm using the current target vector, which in turn is formed from the previous split result (i.e. class labels are assigned depending on the leaf nodes the entries fall into). New targets are formed and the depth is increased until termination. As a result we get a pair of trees: the left tree classifies the binary data, the right tree classifies the real-valued data. For instance, in the biological niche-finding problem one includes attributes from the animal data and the other consists of climatic data. At each terminal node the algorithm picks a splitting parameter and a splitting value (together called a splitting rule) which maximize the purity of the resulting nodes; the actual impurity function used for this does not play a crucial role for now. The splitting rules at the terminal nodes of both trees are further used to build redescriptions.

Figure 4.1: Tree-growing process in Algorithm 1

Algorithmic framework of Algorithm 1. Table 1 describes Algorithm 1's algorithmic framework in detail. Firstly, a data set suitable for CART induction is formed. Construct_tree creates a decision tree with the provided parameters using one part of the data set, either left or right: the target vector formed from the previous split result and the min_bucket parameter, which is responsible for the minimal size of tree nodes. Min_bucket is an important tunable parameter which controls the trade-off between redundancy and interpretability. We pay attention to this parameter since it prevents overfitting and helps terminate tree induction earlier, which makes the resulting queries less massive and more interpretable. In particular, in problems bound to biological niche finding, the user might be interested in nodes which include the majority of a particular species, because for a reasonable redescription we expect the majority of animals to share similar living conditions; if set too high, it might not give any insight for animals rare in Europe, such as the Polar Bear or the Moose. In other cases the min_bucket parameter is also crucial: it helps adjust CART to split the data set in such a way that every node contains at least the defined number of entities. The function construct_target_vector forms a vector from the result of the previous split, to be given to the next split of the data. In the end, a list of redescriptions is formed and each of them is evaluated by the Jaccard coefficient.

Algorithm 1: Algorithmic framework
Data: descriptor sets {Li}, {Ri}
Result: redescriptions Rd; Θ - Jaccard coefficients
Parameters: d - maximal depth of the trees; min_bucket - minimal number of entries in a node
Initialization: set answer set Rd = {}; set Jaccard coefficient set Θ = {}; set left matrix L = {Li}; set right matrix R = {Ri}
Alternations:
foreach column i in L do
    set paths_tl = {}; set paths_tr = {}
    set target vector tv = construct_target_vector(Li)
    set tree tr = construct_tree(R, tv, max_depth = 1, min_bucket)
    tv = construct_target_vector(tr)
    if all entries in tv are of the same class then
        Rd_i = NULL; Θ_i = NULL; flag = false
    else flag = true; depth = 2
    while flag do
        if depth ≤ d then
            tl = construct_tree(L, tv, depth, min_bucket)
            tv = construct_target_vector(tl)
            tr = construct_tree(R, tv, depth, min_bucket)
            tv = construct_target_vector(tr)
            if tl_current ≠ tl_previous && tr_current ≠ tr_previous then depth = depth + 1
            else flag = false
        else flag = false
    end
    Rd_i = paths_tl + '←→' + paths_tr
    Θ_i = Jaccard(tl, tr)
end
4.3 Algorithm 2

Configuration of Algorithm 2. Algorithm 2 is also based on alternating decision tree induction and starts with the same initialization. It has some specific features: instead of increasing the depth with every step and re-building the trees from scratch, Algorithm 2 keeps extending the same trees, making them deeper with every iteration. Every new depth of a tree is built on either the left or the right matrix using target vectors from the right or the left tree, respectively. This procedure continues until a stopping criterion is met. Thus, we again start with an initial target vector taken from the left-hand side data and build a decision tree of depth 1 using the CART algorithm, which picks a splitting rule maximizing the purity of the nodes. After this, a new target is formed based on the class label assignment from the first split, and the left-hand side data is then split with depth 1 using this target. Trees are grown in a level-wise fashion until a stopping criterion is met. As a result we get two tree structures (the left one classifies the binary matrix, the right one the real-valued matrix) where each new depth is built based on the previous split result from the other tree. The final trees are used to extract and evaluate redescriptions.

Algorithm 2 outline. Figure 4.2 depicts a sequence of frames representing the steps undertaken by the algorithm. Firstly, an initial split is performed with Target Vector 1 on the right matrix using the CART algorithm; its features (impurity measure, node size) are set based on preferences. Further, the algorithm proceeds to split the left matrix with Target Vector 2, which is formed from the previous split (on the right part). This process continues until no further split can be performed by the CART algorithm (i.e. no split results in purer nodes) or any other termination criterion is met. Hence, each branch of the trees receives a target vector formed after the split at the previous depth. As an outcome, we get two tree-like structures. In practice, these structures are parts of decision trees built with the CART algorithm (several decision trees of depth 1) that can be assembled into the final trees. At the end, we move to query extraction: each tree in a pair returns one query, and the extent of their correspondence is evaluated via the Jaccard coefficient, computed between two resulting vectors which are formed based on the final trees (Figure 4.3). Finally, the two queries mined from the final trees form a redescription, and all redescriptions mined from the given data set are analyzed further.

Figure 4.2: Tree-growing process in Algorithm 2

Algorithmic framework of Algorithm 2. Table 2 describes Algorithm 2's algorithmic framework in detail. As previously, we form a data set consisting of two views and construct decision trees in a step-wise manner; the maximal depth is a parameter defined by the user. After the whole data set is processed, the redescription set is formed and returned, and each of the redescriptions is evaluated for further interpretation.

Algorithm 2: Algorithmic framework
Data: descriptor sets {Li}, {Ri}
Result: redescriptions Rd; Θ - Jaccard coefficients
Parameters: min_bucket - minimal number of entries in a node; md - maximal depth
Initialization: set answer set Rd = {}; set Jaccard coefficient set Θ = {}; set left matrix L = {Li}; set right matrix R = {Ri}
Alternations:
foreach column i in L do
    set paths_tl = {}; set paths_tr = {}; set tl = {}; set tr = {}
    set target vector tv = construct_target_vector(Li)
    flag = true; count = 0
    tr = construct_tree(R, tv, md = 1, min_bucket)
    while (count ≤ depth(tl)) && (count ≤ depth(tr)) && (count ≤ md) do
        foreach leaf in tree tr do
            tv_leaf = construct_target_vector(tr, leaf)
            tl.add(construct_tree(L_leaf, tv_leaf, md = 1, min_bucket))
        end
        foreach leaf in tree tl do
            tv_leaf = construct_target_vector(tl, leaf)
            tr.add(construct_tree(R_leaf, tv_leaf, md = 1, min_bucket))
        end
        count = count + 1
    end
    Rd_i = paths_tl + '←→' + paths_tr
    Θ_i = Jaccard(tl, tr)
end
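The per-leaf, level-wise expansion can be sketched in R as follows (a minimal sketch assuming trees fitted with rpart; the helper name is ours). For every current leaf of the opposite tree, a new depth-1 stump is fitted on the rows that fall into that leaf:

    library(rpart)

    # data: one side of the data set; leaf_ids: the leaf each row falls
    # into in the opposite tree; target: class labels from that tree.
    grow_one_level <- function(data, leaf_ids, target, min_bucket) {
      ctrl <- rpart.control(maxdepth = 1, minbucket = min_bucket, cp = 0)
      stumps <- list()
      for (leaf in unique(leaf_ids)) {
        rows <- which(leaf_ids == leaf)
        # Skip leaves that are already pure or too small to split further.
        if (length(unique(target[rows])) < 2 || length(rows) < 2 * min_bucket) next
        y <- factor(target[rows])
        stumps[[as.character(leaf)]] <-
          rpart(y ~ ., data = data[rows, , drop = FALSE],
                method = "class", control = ctrl)
      }
      stumps  # one depth-1 stump per expandable leaf
    }

The returned stumps are then attached under their respective leaves, assembling the final tree level by level.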
4.4 Stopping Criterion

A very common issue when building decision trees is over-fitting: trees grow too large and tend to make the whole redescription mining process ineffective. In the end we might get huge trees, and the redescriptions derived from them will be massive and contain a big variety of variables; interpreting such long queries is difficult and undesirable. We adopted several measures to support early termination of the process.

Firstly, the user can determine the maximal depth of the resulting trees. This enables the algorithms to be tailored to many domains: the given flexibility allows building different trees, so the user can compare the results and discover a suitable depth parameter. In practice, however, users rarely need trees deeper than maximal depth = 3, since redescriptions derived from deeper trees are difficult to interpret.

Secondly, min_bucket is the parameter responsible for the minimal number of entries in a node. Limiting it is very useful: usually, the lower the minimal number of entries per node, the bigger the returned tree. The data set we work with should be taken into consideration when setting this parameter. Thus, we suggest using a small min_bucket at the beginning and gradually increasing it until the resulting redescriptions have an optimal size, depending on the data set and the problem being solved.

Moreover, a logical choice is to stop the tree building process when further splitting does not produce any changes. Thus, we check whether the split performed at the next depth has resulted in a reorganization of the data across the nodes compared with the previous result; if yes, we continue to split the data until no change occurs. Both our algorithms contain this check as a built-in feature. In practice, however, this stopping criterion sometimes produces quite deep trees (i.e. it does not prevent overfitting entirely).

The impurity measure used within the algorithms is not principal. Within our experiments with real-world data sets (discussed in Section 5.2) the Gini index and Information Gain were used; hence, the user can pick the one which is more suitable, or try all of them to select the most prolific. As an outcome we get the final trees, namely a set of tree pairs, which are used to derive redescriptions.
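The no-change check can be sketched as a comparison of the class assignments induced by two consecutive trees (a minimal sketch in R; whether class assignments or the full node partitions are compared is an implementation choice we leave open):

    library(rpart)

    # Two consecutive trees are considered equivalent if they assign
    # every entity to the same class.
    trees_unchanged <- function(fit_prev, fit_curr, data) {
      identical(predict(fit_prev, data, type = "class"),
                predict(fit_curr, data, type = "class"))
    }

    # The alternation loop terminates once trees_unchanged() holds for
    # both the left and the right tree.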
4.5 Extracting Redescriptions

We use the trees to mine one-dimensional rules and combine them with each other to form a redescription. Figure 4.3 exemplifies a pair of trees derived after the algorithms have run. To extract a redescription from it we use the splitting parameters (with their splitting values) and the Boolean operators AND and OR: to combine the paths of one of these trees into a query, we join the splitting rules within one path via the AND operator, and the paths are joined with each other via the OR operator. The labels corresponding to the 'yes' and 'no' assignments are also taken into account by flipping the sign. For instance, a query corresponding to the left-most tree would look like

(1 ≥ 0.5 ∧ 2 < 0.5) ∨ (1 < 0.5 ∧ 3 < 0.5), or, using negations, (1 ∧ ¬2) ∨ (¬1 ∧ ¬3).

Figure 4.3 also depicts an example of how we assess mined redescriptions with the Jaccard coefficient derived from two trees grown by either of the two presented algorithms. When one side holds Boolean data and the other real-valued data, after processing with Algorithm 1 or 2 we get nodes which belong to class '0' or '1'. The leaf nodes of the resulting trees are then grouped into two binary vectors (left and right) and the Jaccard coefficient is computed.

Figure 4.3: Redescription extraction and evaluation

The Jaccard coefficient equals 1 when we have a perfect match. In theory, we should be interested only in such redescriptions; in practice, however, redescriptions with a lower Jaccard coefficient are also of interest. In addition, the support of the two queries (i.e. E1,1, where both queries hold) is especially important, since we do not want a redescription which covers all entities: such a redescription would not provide any interesting insight. And vice versa: if the support is really low, the redescription holds for almost no entries of the data set.
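For trees fitted with rpart, the path-to-query construction and the Jaccard evaluation can be sketched as follows (a minimal sketch; the helper names are ours, and we assume the target factor has levels '0' and '1', so that the '1' class is level 2):

    library(rpart)

    # Join splitting rules within a path by AND, and paths by OR, over
    # all leaves predicted as class '1'.
    tree_to_query <- function(fit) {
      frame   <- fit$frame
      is_leaf <- frame$var == "<leaf>"
      leaves  <- as.numeric(row.names(frame))[is_leaf]
      pos     <- leaves[frame$yval[is_leaf] == 2]   # leaves of class '1'
      if (length(pos) == 0) return(NA_character_)
      paths <- path.rpart(fit, nodes = pos, print.it = FALSE)
      conjs <- sapply(paths, function(p) paste(p[-1], collapse = " AND "))
      paste0("(", conjs, ")", collapse = " OR ")
    }

    # Jaccard coefficient of the two binary vectors induced by the trees.
    jaccard <- function(l, r) sum(l & r) / sum(l | r)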
4.6 Extending to the Fully non-Boolean Setting

So far we have considered data sets which contain one binary and one real-valued matrix. However, this setting poses quite an imposing constraint when solving real-world redescription mining problems, because many domains produce real-valued data; thus, data discretization has to be performed. This issue has been studied in the context of Association Rule Discovery by Srikant and Agrawal in [50]. Their methods are based on a priori bucketing, but they are very specific to association rule discovery, which makes them inappropriate for redescription mining. Thus, we adopt a discretization routine for the real-valued side of our data set based on clustering. The binarized matrix is then used for the initialization of both our algorithms: having used each of its columns as a target vector for the very first split of the data set, we can further use the initial (pre-binarization) left-hand side matrix, because CART requires only the target vectors to be binary.

4.6.1 Data Discretization

It is possible to apply Algorithms 1 and 2 to fully non-Boolean data sets as well. To enable them to work with real-valued data on both sides, we apply a binarization routine to one of the sides. This routine can be considered a pre-processing step which prepares the data set to look exactly the way the algorithms expect it; it is applied to real-valued matrices before the algorithms are run. To implement it we use three of the available clustering techniques, though the list is not limited to those three. A good example of data that can be transformed from real-valued to binary is the DBLP data [2], which contains information about the computer science bibliography (more details in Section 5.4). Here the left matrix corresponds to conferences and the number of papers published there by each author; for example, author N has submitted 4 papers to the FOCS conference. The right-hand side matrix contains the same authors and the number of co-authorships between them; thus, it describes how often each author co-worked on a paper with another author. The left-hand side matrix can be transformed into a binary one with the help of the clustering techniques covered in Section 5.4, to allow the application of the elaborated redescription mining methods.

Regardless of the clustering method used, the binarization routine is conducted in the following steps:
1. Select the first column as the initial point;
2. Cluster this column using one of the available clustering techniques;
3. Split the column into several columns based on the clustering result (the initial attribute values are split into several intervals);
4. Assign the new values '0' or '1' to the attributes, according to the initial values;
5. Repeat the procedure until all columns of the initial matrix are split into several intervals and filled with '0' or '1'.

The algorithmic framework for data discretization is given in Table 3.

Algorithm 3: Algorithmic framework for data discretization
Data: real-valued descriptor set {L} of size i × j
Result: Boolean descriptor set {Lnew} of size i × (n · j)
Parameters: Cluster - clustering method, one of {DBSCAN, hclust, k-means}; parms - parameters of the selected clustering method; n - number of clusters; Range_cluster - range of values which fall into cluster n
Algorithm:
Set {Lnew} = {}
foreach column Lj in {L} do
    cluster Lj into n clusters with Cluster_parms(Lj) and split it into n columns according to Range_cluster
    foreach entry L_{i,j} do
        if value L_{i,j} ∈ Range_cluster then set Lnew_{i,j} = 1 else Lnew_{i,j} = 0
    end
end
Return {Lnew}

As a result we get a binarized matrix with an increased number of columns. This matrix can easily be used with both methods to find redescriptions. The parameters used within the clustering routine are mostly determined by the user and the data at hand; some traits and peculiarities are discussed in Section 5.4 on the real-world data experiments.
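A minimal sketch of this routine in R, with k-means as the plugged-in clustering technique (the helper names are ours):

    # Split one real-valued column into n indicator columns, one per cluster.
    binarize_column <- function(x, n_clusters) {
      cl <- kmeans(x, centers = n_clusters)$cluster
      sapply(sort(unique(cl)), function(k) as.integer(cl == k))
    }

    # Apply the routine column by column; the result has n times more columns.
    binarize_matrix <- function(M, n_clusters = 3) {
      do.call(cbind, lapply(seq_len(ncol(M)),
                            function(j) binarize_column(M[, j], n_clusters)))
    }

Swapping kmeans for cutree(hclust(dist(x)), k = n_clusters) or for a DBSCAN clustering yields the other two variants.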
4.7 Quality of Redescriptions

The quality of a redescription is a rather abstract term: it is a compound of several characteristics which we try to evaluate with objective criteria. For example, a good redescription can be one which is easy to interpret, has reasonable support and is statistically significant.

4.7.1 Support and Accuracy

One of the most defining features of a redescription is its support (E1,1). There are no strict bounds on the support which make a redescription good or bad; this depends on the data set we work with. Intuitively, we are not interested in redescriptions which are supported by either a single row or by almost all rows of the data set. It might be desirable to fix lower or upper bounds on the support cardinality of the queries, and possibly on that of the individual predicates involved, in each case individually. In our experiments we adopt the Jaccard measure [27] to assess the accuracy of mined redescriptions. It provides a nice balance between simplicity of computation and agreement with the symmetric approach: the weights in the Jaccard coefficient take the supports of both queries into account equally, the coefficient is scaled to the unit interval, and it does not involve entities that are not supported by either of the two queries.

The Jaccard coefficient is computed analogously for both methods (Algorithms 1 and 2). Two resulting vectors are formed based on the final structure of the decision trees: the entities which fall into the corresponding nodes are arranged into the resulting vectors, an l-vector for the left tree and an r-vector for the right tree. Since the rows on both sides of the data are keyed by id, it is convenient to compute the indices needed for the final assessment. This is depicted in Figure 4.3, where the green arrows indicate the matching paths in the trees; namely, they compound the redescription mined from that particular pair of trees. Then, based on the resulting vectors, we compute the following quantities to be plugged into the Jaccard similarity function:
1. E1,1 - the number of entries where both queries hold (i.e. paths leading to the '1' class assignment);
2. E1,0 - the number of entries where only the first (left) query holds;
3. E0,1 - the number of entries where only the second (right) query holds.

Depending on the purpose, the user can set the minimal Jaccard coefficient a redescription must reach to be relevant for further analysis. In many domains, queries with similarity lower than 1 are also desirable and of scientific interest. The quality of the queries involved in a redescription also determines its expressiveness and interestingness. For instance, long and nested expressions are hard to interpret and hence carry minor interest for data mining tasks. Nevertheless, very strong restrictions on the syntactic complexity of queries might severely limit the expressiveness. Thus, a balance is needed between these two partly conflicting characteristics, which are at the same time difficult to assess. The expressiveness of the query language and the interpretability of its individual elements are largely defined by the syntactic restrictions applied during the construction of the queries (rules). One way to keep queries interpretable is to limit their maximal length; in our algorithms we do so by bounding the maximal depth of the decision trees. We combine paths in the trees to form one side of a redescription and avoid negations by flipping the sign of the splitting rule whenever a node is connected to its child by a 'no' label on the tree's edge.

4.7.2 Assessing Significance

It is important to be able to determine how significant the mined redescriptions are; statistical significance is a crucial feature for assessing the quality of the returned results. The present-day concept of statistical significance, originating from R. Fisher [32], is widely used in statistical analysis, and we exploit it in our experiments as well. The simplest constraint applied to mined redescriptions is an accuracy threshold, leaving out the redescriptions which do not exceed it. Nevertheless, the statistical significance of the redescriptions is also important: the support of a redescription (ql, qr) should carry some information, given the supports of its queries. To measure this, we test against the null model in which the two queries are independent [21]. Statistical significance plays a vital role in statistical hypothesis testing, where it is used to determine whether a null hypothesis should be rejected or retained [39]. The intuition is as follows: a redescription should not be likely to arise at random from the underlying data distribution; that is, the accuracy of a redescription should not be readily deducible from the supports of its queries.
In particular, if both queries that form a redescription cover almost all objects, the overlap of their supports is necessarily large as well, so the high accuracy of such a redescription is naturally predictable. A p-value is computed to represent the probability that two random queries with marginal probabilities equal to those of ql and qr have an intersection of size equal to or greater than |supp(ql, qr)|. This probability follows a binomial distribution [58] and is given by

$pval(q_l, q_r) = \sum_{s=|supp(q_l, q_r)|}^{|E|} \binom{|E|}{s} p_R^s (1 - p_R)^{|E|-s}$, with $p_R = |supp(q_l)| \cdot |supp(q_r)| / |E|^2$.

This is the probability of obtaining a set of cardinality |E1,1| or greater if each element of a set of size |E| is selected with probability equal to the product of the marginals, according to the independence assumption. The authors of [18] used the same approach to evaluate the statistical significance of redescriptions. The higher the p-value, the more likely it is to encounter the same support with two independent queries; thus, the null hypothesis cannot be rejected and the redescription is considered less significant.
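Numerically, this binomial tail can be evaluated with R's pbinom; the sketch below (the function name is ours) follows the formula above:

    # supp_l, supp_r: support sizes of the two queries; e11: size of their
    # overlap; n: total number of entities |E|.
    redescription_pval <- function(supp_l, supp_r, e11, n) {
      p_r <- (supp_l * supp_r) / n^2                              # product of marginals
      pbinom(e11 - 1, size = n, prob = p_r, lower.tail = FALSE)   # P(X >= e11)
    }

    # Hypothetical illustration with made-up supports:
    redescription_pval(supp_l = 36, supp_r = 40, e11 = 36, n = 2575)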
This theoretical p-value computation relies on an assumption about the underlying data distribution, namely that all elements of the population are sampled with equal probability from a pre-defined distribution, and on the stronger assumption of fixed marginals. Real data sets can deviate from these assumptions, which makes such significance tests weaker. These questions are not central to our contribution, so we do not discuss them here in detail; instead we refer the reader to the relevant literature [35, 14].

Chapter 5
Experiments with Algorithms for Redescription Mining

5.1 Finding Planted Redescriptions

To assess the power of any elaborated method or technique it is essential to study its behavior on synthetic data, where we have complete control over the data format and parameters. Thus, we create synthetic data sets which imitate the real-world setting in order to assess our algorithms' ability to find previously planted redescriptions, which gives an insight into their performance. To implement this, it is necessary to make sure that the planted redescriptions consist of queries on both sides of the data set with a perfect correspondence, i.e. their Jaccard coefficient is 1. For Algorithm 1 the size of each matrix is set to 300 × 5, and two queries involving 3 parameters are planted in this pair so as to form an exact correspondence. For Algorithm 2 the size of each matrix is set to 300 × 10, since it builds several decision trees of depth 1 and every new depth is restricted from picking a splitting rule which has already been used; planting a redescription involving 6 variables is therefore vital for the algorithm to be able to reach the maximal allowed depth. Planting queries into such a massive data array, especially when using a randomized procedure to turn the right-hand side into real values, results in a noisy data set; however, to study the behavior and the ability of the algorithm to deal with noise, we can track the accuracy in the same manner as for Algorithm 1. In total we planted differently looking redescriptions with supports from 30 to 50 rows for both algorithms. Then random noise was added with densities ranging between 0.01 and 0.1.

Noise can be of two kinds: constructive (not interfering with the actual query) and destructive (damaging the queries). To generate a real-valued side of the data set we substitute the values in one matrix: each 0 is substituted by a value uniformly distributed on the interval [0, 0.25], and each 1 by a value on the interval [0.75, 1]. On data sets without noise, Algorithm 1 was able to find the planted redescriptions with the highest accuracy. With constructive noise applied, Algorithm 1 was able to find the planted redescription up to density 0.03; in the other cases it returned redescriptions which had better accuracy than the planted one in the 'noisy' matrices, which confirms the anticipated behavior of the algorithm.

Figure 5.1 compares the Jaccard coefficients for Algorithm 1 (a) and Algorithm 2 (b). The red line shows the Jaccard coefficients of the planted redescription in the matrices with noise (the x-axis gives the density of the applied noise, from 0.01 to 0.1), while the blue line represents the Jaccard coefficients of the redescriptions returned by the algorithms on the noisy data. It can be seen that the Jaccard coefficient of the planted redescription is lower than (or equal to) that of the mined redescription. This happens because the applied noise occasionally formed a better match, which was then naturally mined by the algorithms; the redescriptions found in the 'noisy' data possess greater accuracy than the planted ones, so the algorithms are not to be blamed.

Figure 5.1: Jaccard coefficients of planted and mined redescriptions on 'noisy' data. (a) Algorithm 1; (b) Algorithm 2.

Note that, due to the character of the input data for Algorithm 2, it was able to mine redescriptions with a Jaccard coefficient of 0.67; the reason is that the generated synthetic matrix with a query involving 6 variables naturally contains noise. Hence, more detailed tests are needed to overcome this issue. With destructive noise we destroyed the planted redescriptions gradually and were able to find them up to density 0.09. For example, when we planted a redescription of the form (with a support of 30 rows and Jaccard 1)

(x1 ≥ 0.5 ∧ x3 < 0.5) ←→ (x3 ≥ 0.7602 ∧ x2 < 0.4984),

with destructive noise of density 0.09 Algorithm 1 mined (support 26 and Jaccard 0.838)

(x1 ≥ 0.5 ∧ x3 < 0.5) ←→ (x3 < 0.7602 ∧ x3 ≥ 0.2132 ∧ x3 < 0.2137) ∨ (x3 ≥ 0.7602 ∧ x2 < 0.4984).

Note that CART is eligible to select the same splitting parameter several times within one decision tree, uncovering the context dependency of the effects of certain variables [3]. Thus, discovered redescriptions sometimes involve additional branches composed with the same splitting rule: the first disjunct of the right-hand side query above is formed by an additional branch of the tree induced by CART, where the variable x3 is picked several times, as allowed in the CART algorithm [3]. Yet, the planted redescription was mined accurately, taking the noise level into account. Such 'additional branches' arise in experimental runs and are caused solely by the peculiarities of CART. The tree building procedure in Algorithm 2 prevents CART from using the same splitting rule; thus it managed to mine the planted redescriptions precisely up to noise level 0.4 (depending on the particular run). For both algorithms, additional tests would be advantageous for a more profound assessment.
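The generation procedure described in this section can be sketched in R as follows (a minimal sketch; the thesis implementation may differ in details such as how the pattern itself is planted):

    set.seed(1)
    B <- matrix(rbinom(300 * 5, 1, 0.3), nrow = 300)  # stand-in Boolean side

    # Flip a fraction `density` of the entries (destructive or constructive,
    # depending on whether the planted pattern is hit).
    add_flip_noise <- function(B, density) {
      flips <- matrix(runif(length(B)) < density, nrow = nrow(B))
      B[flips] <- 1 - B[flips]
      B
    }

    # Turn a Boolean matrix into a real-valued one: 0 -> U[0, 0.25],
    # 1 -> U[0.75, 1].
    make_real_valued <- function(B) {
      R <- B
      R[B == 0] <- runif(sum(B == 0), 0, 0.25)
      R[B == 1] <- runif(sum(B == 1), 0.75, 1)
      R
    }

    R_noisy <- make_real_valued(add_flip_noise(B, density = 0.05))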
In Section 5.5 we provide a comparison with the ReReMi algorithm to give better insight into the performance of our contributions.

5.2 The Real-World Data Sets

In order to actually test and evaluate the devised methods in practical conditions it is important to apply them to real-world data. For this we use two data sets: Bio, for the biological niche-finding problem, and DBLP, for mining redescriptions from computer science bibliography data. Both our algorithms were initially implemented in R. In this section we exemplify the mined results using an available tool for interactive redescription mining called Siren [19]; its plotting capabilities provide some visual impression. As input we use two matrices, composed from publicly available sources: the Bio¹ data set uses data from the atlas of European mammals [40] and climatic data from [26]; the DBLP² data set is formed from [2], where the left matrix contains the conferences and the number of papers published there by each author, and the right one the authors and the frequency of their cooperation with each other. Table 5.1 describes the data sets used in the experiments with real-world data.

Table 5.1: Real-world data sets used in experiments

  Data set   Descriptions            Dimensions    Type
  Bio        Locations × Mammals     2575 × 194    Boolean
             Locations × Climate     2575 × 48     Real values
  DBLP       Authors × Conferences   2345 × 19     Integer
             Authors × Authors       2345 × 2345   Integer

¹ http://www.worldclim.org
² http://www.informatik.uni-trier.de/~ley/db/

5.3 Experiments With Algorithms on the Bio-climatic Data Set

Algorithm 1. Firstly, the experiments are run on the biological data from Table 5.1, called Bio. In biology, for species to survive, the terrain where they live should maintain certain bio-climatic constraints, which form that species' bioclimatic envelope (or niche) [23]. Finding these constraints with algorithms for redescription mining assists in determining bio-climatic envelopes. In the Bio data set the left side is represented by a matrix which contains locations in Europe and the mammals living there: there is a 1 if an animal is present in a particular area, and a 0 for the places where this animal does not live; thus, the left matrix contains only Boolean data. The right side (matrix R) consists of the same locations (keyed by IDs) and climatic data: in particular, we take into consideration the minimal, maximal and average temperature of each month and the average rainfall measurements (in millimeters per month). Algorithm 1 was run on the Bio data with different parameters (impurity measures and min_bucket) and returned a redescription set for each of them. Example redescriptions are shown in Tables 5.2 and 5.3; each has been composed from several redescriptions mined by Algorithm 1 with different parameters (indicated in the tables' headers). The resulting p-values make these redescriptions statistically significant at the highest level (99%); we did not encounter any redescription with a p-value higher than 0.0003 for any of the selected parameters on the Bio data set.

Table 5.2: Redescriptions mined by Algorithm 1 from the Bio data set (with the Gini impurity measure and min_bucket = 20).
LHS is the left-hand side query of the redescription; RHS is the right-hand side query; J is the Jaccard similarity; E1,1 is the support. t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.

  1. LHS: (Polar.bear ≥ 0.5)
     RHS: (t^max_Mar < -7.05)
     J = 0.947; E1,1 = 36
  2. LHS: (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5)
     RHS: (t^max_Aug < 14.65 ∧ t^min_Jul ≥ 6.35) ∨ (t^max_Aug ≥ 14.65)
     J = 0.979; E1,1 = 2379
  3. LHS: (Moose ≥ 0.5 ∧ Wood.mouse < 0.5)
     RHS: (t^max_Feb ≥ -1.15 ∧ t^max_Apr < 7.55 ∧ t^max_Jul ≥ 14.05) ∨ (t^max_Feb < -1.15 ∧ p^avg_Aug ≥ 58.85)
     J = 0.801; E1,1 = 449

Thus, we have a set of rules (redescriptions) which can be interpreted by analysts to find environmental envelopes for species. Analyzing Table 5.2 a little, we can see that the rule

(Polar.bear ≥ 0.5) ←→ (t^max_Mar < -7.05), or equivalently Polar.bear ←→ (t^max_Mar < -7.05),

implies that the Polar Bear lives in areas where the maximal temperature in March is lower than -7.05 degrees Celsius, with a support of 36 rows and a high Jaccard coefficient (above 0.9). This redescription outlines logical conditions for the Polar Bear's habitat (a cold climate). The decision tree pair which resulted in this redescription is depicted in Figure 5.2. The redescription and the trees look simple and interpretable in this particular case; yet the user may often encounter more complex cases (exemplified further), where visualization becomes more essential.

Figure 5.2: A pair of decision trees returned by the Algorithm

The second redescription from Table 5.2,

(¬Arctic.Fox ∧ ¬Polar.bear) ←→ (t^max_Aug < 14.65 ∧ t^min_Jul ≥ 6.35) ∨ (t^max_Aug ≥ 14.65),

can be formulated as follows: in places where neither the Arctic Fox nor the Polar Bear lives, the maximal temperature in August is below 14.65 and the minimal temperature in July is greater than or equal to 6.35, or the maximal temperature in August is greater than or equal to 14.65 degrees Celsius. This redescription rather describes living conditions which are not suitable for the Polar Bear and the Arctic Fox, since these mammals are negated in the left query. The corresponding pair of decision trees is depicted in Figure 5.3.

Figure 5.3: A pair of decision trees returned by the Algorithm

The final redescription from Table 5.2 is longer and has a more complex structure:

(Moose ∧ ¬Wood.mouse) ←→ (t^max_Feb ≥ -1.15 ∧ t^max_Apr < 7.55 ∧ t^max_Jul ≥ 14.05) ∨ (t^max_Feb < -1.15 ∧ p^avg_Aug ≥ 58.85).

It can be expressed as follows: the Moose lives in places without the Wood Mouse where the maximal temperature in February is above -1.15, in April below 7.55 and in July above 14.05 degrees, or in places where the maximal temperature in February is below -1.15 degrees and the average rainfall in August is greater than 58.85 millimeters. The two decision trees from which this redescription was formed are depicted in Figure 5.4.

Figure 5.4: A pair of decision trees returned by the Algorithm

In a similar manner all results can be interpreted. With the parameters indicated in Table 5.2, Algorithm 1 found 91 unique redescriptions, 55 of which have a Jaccard coefficient above 0.8. The redescriptions vary in support size: some of them cover only a small part of the data (below 200 rows), while others cover almost the whole data (above 2000 rows out of 2575). Yet all of them have high accuracy and are statistically significant.
We limited the maximal depth of the trees to 3, because longer redescriptions have a more nested structure and are harder to interpret. However, for many instances Algorithm 1 terminated earlier, since either there were no changes compared to the previous depth or the resulting leaf nodes were pure, consequently producing shorter redescriptions. Table 5.3 presents one more run of Algorithm 1 on the Bio data set: here we use Information Gain as the impurity measure and set min_bucket = 100, meaning that we force the underlying decision tree induction algorithm to perform splits in such a way that there are at least 100 entries in each node.

Table 5.3: Redescriptions mined by Algorithm 1 from the Bio data set (with the IG impurity measure and min_bucket = 100). LHS is the left-hand side query of the redescription; RHS is the right-hand side query; J is the Jaccard similarity; E1,1 is the support. t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.

  1. LHS: (Arctic.Fox < 0.5)
     RHS: (t^avg_Jun < 10.25 ∧ t^max_Sep ≥ 10.75) ∨ (t^avg_Jun ≥ 10.25)
     J = 0.965; E1,1 = 2347
  2. LHS: (Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5)
     RHS: (t^max_Oct < 10.85 ∧ t^max_Feb < -1.45 ∧ t^avg_Jul ≥ 10.65)
     J = 0.701; E1,1 = 353
  3. LHS: (European.Hamster ≥ 0.5)
     RHS: (p^avg_Oct < 45.15 ∧ p^avg_Jun ≥ 61.85 ∧ p^avg_Apr < 48.25)
     J = 0.483; E1,1 = 151

Let us discuss the redescriptions from this experimental run as well. The first redescription,

(¬Arctic.Fox) ←→ (t^avg_Jun < 10.25 ∧ t^max_Sep ≥ 10.75) ∨ (t^avg_Jun ≥ 10.25),

can be expressed as follows: the Arctic Fox does not live in places where the average June temperature is below 10.25 degrees and the maximal temperature in September is greater than 10.75, or where the average temperature in June is greater than 10.25 degrees Celsius. This rule also describes living conditions which are not suitable for a mammal. The information about conditions which do not allow a species to survive, combined with other redescriptions involving the same animal, can put all aspects of its preferences together and describe both suitable and inappropriate living conditions for a particular animal. Yet, in this particular case, the redescription covers almost the whole Bio data set, which diminishes its value. The decision trees built by Algorithm 1 which formed this redescription are shown in Figure 5.5.

Figure 5.5: A pair of decision trees returned by the Algorithm

Let us consider the second redescription from Table 5.3:

(Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5) ←→ (t^max_Oct < 10.85 ∧ t^max_Feb < -1.45 ∧ t^avg_Jul ≥ 10.65),

or equivalently

(¬Wood.mouse ∧ Mountain.Hare) ←→ (t^max_Oct < 10.85 ∧ t^max_Feb < -1.45 ∧ t^avg_Jul ≥ 10.65).

This redescription is formed from the pair of decision trees grown by Algorithm 1 depicted in Figure 5.6, and one can assess it even without a textual representation of the redescription.

Figure 5.6: A pair of decision trees returned by the Algorithm

This redescription can be expressed in natural language as follows: the Mountain Hare lives in places without the Wood Mouse, where the maximal temperature in October is below 10.85, the maximal temperature in February is lower than -1.45 and the average temperature in July is greater than or equal to 10.65 degrees Celsius.
The final redescription from Table 5.3,

(European.Hamster) ←→ (p^avg_Oct < 45.15 ∧ p^avg_Jun ≥ 61.85 ∧ p^avg_Apr < 48.25),

describes conditions for the European Hamster and can be formulated as: the European Hamster dwells in territories in Europe where the rainfall in October is lower than 45.15, in June greater than 61.85 and in April lower than 48.25 millimeters. The pair of decision trees for this particular case can be found in Figure 5.7.

Figure 5.7: A pair of decision trees returned by the Algorithm

This rule shows that for the European Hamster precipitation is more essential than temperature conditions, because at each new depth of the decision tree CART selected a rainfall attribute as the splitting rule maximizing node purity. Yet, only an expert can confirm the importance of rainfall measurements for the hamster. Moreover, in some instances the trees in a pair have different depths. This is not surprising, because Algorithm 1 builds each of them by increasing the depth with every iteration and comparing the result with the previous one; termination happens when there is no improvement compared to the split at the previous depth or the resulting nodes are pure. Such differently looking trees are assessed in the same manner as described in detail in Section 4.5. Using the parameters indicated in Table 5.3, Algorithm 1 found 44 unique redescriptions, 25 of them with Jaccard coefficients above 0.8. Similarly to the experiment with the parameters from Table 5.2, the redescriptions received different supports (from a few rows to almost the whole data) and had high accuracy. All redescriptions involve different parameters, yet they are easy to interpret, since they do not include very complex, nested structures. The full results can be found in Appendix A.

Plotting on a map. The biological data set provides one more flexibility option: the spatial coordinates associated with each location in Europe assist in the visualization of the derived results. Plotting on a map makes it easier to evaluate and interpret the outcomes, so whenever the user encounters difficulties in reading the mined redescriptions, plots of the resulting trees resolve this issue. Let us exemplify all the redescriptions discussed above from Tables 5.2 and 5.3. Figure 5.8 represents the three aforementioned redescriptions from Table 5.2 on a map: (a) shows the first redescription, (b) the second and (c) the third. In all plots, red indicates places where only the left-hand side query holds, blue places where only the right-hand side query holds, and purple areas depict places where both queries hold.

Figure 5.8: Example redescriptions from Table 5.2. (a) First; (b) second; (c) third.

The plots of the redescriptions from Table 5.3 can be found in Figure 5.9.

Figure 5.9: Example redescriptions from Table 5.3. (a) First; (b) second; (c) third.

Algorithm 2. We tested Algorithm 2 on the same data set (Bio) using the same parameters. Some example redescriptions for one run of Algorithm 2, using IG as the impurity measure and min_bucket = 50, are presented in Table 5.4; the full version of the results can be found in Appendix A.

Table 5.4: Redescriptions mined by Algorithm 2 from the Bio data set (with the IG impurity measure and min_bucket = 50).
LHS is the left-hand side query of the redescription; RHS is the right-hand side query; J is the Jaccard similarity; E1,1 is the support. t^min_X, t^max_X and t^avg_X stand for the minimum, maximum and average temperature of month X in degrees Celsius, and p^avg_X stands for the average precipitation of month X in millimeters.

  1. LHS: (Mediterranean.Water.Shrew ≥ 0.5 ∧ Alpine.Shrew ≥ 0.5) ∨ (Mediterranean.Water.Shrew < 0.5 ∧ Moose < 0.5 ∧ Arctic.Fox < 0.5)
     RHS: (p^avg_May ≥ 58.65 ∧ p^avg_Jun ≥ 86.85) ∨ (p^avg_May < 58.65 ∧ t^max_Nov ≥ 6.85 ∧ t^max_Sep ≥ 10.75)
     J = 0.912; E1,1 = 1406
  2. LHS: (Kuhl.s.Pipistrelle ≥ 0.5 ∧ Alpine.marmot < 0.5) ∨ (Kuhl.s.Pipistrelle < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5)
     RHS: (t^max_Mar ≥ 11.05 ∧ t^min_Feb ≥ -3.95) ∨ (t^max_Mar < 11.05 ∧ t^avg_Mar ≥ 6.375 ∧ t^max_Jan ≥ 3.55)
     J = 0.802; E1,1 = 759
  3. LHS: (Brown.Bear ≥ 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (Brown.Bear < 0.5 ∧ Wood.mouse < 0.5 ∧ Moose ≥ 0.5)
     RHS: (t^min_Jan < -8.25 ∧ t^max_Sep < 17.35) ∨ (t^min_Jan ≥ -8.25 ∧ t^max_Feb < -1.15 ∧ t^max_Mar < 5.55)
     J = 0.762; E1,1 = 492

These redescriptions constitute a good example of why visualization options become crucial. Consider the first redescription:

(Mediterranean.Water.Shrew ∧ Alpine.Shrew) ∨ (¬Mediterranean.Water.Shrew ∧ ¬Moose ∧ ¬Arctic.Fox) ←→ (p^avg_May ≥ 58.65 ∧ p^avg_Jun ≥ 86.85) ∨ (p^avg_May < 58.65 ∧ t^max_Nov ≥ 6.85 ∧ t^max_Sep ≥ 10.75).

This rule implies that either both the Mediterranean Water Shrew and the Alpine Shrew, or neither the Mediterranean Water Shrew, nor the Moose, nor the Arctic Fox, live in areas where either it rains more than 58.65 millimeters in May and more than 86.85 millimeters in June, or it rains less than 58.65 millimeters in May and the maximal temperature is above 6.85 in November and above 10.75 degrees Celsius in September. The remaining redescriptions can be expressed in a similar manner. The pair of resulting decision trees for this redescription from Table 5.4 is shown in Figure 5.10.

Figure 5.10: A pair of decision trees returned by the Algorithm

The second redescription from Table 5.4,

(Kuhl.s.Pipistrelle ∧ ¬Alpine.marmot) ∨ (¬Kuhl.s.Pipistrelle ∧ ¬Common.Shrew ∧ House.mouse) ←→ (t^max_Mar ≥ 11.05 ∧ t^min_Feb ≥ -3.95) ∨ (t^max_Mar < 11.05 ∧ t^avg_Mar ≥ 6.375 ∧ t^max_Jan ≥ 3.55),

was formed from the pair of trees depicted in Figure 5.11.

Figure 5.11: A pair of decision trees returned by the Algorithm

The final redescription from Table 5.4,

(Brown.Bear ∧ Mountain.Hare) ∨ (¬Brown.Bear ∧ ¬Wood.mouse ∧ Moose) ←→ (t^min_Jan < -8.25 ∧ t^max_Sep < 17.35) ∨ (t^min_Jan ≥ -8.25 ∧ t^max_Feb < -1.15 ∧ t^max_Mar < 5.55),

was formed from the pair of decision trees depicted in Figure 5.12.

Figure 5.12: A pair of decision trees returned by the Algorithm

With the parameters indicated in Table 5.4, Algorithm 2 returned 17 unique redescriptions, all of them statistically significant. The support of all of them lies in the range [300, 1600] rows, which is acceptable for the Bio data set. Half of the redescriptions have the highest accuracy (above 0.8), and even the others have Jaccard coefficients above 0.5.

Plotting on a map. Plotting is available for the second algorithm as well. Figure 5.13 illustrates all redescriptions from Table 5.4 on a map of Europe; representation on a map makes it easier to evaluate the quality of a redescription.
As previously, red corresponds to the left query, blue to the right one, and their overlap is shown in purple on the map.

Figure 5.13: Support of redescriptions from the Bio data set. (a) First redescription; (b) second redescription; (c) third redescription from Table 5.4.

Maps also help to examine the areas where the animals actually live. The overlap indicates the places in Europe where the whole redescription is true, i.e. the animals from the left query live there (or do not live there, if they are negated in the left query) and the climatic conditions from the right query hold. The size of the overlap is in fact the support of the redescription (i.e. E1,1), which becomes a defining feature when assessing the quality of results on real-world data sets.

5.3.1 Discussion

While running the experiments with the Bio data set we used two impurity measures: the Gini index and Information Gain (their default implementations from R's package rpart [52]); yet any other measure can be plugged into both algorithms.
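In rpart the two measures are selected via the split element of the parms argument; a configuration sketch (with toy stand-in data in place of the Bio matrices) looks as follows:

    library(rpart)

    climate <- data.frame(t_jan = rnorm(100), p_aug = rnorm(100))  # stand-in
    target  <- factor(rbinom(100, 1, 0.5))                         # stand-in

    ctrl <- rpart.control(maxdepth = 3, minbucket = 20, cp = 0)
    fit_gini <- rpart(target ~ ., data = climate, method = "class",
                      parms = list(split = "gini"), control = ctrl)
    fit_ig   <- rpart(target ~ ., data = climate, method = "class",
                      parms = list(split = "information"), control = ctrl)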
Note that only Algorithm 1 managed to mine redescriptions with perfect accuracy (Jaccard exactly 1) in some runs (for instance, with Gini and IG, min bucket = Σ Li / 100), but they have low support (below 30 rows), which makes them less informative. Algorithm 2 did not return redescriptions with a Jaccard coefficient of exactly 1.

• The support of a redescription is an important parameter which determines the extent of its interestingness. If we apply a threshold of 100 rows for a redescription to be regarded as interesting, and compare only those redescriptions with E1,1 > 100, Algorithm 1 mines up to 75% of redescriptions with accuracy above 0.8 and support of at least 100, while for Algorithm 2 this share is not greater than 48% in any run.

• If we look at the Top-20 redescriptions (per Jaccard) that have supp > 100 and p-value < 0.01, the redescriptions from Algorithm 1 in most cases cover more than 1700 rows, while the support sizes in Algorithm 2 are more diverse (from 150 up to 2000). Hence Algorithm 1 returns rules which hold for the majority (or almost all) of the rows of the data set, i.e. redescriptions which describe conditions that are true all over Europe. Its queries are also shorter, involving fewer attributes, because decision tree induction very often terminates before reaching the maximal allowed depth of the trees (we were using d = 3). In Algorithm 2 more parameters are involved and the structure of the queries is more nested (the trees mostly terminate upon reaching the maximal depth); these redescriptions reveal more specific details concerning the fauna and climate peculiarities of Europe.

• One more aspect to compare is the overlap of the redescriptions (i.e. queries involving the same attributes, making some redescriptions similar to each other). Algorithm 1 tends to produce more overlapping redescriptions (approx. 65%), because CART selects the same splitting rules from the whole data set over and over again, regardless of the initialization point used. For Algorithm 2, overlapping redescriptions occur less frequently (approx. 50%), because every depth and branch of the tree is built independently, using the corresponding part of the data set. These percentages vary slightly depending on the parameters used within each run, but the global tendency holds for all experimental runs. Hence, if we sort the redescriptions by Jaccard (from highest to lowest) and discard redescriptions which involve identical animals, the accuracy of the remaining redescriptions is similarly high for both algorithms (around 0.8 on average), and the support of the redescriptions from Algorithm 1 is about 10% greater (depending on the parameters used) than for Algorithm 2.

• Using the Gini impurity measure on the Bio data set (with otherwise equal conditions) in Algorithm 1 resulted in slightly deeper trees and, consequently, longer redescriptions. For example, Algorithm 1 with min bucket = 20 returned 91 redescriptions with Gini (average query length 5.51 variables) and 71 unique redescriptions with IG (average query length 4.87 variables). For Algorithm 2, however, Information Gain returned slightly longer redescriptions than the ones mined with the Gini index: with min bucket = 20, Algorithm 2 mined 67 redescriptions with Gini (average query length 7.06 variables) and 37 unique redescriptions with IG (average query length 7.75 variables).
For both Algorithms, using Information Gain tends to produce a higher percentage of repeating redescriptions in each experimental run and fewer unique redescriptions in total compared to the Gini index.

All in all, both algorithms return reasonable redescriptions with high accuracy for the Bio data set. Testing them with equal parameters shows that Algorithm 1 found a greater number of redescriptions, which are shorter than the ones mined by Algorithm 2 and more similar to each other; for example, the Moose, the House Mouse and the Stoat participate in many of them. This is caused by the fact that the CART algorithm tends to pick the splitting rules which maximize the purity of the resulting nodes, and these splitting rules often coincide for different initialization points, since they provide the greatest contribution to node purity.

Redescriptions mined by Algorithm 2 contain more varied variables in both the left and the right queries. This is due to the fact that we build each layer of the decision tree independently, using the corresponding part of the target vector. Moreover, each branch of a decision tree in Algorithm 2 also grows independently until a stopping criterion is met. Essentially, we induce several decision trees of depth 1, each using the corresponding part of either the left or the right matrix. This explains the inclination of Algorithm 2 to produce deeper trees: whenever a node contains entities of both classes ('1' and '0'), CART is able to split them into two leaf nodes.

Both elaborated algorithms found interesting rules when applied to the Bio data set. All of them were statistically significant and had varying support (from a few rows to almost the whole data set). Thus, they can be used for the problem of finding bio-climatic envelopes for species. Nevertheless, some resulting redescriptions with high support, despite being accurate, might pose little interest to biologists: they combine complex climatic queries with several, possibly unrelated, species. Using the p-value to check the significance of the results mitigates this issue but does not resolve it completely.

5.4 Experiments With Algorithms on Conference Data Set

Much real-world data cannot be represented as a data set consisting of two matrices of which one is binary. That is why a binarization routine is essential to enable the application of our algorithms in such cases. For the left-hand side of the DBLP data set we tried three different clustering techniques plugged into the discretization procedure described in Subsection 4.6.1. In total we used three clustering approaches, but this set can be extended with other ones if needed. Generally, when analyzing bibliography data, the mined redescriptions shed light on the communities of researchers that contribute most to a field.

Density-based spatial clustering (DBSCAN). This clustering technique does not require the number of clusters to be specified: it automatically detects the necessary number of clusters based on the notion of density reachability [15]. A cluster, which is a subset of the points of the database, should satisfy two properties: all points within it are mutually density-connected, and if a point is density-reachable from any point of the cluster, it belongs to the cluster too. The algorithm itself requires two parameters from the user: the minimal number of points in a cluster and a distance [4].
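To make the binarization routine concrete, the following sketch shows how the per-conference clustering step could look in R (the language of our implementation, cf. rpart [52]), using the dbscan function from the fpc package. The data frame dblp, the eps and min_pts values, and the helper name binarize_column are illustrative assumptions, not the exact code of our experiments.

    library(fpc)  # provides dbscan(), an implementation of [15]

    # Discretize one numeric column (papers per author at one conference)
    # into binary indicator columns, one per detected cluster (interval).
    binarize_column <- function(x, eps = 1.5, min_pts = 5) {
      cl <- dbscan(matrix(x, ncol = 1), eps = eps, MinPts = min_pts)$cluster
      # cluster 0 collects DBSCAN's noise points; keep it as its own column
      sapply(sort(unique(cl)), function(k) as.integer(cl == k))
    }

    # Apply the routine to every conference column of the left-hand side;
    # kmeans() or hclust() with cutree() could be plugged in here instead.
    lhs_binary <- do.call(cbind, lapply(dblp, binarize_column))

Each resulting 0/1 column of lhs_binary then serves as one initial target vector for the alternating tree induction.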
We applied DBSCAN to our DBLP data set, where initially we had 19 conferences in the left matrix, and used clustering to split each of them into intervals, transforming the data into a binary matrix. These intervals represent the number of papers each author published at a particular conference. The columns of the discretized matrix are used as initial target vectors for both Algorithms.

In this particular case we should take the characteristics of the data set into account. Namely, most authors submitted fewer than 7-10 papers to a conference, while there are rare instances of an author submitting more than 15 papers. Thus, the first clusters are quite dense and the last ones are mainly sparse. We picked the distance and number of points for DBSCAN so as to obtain a segregation of each conference into 5-10 clusters. Every new data set may require a different set of initial parameters to perform well; in data mining, results very often depend heavily on parameter selection.

DBSCAN with Algorithm 1. Table 5.5 shows several redescriptions mined by Algorithm 1 after the aforementioned discretization of the left-hand side matrix.

Table 5.5: Redescriptions mined by Algorithm 1 from the DBLP data set (with DBSCAN binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand-side query of the redescription; RHS is the right-hand-side query; J is the Jaccard similarity; E1,1 is the support.

LHS | RHS | J | E1,1
ECML ≥ 2.5 ∧ UAI ≥ 1.5 | PeterGrunwald ≥ 0.5 | 0.571 | 4
ICDE ≥ 12.5 ∧ EDBT < 3.5 | AnthonyK.H.Tung ≥ 1.5 ∧ JeffreyXuYu ≥ 0.5 | 0.500 | 5
STOC ≥ 8.5 ∧ SODA < 5.5 | AviWigderson ≥ 0.5 ∧ SilvioMicali ≥ 0.5 | 0.133 | 10

With these parameters Algorithm 1 mined 15 unique redescriptions with high accuracy. The majority of them cover either fewer than 10 rows or almost the whole data set, which makes them quite obvious or expected, because they are supported by an insufficient number of rows. The complete list of results can be found in Appendix B.

The first redescription from Table 5.5, ECML ≥ 2.5 ∧ UAI ≥ 1.5 ←→ PeterGrunwald ≥ 0.5, implies that if an author has published at least 3 papers at ECML and at least 2 papers at UAI, he or she has likely co-authored with Peter Grunwald at least once. This redescription holds for only 4 rows of the DBLP data set, which makes it less informative. Formally, there are no strict bounds on the minimal or maximal support that makes a redescription interesting.

The following redescription from Table 5.5, ICDE ≥ 12.5 ∧ EDBT < 3.5 ←→ AnthonyK.H.Tung ≥ 1.5 ∧ JeffreyXuYu ≥ 0.5, claims that if you published at least 13 papers at ICDE and from 0 to 3 papers at EDBT, you have probably co-authored twice (or more) with Anthony K. H. Tung and at least once with Jeffrey Xu Yu. This redescription has support 5, which can also be considered low. But since not many people submit more than a dozen papers to a single conference, the support in this case can be considered acceptable to regard the redescription as informative.

Consider the last redescription from this table: STOC ≥ 8.5 ∧ SODA < 5.5 ←→ AviWigderson ≥ 0.5 ∧ SilvioMicali ≥ 0.5. In natural language: if you have 9 or more papers accepted at STOC and 5 or fewer papers at SODA, you have co-authored at least once with Avi Wigderson and at least once with Silvio Micali. All other rules can be interpreted in an analogous way.
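Since the Jaccard similarity J and the support E1,1 are the two quantities reported for every redescription, their computation can be illustrated with a small R sketch; the vectors supp_l and supp_r and the data frames dblp and authors are assumed names used only for illustration.

    # supp_l, supp_r: logical vectors over the entities (rows), TRUE where
    # the left / right query of the redescription holds
    jaccard <- function(supp_l, supp_r) {
      sum(supp_l & supp_r) / sum(supp_l | supp_r)  # E1,1 / |union|
    }

    # Example for the last redescription of Table 5.5:
    supp_l <- dblp$STOC >= 8.5 & dblp$SODA < 5.5
    supp_r <- authors$AviWigderson >= 0.5 & authors$SilvioMicali >= 0.5
    sum(supp_l & supp_r)      # support E1,1 (10 in Table 5.5)
    jaccard(supp_l, supp_r)   # accuracy J (0.133 in Table 5.5)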
The decision tree pair for this redescription is depicted in Figure 5.14. We provide a decision tree illustration only for this redescription from the DBLP data set; the others can be plotted analogously.

DBSCAN with Algorithm 2. Similarly, we ran experiments on the same DBLP data set using Algorithm 2. Using the Information Gain impurity measure and min bucket = Σ Li / 100, Algorithm 2 found 110 unique redescriptions with diverse support sizes (from a few rows to almost the whole data set), 15 of which have p-value > 0.01, which makes them statistically insignificant. 31 of all the redescriptions have a Jaccard coefficient higher than 0.8. Unlike with the first algorithm, here we obtained lower Jaccard similarity but greater support for each redescription. Some of the results are listed in Table 5.6, while the full report, including p-values for each redescription, can be found in Appendix B.

Figure 5.14: A pair of decision trees returned by Algorithm 1

Table 5.6: Redescriptions mined by Algorithm 2 from the DBLP data set (with DBSCAN binarization routine; IG impurity measure; min bucket = Σ Li / 100). LHS is the left-hand-side query of the redescription; RHS is the right-hand-side query; J is the Jaccard similarity; E1,1 is the support.

LHS | RHS | J | E1,1
(STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5) ∨ (STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5) | (TomasFeder < 0.5 ∧ AviWigderson ≥ 0.5 ∧ CatrielBeeri < 0.5) ∨ (TomasFeder ≥ 0.5 ∧ AmosFiat ≥ 0.5 ∧ SergeA.Plotkin < 1.5) | 0.919 | 711
(VLDB ≥ 1.5) ∨ (VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5) | (RakeshAgrawal ≥ 0.5) ∨ (RakeshAgrawal < 0.5 ∧ HamidPirahesh ≥ 0.5 ∧ JiaweiHan < 0.5) | 0.809 | 689
COLT ≥ 3.5 | ManfredK.Warmuth ≥ 0.5 | 0.226 | 19

The first redescription from Table 5.6, (STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5) ∨ (STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5) ←→ (TomasFeder < 0.5 ∧ AviWigderson ≥ 0.5 ∧ CatrielBeeri < 0.5) ∨ (TomasFeder ≥ 0.5 ∧ AmosFiat ≥ 0.5 ∧ SergeA.Plotkin < 1.5), states that if you have not published any paper at either STOC or SIGMOD but have at least one publication at FOCS, or you have at least one paper at both STOC and SODA but none at ICDE, then you have likely co-authored with neither Tomas Feder nor Catriel Beeri but have collaborated with Avi Wigderson at least once; or you have collaborated with Tomas Feder and Amos Fiat at least once each and have worked with Serge A. Plotkin at most once. The support of this redescription is in an acceptable range to claim it is informative and accurate enough. The p-value of this redescription is zero, which makes the result statistically significant as well.

The second redescription from Table 5.6, (VLDB ≥ 1.5) ∨ (VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5) ←→ (RakeshAgrawal ≥ 0.5) ∨ (RakeshAgrawal < 0.5 ∧ HamidPirahesh ≥ 0.5 ∧ JiaweiHan < 0.5), claims that if you have published at least 2 papers at VLDB, or from 0 to 1 papers at VLDB and at least 1 paper at SIGMOD but none at ICDM, then you have probably co-authored with Rakesh Agrawal at least once, or you have co-authored with neither him nor Jiawei Han but at the same time have at least one publication with Hamid Pirahesh.

And the final redescription from Table 5.6, COLT ≥ 3.5 ←→ ManfredK.Warmuth ≥ 0.5, states that if you have published 4 or more papers at COLT, then you have co-authored with Manfred K. Warmuth once or more.
Note that this redescription has quite low accuracy (0.226), which makes it less interesting regardless of its acceptable level of support.

Thus, Algorithm 2 mined more interesting and diverse redescriptions which are still statistically significant. They have longer structures and involve a greater number of variables compared to Algorithm 1. Whenever the length of the resulting redescriptions becomes greater than desired, the user may limit the maximal depth of the trees (we were using max depth = 3 so far).

Using DBSCAN is advantageous due to its ability to detect the necessary number of clusters automatically: the user has no need to specify it, which makes the whole process easier.

k-means. As one more option to test our algorithms and compare results, we adopted the k-means clustering technique for the discretization of the data set. Unlike DBSCAN, k-means clustering [37] requires the user to indicate the desired number of clusters. This poses an issue of its own, which is vigorously discussed in the scientific literature [30, 53, 22]. The correct choice of the number of clusters is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in the data set and the clustering resolution desired by the user. Increasing the number of clusters without penalty will always reduce the error of the resulting clustering, to the extreme case of zero error when each data point is its own cluster (i.e. when there are as many clusters as data points). Intuitively, the optimal number of clusters is a balance between these extremes.

When working with the DBLP data set, we partitioned each conference into 5 clusters. This choice is based on prior knowledge about the data: partitioning into a smaller number of clusters results in highly dense clusters which represent from 0 to 7 submitted papers per conference, while partitioning into more than 10 clusters would turn some single data points into separate clusters and impose an unwanted computational burden. Some redescriptions returned by Algorithm 1 are listed in Table 5.7.

Table 5.7: Redescriptions mined by Algorithm 1 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = 5). LHS is the left-hand-side query of the redescription; RHS is the right-hand-side query; J is the Jaccard similarity; E1,1 is the support.

LHS | RHS | J | E1,1
UAI ≥ 2.5 ∧ KDD ≥ 2.5 | TomiSilander ≥ 0.5 | 0.500 | 4
VLDB ≥ 18.5 ∧ SIGMODConference < 26.5 | ShaulDar ≥ 0.5 | 0.357 | 5
STOC ≥ 8.5 ∧ SODA < 5.5 | AviWigderson ≥ 0.5 ∧ SilvioMicali ≥ 0.5 | 0.113 | 10

With these parameters Algorithm 1 mined 8 unique redescriptions (2 of them with p-value > 0.1). The other, statistically significant, redescriptions have support around 10 rows, leading to the conclusion that with k-means clustering used within the discretization routine, Algorithm 1 returns more intuitively expected rules (the full set of outcomes can be found in Appendix B). Algorithm 2 was also applied to this data set; some resulting redescriptions are listed in Table 5.8, while the full set is presented in Appendix B.
Table 5.8: Redescriptions mined by Algorithm 2 from the DBLP data set (with k-means (5 clusters) binarization routine; Gini impurity measure; min bucket = Σ Li / 100). LHS is the left-hand-side query of the redescription; RHS is the right-hand-side query; J is the Jaccard similarity; E1,1 is the support.

LHS | RHS | J | E1,1
(ICDE ≥ 12.5 ∧ EDBT ≥ 2) ∨ (ICDE < 12.5 ∧ SIGMODConference ≥ 19.5 ∧ WWW ≥ 0.5) | (FlipKorn ≥ 0.5 ∧ KrithiRamamritham < 4.5) ∨ (FlipKorn < 0.5 ∧ SudarshanS.Chawathe ≥ 3 ∧ MayankBawa ≥ 0.5) | 0.833 | 15
(SODA ≥ 17.5) ∨ (SODA < 17.5 ∧ FOCS ≥ 10.5 ∧ STOC ≥ 9.5) | (RichardCole ≥ 0.5) ∨ (RichardCole < 0.5 ∧ LaszloLovasz ≥ 0.5 ∧ JurisHartmanis < 0.5) | 0.534 | 43
(STOC ≥ 8.5) ∨ (STOC < 8.5 ∧ FOCS ≥ 8.5 ∧ SODA < 1.5) | (AviWigderson ≥ 0.5) ∨ (AviWigderson < 0.5 ∧ SalilP.Vadhan ≥ 4.5 ∧ ShafiGoldwasser ≥ 0.5) | 0.337 | 33

With these parameters Algorithm 2 returned 70 unique redescriptions, 30 of which have a Jaccard coefficient above 0.8, with supports ranging from several rows up to covering the whole data set. Comparing the performance of the two algorithms on this data set shows, as before, that Algorithm 1 produces simpler, more intuitive rules with lower support (around 10 rows) and high Jaccard similarity between the queries, while Algorithm 2 returns longer, more detailed redescriptions with higher support but lower Jaccard similarity.

Hierarchical clustering. To add diversity, we tested one more clustering technique for discretizing the DBLP data set. Hierarchical clustering [29, 25], similarly to k-means, does not detect the number of clusters automatically. We tried splitting each conference into 5 clusters and ran the presented redescription mining algorithms to examine the performance. Some resulting redescriptions for Algorithm 1 are presented in Table 5.9; full results can be found in Appendix B.

Table 5.9: Redescriptions mined by Algorithm 1 from the DBLP data set (with hierarchical (5 clusters) binarization routine; IG impurity measure). LHS is the left-hand-side query of the redescription; RHS is the right-hand-side query; J is the Jaccard similarity; E1,1 is the support.

LHS | RHS | J | E1,1
ICDT ≥ 4.5 ∧ VLDB ≥ 0.5 ∧ ICDE ≥ 2.5 | (GostaGrahne < 2.5 ∧ KotagiriRamamohanarao ≥ 13) ∨ (GostaGrahne ≥ 2.5 ∧ JigneshM.Patel < 0.5) | 1 | 3
WWW ≥ 4.5 ∧ ICDM ≥ 3.5 | BenyuZhang ≥ 12 | 1 | 2
ECML ≥ 2.5 ∧ UAI ≥ 1.5 | (PeterGrunwald < 1.5 ∧ StephenD.Bay ≥ 3.5) ∨ (PeterGrunwald ≥ 1.5) | 1 | 5

With the parameters indicated in Table 5.9, Algorithm 1 returned 15 unique redescriptions with high Jaccard coefficients and low supports (below 10 rows); Algorithm 2 returned 76 unique redescriptions. The latter share features analogous to those returned by Algorithm 2 before (i.e. using DBSCAN and k-means). Table 5.10 contains some examples, while the full report can be found in Appendix B.

Table 5.10: Redescriptions mined by Algorithm 2 from the DBLP data set (with hierarchical (5 clusters) binarization routine; IG impurity measure). LHS is the left-hand-side query of the redescription; RHS is the right-hand-side query; J is the Jaccard similarity; E1,1 is the support.
LHS | RHS | J | E1,1
(SDM ≥ 1.5) ∨ (SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5) | (PhilipS.Yu ≥ 4.5) ∨ (PhilipS.Yu < 4.5 ∧ VipinKumar ≥ 0.5 ∧ SunilPrabhakar < 1.5) | 0.621 | 125
(PODS ≥ 2.5) ∨ (PODS < 2.5 ∧ ICDT ≥ 0.5 ∧ STACS ≥ 0.5) | (CatrielBeeri ≥ 0.5) ∨ (CatrielBeeri < 0.5 ∧ LeonidLibkin ≥ 0.5 ∧ ThomasSchwentick ≥ 0.5) | 0.342 | 13
(SODA ≥ 5.5) ∨ (SODA < 5.5 ∧ FOCS ≥ 0.5 ∧ STOC ≥ 8.5) | (MosesCharikar ≥ 0.5) ∨ (MosesCharikar < 0.5 ∧ AviWigderson ≥ 0.5 ∧ MoniNaor ≥ 0.5) | 0.245 | 39

5.4.1 Discussion

When running the two algorithms on the DBLP data set, regardless of the binarization procedure, Algorithm 2 tends to find a considerably greater number of redescriptions, which are at the same time more complex in structure and longer than the ones found by Algorithm 1. The first algorithm, however, finds more intuitive, or obvious, redescriptions which are shorter and have a Jaccard coefficient of 1 or very close to it. The majority of them are supported either by fewer than 10 or by more than 2000 rows. This leads to the conclusion that on DBLP data Algorithm 1 tends to select obvious rules which hold for only a few rows of the data set. This issue can be adjusted somewhat with the min bucket parameter: whenever we increase it, we tend to mine redescriptions with higher supports. This effect can be exploited on other data sets as well.

On this data set, Algorithm 2 tends to find more interesting results, supported by more than 10 rows but by few enough not to cover the whole data, which makes the results more informative. However, redescriptions which carry almost no useful information appear here as well; as before, increasing the min bucket parameter can fix this. If we investigate the performance of both algorithms on the DBLP data set in detail, we can see the following:

• When using DBSCAN, Algorithm 1 tends to mine considerably fewer redescriptions than Algorithm 2, and the accuracy of the results varies as well. Namely, Algorithm 1 returns redescriptions with perfect Jaccard similarity (i.e. exactly 1) in most cases, but the support of these redescriptions is below 10 rows; still, all of them have p-value = 0, making them significant at the highest level. Algorithm 2 returns less uniform outcomes: we observed a variety of supports (from a few rows up to almost all) and accuracies, with Jaccard coefficients ranging from 0.99 down to 0.06. Algorithm 2 returned up to 20% statistically insignificant results (i.e. p-value > 0.01); these occur among redescriptions with high support, E1,1 > 1500.

• The structure of the resulting redescriptions is similar to the results received on the Bio data set: Algorithm 1 returns more compact structures involving fewer attributes, with the decision tree induction routine terminating before reaching the maximal allowed depth. Algorithm 2, respectively, returned deeper trees, which resulted in longer redescriptions involving a greater number of parameters.

• When using k-means for the discretization of the left-hand side of the data set, both Algorithms returned a greater share of statistically insignificant results: up to 30% for Algorithm 1 and up to 10% for Algorithm 2. Looking at the Top-5 redescriptions (per Jaccard), Algorithm 1 returns rules with support around 5 rows, but these redescriptions involve quite extreme cases, for example a single author who published more than 10 papers at one conference; the low support here is not surprising, because there are not many researchers in Computer Science who publish that many articles.
Algorithm 2's Top-5 redescriptions (per Jaccard) are rules with E1,1 > 1700 rows, and the parameters inside them reflect a more common number of papers that a researcher submits to one conference (from 0 to 7 papers); these high supports are thus not surprising either.

• Having applied hierarchical clustering to turn the left-hand side of the data set into a binary matrix, both algorithms behave as before, except that Algorithm 1 returned only statistically significant redescriptions, while for Algorithm 2 up to 15% of them did not pass the p-value < 0.01 threshold. The accuracy of Algorithm 1's results is perfect (Jaccard exactly 1), but the redescriptions again describe cases where an author submits an unusually large number of papers to a conference (above 10); hence the support sizes of the results are low. For Algorithm 2 we observed diverse supports, the majority between 20 and 700 rows, giving the redescriptions the desired interestingness.

There is no strict formal limitation on the support of the mined redescriptions; this criterion is rather dictated by the data set at hand. Thus, for the DBLP data we adopt the view that a support between 10 and 1800 rows is of interest, but this choice is influenced only by the nature of this particular data set.

All in all, the selection of the clustering method within the binarization routine (DBSCAN, k-means, hierarchical clustering) on the DBLP data set significantly affects neither the number of mined redescriptions nor their quality. The only noticeable difference is that with k-means both algorithms return more statistically insignificant redescriptions. Both algorithms tend to return their typical results with all clustering methods used. This is caused by the fact that the discretized data participates only in the inception of the algorithm's run; afterwards, the algorithm works through the data in a fully non-binary setting. Hence, in cases where the user has no prior knowledge of the data, we suggest using DBSCAN, because it determines the number of clusters automatically and can prevent clustering the data set into too many clusters, which would lead to an unwanted computational burden. On the other hand, when the user wants to segregate the data into a certain number of clusters, k-means, hierarchical clustering, or any other available clustering algorithm can be used for this purpose.

5.5 Experiments Against the ReReMi Algorithm

To evaluate Algorithms 1 and 2 we compared them to the ReReMi algorithm presented in [20] and extended with on-the-fly bucketing in [18]. ReReMi reported meaningful redescription mining results on both real-world and synthetic data; hence it is a logical choice to compare Algorithms 1 and 2 on the same data sets. Comparing algorithms for redescription mining on real-world data sets is an intricate task, since they might produce different types of redescriptions, and a quality such as 'interestingness' is hard to measure, yet important when analyzing a set of mined redescriptions.

We ran the ReReMi algorithm with analogous parameters on both the Bio and the DBLP data sets. We limited the depth to 3 when running Algorithms 1 and 2, which corresponds to at most 7 variables per query in the ReReMi algorithm (a binary tree of depth 3 has at most 1 + 2 + 4 = 7 internal split nodes). We allowed the use of conjunction and disjunction operators on both sides of a redescription in Algorithms 1, 2 and ReReMi.
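For reference, a single tree-induction step of Algorithms 1 and 2 can be configured with R's rpart [52] roughly as follows; the objects lhs and target and the cp setting are illustrative assumptions, and the snippet sketches only the parameter mapping discussed above, not the full alternating procedure.

    library(rpart)

    # One alternation step: induce a tree over the attributes of one side,
    # using the current binary target vector obtained from the other side.
    fit <- rpart(target ~ ., data = as.data.frame(lhs),
                 method  = "class",
                 parms   = list(split = "gini"),   # or "information" for IG
                 control = rpart.control(
                   minbucket = 100,  # minimal number of entities per leaf
                   maxdepth  = 3,    # depth 3: at most 7 split variables
                   cp        = 0))   # let minbucket/maxdepth govern growth
                                     # (assumed; no cost-complexity pruning)

Switching split between "gini" and "information" reproduces the two impurity-measure settings used throughout this chapter.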
However, when running Algorithms 1 and 2, we vary the impurity measure, which has no identical equivalent in the ReReMi algorithm. The min bucket parameter can be related to the minimal contribution in ReReMi (we used 0.05; details in [19]). In addition, we allow as many initial pairs for ReReMi as there are runs of our Algorithms 1 and 2 in each particular case.

Bio. On the same Bio data set, ReReMi returned 209 unique statistically significant redescriptions, 201 of them with Jaccard higher than 0.8. However, the supports tend to be large, i.e. E1,1 > 1300, meaning that most redescriptions cover a high percentage of the rows of the data set. Algorithm 1 returned 140 unique statistically significant redescriptions, also with diverse supports. Yet, unlike ReReMi, we observed redescriptions with a Jaccard of exactly 1 and low support (around 10); this means that Algorithm 1 tends to mine more obvious and less informative rules than ReReMi. Algorithm 2 returned 156 unique statistically significant redescriptions, with support no lower than 30 rows, which makes them informative; in general these results are closer to ReReMi's. Many of the mined redescriptions (whether by ReReMi or by Algorithms 1 and 2) overlap, describing similar parts of the Bio data set.

DBLP. We used the same DBLP data set to compare the ReReMi algorithm with Algorithms 1 and 2. ReReMi returned 102 redescriptions with support mainly around 10 rows, yet many of them have higher support (up to 68 rows), making them quite interesting; 37 of them have Jaccard coefficients above 0.5. Algorithm 1 (with the Gini index) mined only 32, the majority of which have support below 10 rows. These redescriptions have a shorter structure than ReReMi's and, despite the maximal depth of 3 allowing up to 7 parameters, involve fewer parameters. Thus, these results are obvious rules which carry little interesting information. At the same time, Algorithm 2 returned 81 statistically significant redescriptions whose support confirms their interestingness (above 15 rows); 30 redescriptions have Jaccard above 0.5. They are more complex in structure than the ones returned by Algorithm 1, yet similar to the ones returned by ReReMi.

General remark. Algorithms 1 and 2 differ from ReReMi, since they use distinct approaches to mine and assess redescriptions (decision tree induction versus greedy atomic updates). The CART approach underlying Algorithms 1 and 2 involves the use of impurity measures, for which ReReMi has no counterpart. In addition, ReReMi allows the direct specification of the minimal support of a resulting redescription, while the min bucket parameter we use only adjusts the minimal number of entities in each (internal and leaf) node of the decision tree, which does not guarantee a minimal support size. Our query language, and the way we extract redescriptions from a pair of decision trees, allows the same variable to participate in a query several times, which is not possible in ReReMi; this makes the results difficult to compare with each other. Finally, when processing a fully numerical data set, Algorithms 1 and 2 use clustering as a pre-processing step, since they require binary targets at the inception. This adds one more aspect of incompatibility, because ReReMi uses an on-the-fly bucketing approach [18] when processing fully numerical data sets.
Despite this, the rules mined by Algorithm 2 over the DBLP data set resemble the ones mined by ReReMi. Note that there are no strict limits on the minimal or maximal support that makes a redescription acceptable; usually this indicator is set based on the particular data set at hand. In the experiments with the computer science bibliography we consider redescriptions interesting when their support is higher than 10 rows. Logically, redescriptions supported by nearly the whole data set carry no useful information either, because they simply describe a rule which is true for almost all rows of the data.

Chapter 6 Conclusions and Future Work

This Thesis is dedicated to the data analysis task called redescription mining, which aims to discover objects that have multiple common descriptions, or, vice versa, to reveal shared characteristics of a set of objects. Redescription mining gives insight into data with the help of queries that relate different views of its objects, and it provides a domain-neutral way to cast complex data mining scenarios in terms of simpler primitives.

In this Thesis we extended the alternating algorithm for redescription mining beyond propositional Boolean queries to real-valued attributes and presented two algorithms based on decision tree induction to mine redescriptions. The peculiarities of the parameters used were discussed in detail, and their influence on real-world data sets was explored. We ran our Algorithms on two distinct real-world data sets and obtained results that can be applied to the discussed problems in these domains. Numerous runs of the algorithms demonstrated that they are able to find reasonable, statistically significant redescriptions in the studied domains. The actual value of the outcomes can only be evaluated by putting them to use in collaboration with experts in the corresponding fields.

The principle underlying redescription mining seems easy and intuitive, yet it forms a powerful tool for data exploration that can find practical application in numerous domains. The existing algorithms for redescription mining, augmented by our contributions, empower scientists to create their own descriptors and reason with them for a better understanding of scientific data sets.

There is a large field for future work in redescription mining. In particular, effective methods with profound theoretical foundations are needed to model the information content of redescriptions within the subjective interestingness framework. The cooperation of the elaborated algorithms with existing methods for filtering and post-processing redescriptions is of interest as well. Since uncertainties are inherent in most real-world scenarios, redescription mining should be tailored to take them into consideration, potentially by using other data analysis developments [5].

Bibliography

[1] http://www.salford-systems.com/products/cart.
[2] http://www.informatik.uni-trier.de/~ley/db/.
[3] https://www.salford-systems.com/resources/whitepapers/115-technical-note-for-statisticians.
[4] http://en.wikipedia.org/wiki/DBSCAN.
[5] C. C. Aggarwal. Managing and Mining Uncertain Data, volume 35. Springer, 2010.
[6] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, volume 1215, pages 487–499, 1994.
[7] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[8] E. Boros, P. L. Hammer, T. Ibaraki, A. Kogan, E.
Mayoraz, and I. Muchnik. An implementation of logical analysis of data. IEEE Transactions on Knowledge and Data Engineering, 12(2):292–306, 2000.
[9] P. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large databases. Communications of the ACM, 45(8):38–43, 2002.
[10] L. Breiman. Technical note: Some properties of splitting criteria. Machine Learning, 24(1):41–47, 1996.
[11] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Principles of Data Mining and Knowledge Discovery, pages 74–86. Springer, 2002.
[12] J. Crawford and F. Crawford. Data mining in a scientific environment. In Proceedings of AUUG 96 and Asia Pacific World Wide Web, 1996.
[13] E. B. Hunt, J. Marin, and P. J. Stone. Experiments in Induction. Academic Press, New York, 1966.
[14] E. Edgington and P. Onghena. Randomization Tests. CRC Press, 2007.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[16] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[17] E. Galbrun. Methods for redescription mining. PhD thesis, University of Helsinki, 2013.
[18] E. Galbrun and P. Miettinen. From black and white to full color: Extending redescription mining outside the Boolean world. Statistical Analysis and Data Mining, pages 284–303, 2012.
[19] E. Galbrun and P. Miettinen. Siren demo at SIGMOD 2014. 2014.
[20] A. Gallo, P. Miettinen, and H. Mannila. Finding subgroups having several descriptions: Algorithms for redescription mining. In SDM, pages 334–345. SIAM, 2008.
[21] G. Gigerenzer and Z. Swijtink. The Empire of Chance: How Probability Changed Science and Everyday Life, volume 12. Cambridge University Press, 1989.
[22] C. Goutte, P. Toft, E. Rostrup, F. Å. Nielsen, and L. K. Hansen. On clustering fMRI time series. NeuroImage, 9(3):298–310, 1999.
[23] J. Grinnell. The niche-relationships of the California Thrasher. The Auk, pages 427–433, 1917.
[24] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007.
[25] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.
[26] R. J. Hijmans, S. E. Cameron, J. L. Parra, P. G. Jones, and A. Jarvis. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology, 25(15):1965–1978, 2005.
[27] P. Jaccard. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 1901.
[28] C. Kamath, E. Cantú-Paz, I. K. Fodor, and N. A. Tang. Classifying bent-double galaxies. Computing in Science & Engineering, 4(4):52–60, 2002.
[29] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
[30] D. J. Ketchen and C. L. Shook. The application of cluster analysis in strategic management research: an analysis and critique. Strategic Management Journal, 17(6):441–458, 1996.
[31] J. P. Kleijnen. Cross-validation using the t statistic. European Journal of Operational Research, 13(2):133–141, 1983.
[32] M. Krzywinski and N. Altman. Points of significance: Significance, p values and t-tests. Nature Methods, 10(11):1041–1042, 2013.
[33] D. Kumar.
Redescription mining: Algorithms and applications in bioinformatics. PhD thesis, Virginia Polytechnic Institute and State University, 2007.
[34] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, 1984.
[35] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer, 2006.
[36] S. C. Lemon, J. Roy, M. A. Clark, P. D. Friedmann, and W. Rakowski. Classification and regression tree analysis in public health: methodological review and comparison with logistic regression. Annals of Behavioral Medicine, 26(3):172–181, 2003.
[37] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[38] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, pages 181–192, 1994.
[39] K. Meier, J. Brudney, and J. Bohte. Applied Statistics for Public and Nonprofit Administration. Cengage Learning, 2011.
[40] A. J. Mitchell-Jones. The Atlas of European Mammals. Academic Press, London, 1999.
[41] P. K. Novak, N. Lavrač, and G. I. Webb. Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. The Journal of Machine Learning Research, 10:377–403, 2009.
[42] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.
[43] L. Parida and N. Ramakrishnan. Redescription mining: Structure theory and algorithms. In AAAI, volume 5, pages 837–844, 2005.
[44] F. Questier, R. Put, D. Coomans, B. Walczak, and Y. V. Heyden. The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics and Intelligent Laboratory Systems, 76(1):45–54, 2005.
[45] J. R. Quevedo, A. Bahamonde, and O. Luaces. A simple and efficient method for variable ranking according to their usefulness for learning. Computational Statistics & Data Analysis, 52(1):578–595, 2007.
[46] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[47] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[48] N. Ramakrishnan, D. Kumar, B. Mishra, M. Potts, and R. F. Helm. Turning CARTwheels: an alternating algorithm for mining redescriptions. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 266–275. ACM, 2004.
[49] J. Soberón and M. Nakamura. Niches and distributional areas: concepts, methods, and assumptions. Proceedings of the National Academy of Sciences, 106(Supplement 2):19644–19650, 2009.
[50] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In ACM SIGMOD Record, volume 25, pages 1–12. ACM, 1996.
[51] D. Steinberg and P. Colla. CART: Classification and regression trees. The Top Ten Algorithms in Data Mining, 9:179, 2009.
[52] T. M. Therneau, B. Atkinson, and M. B. Ripley. The rpart package, 2010.
[53] R. L. Thorndike. Who belongs in the family? Psychometrika, 18(4):267–276, 1953.
[54] A. Tripathi, A. Klami, M. Orešič, and S. Kaski. Matching samples of multiple views. Data Mining and Knowledge Discovery, 23(2):300–321, 2011.
[55] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2010.
[56] L. Umek, B. Zupan, M. Toplak, A. Morin, J.-H. Chauchat, G. Makovec, and D. Smrke.
Subgroup Discovery in Data Sets with Multi-dimensional Responses: A Method and a Case Study in Traumatology. Springer, 2009.
[57] V. N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer-Verlag, New York, 2000.
[58] G. P. Wadsworth and J. G. Bryan. Introduction to Probability and Random Variables, volume 7. McGraw-Hill, New York, 1960.
[59] G. J. Williams. Rattle: a data mining GUI for R. The R Journal, 1(2):45–55, 2009.
[60] D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.
[61] M. J. Zaki. Generating non-redundant association rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 34–43. ACM, 2000.
[62] M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In SDM, volume 2, pages 457–473. SIAM, 2002.
[63] M. J. Zaki and N. Ramakrishnan. Reasoning about sets using redescription mining. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 364–373. ACM, 2005.

Appendix A Redescription Sets from experiments with Bio Data Set

Table A.1: Redescriptions mined by Algorithm 1 from the Bio data set (with Gini impurity measure and min bucket = 20). J is the Jaccard similarity, E1,1 the support and p-val. the p-value; tn-max, tn-min and tn-avg stand for the maximum, minimum and average temperature of month n in degrees Celsius, and pn stands for the average precipitation of month n in millimeters.

J | E1,1 | p-val. | Redescription
0.995 | 2454 | 0 | (Arctic.Fox ≥ 0.5 ∧ Red.Fox ≥ 0.5) ∨ (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t7-max ≥ 13.45)
0.988 | 2434 | 0 | (Arctic.Fox ≥ 0.5 ∧ Roe.Deer ≥ 0.5) ∨ (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t8-max < 13.35 ∧ t7-avg ≥ 9.72) ∨ (t8-max ≥ 13.35)
0.979 | 2379 | 0 | (Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t8-max < 14.65 ∧ t7-min ≥ 6.35) ∨ (t8-max ≥ 14.65)
0.972 | 2276 | 0 | (Laxmann.s.Shrew ≥ 0.5 ∧ Yellow.necked.Mouse ≥ 0.5) ∨ (Laxmann.s.Shrew < 0.5 ∧ Grey.Red.Backed.Vole ≥ 0.5 ∧ Brown.long.eared.bat ≥ 0.5) ∨ (Laxmann.s.Shrew < 0.5 ∧ Grey.Red.Backed.Vole < 0.5 ∧ North.American.Beaver < 0.5) ←→ (t12-avg < −5.72 ∧ p8 ≥ 53.2 ∧ t4-max ≥ 5.95) ∨ (t12-avg < −5.72 ∧ p8 < 53.2) ∨ (t12-avg ≥ −5.72)
0.97 | 2326 | 0 | (Arctic.Fox < 0.5 ∧ Norway.lemming ≥ 0.5 ∧ Wolverine < 0.5) ∨ (Arctic.Fox < 0.5 ∧ Norway.lemming < 0.5) ←→ (t9-max < 10.85 ∧ p8 ≥ 36.85 ∧ t6-avg ≥ 10.35) ∨ (t9-max < 10.85 ∧ p8 < 36.85) ∨ (t9-max ≥ 10.85)
0.967 | 2236 | 0 | (Arctic.Fox < 0.5 ∧ Grey.Red.Backed.Vole ≥ 0.5 ∧ European.Hedgehog ≥ 0.5) ∨ (Arctic.Fox < 0.5 ∧ Grey.Red.Backed.Vole < 0.5 ∧ Polar.bear < 0.5) ←→ (t9-avg < 7.835 ∧ t9-max < 10.95 ∧ t7-max ≥ 18.95) ∨ (t9-avg < 7.835 ∧ t9-max ≥ 10.95 ∧ p8 ≥ 102.9) ∨ (t9-avg ≥ 7.835)
0.956 | 1975 | 0 | (Moose ≥ 0.5 ∧ Wood.mouse < 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Wood.mouse ≥ 0.5 ∧ European.Hare ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink < 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t10-max < 9.75 ∧ p8 ≥ 58.85 ∧ t12-min ≥ −3.35) ∨ (t10-max < 9.75 ∧ p8 < 58.85) ∨ (t10-max ≥ 9.75 ∧ t3-max < 3.45 ∧ p10 < 55.1) ∨ (t10-max ≥ 9.75 ∧ t3-max ≥ 3.45)
0.956 | 1831 | 0 | (Moose ≥ 0.5 ∧ Mountain.Hare < 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5 ∧ Polar.bear < 0.5) ←→ (t3-max < 4.65 ∧ t10-max < 11.55 ∧ p6 ≥ 114.5) ∨ (t3-max < 4.65 ∧ t10-max ≥ 11.55) ∨ (t3-max ≥ 4.65)
0.955 | 1876 | 0 | (Moose ≥ 0.5 ∧ Mountain.Hare < 0.5 ∧ Brown.long.eared.bat ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink < 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t3-max < 4.55 ∧ p8 ≥ 56.75 ∧ p5 ≥ 96.5) ∨ (t3-max < 4.55 ∧ p8 < 56.75) ∨ (t3-max ≥ 4.55)
0.953 | 1993 | 0 | (Wood.mouse < 0.5 ∧ Mountain.Hare ≥ 0.5 ∧ Southern.White.breasted.Hedgehog ≥ 0.5) ∨ (Wood.mouse < 0.5 ∧ Mountain.Hare < 0.5 ∧ Polar.bear < 0.5) ∨ (Wood.mouse ≥ 0.5 ∧ Arctic.Fox < 0.5) ←→ (t4-max < 7.75 ∧ t10-max ≥ 8.95) ∨ (t4-max ≥ 7.75 ∧ t4-max < 8.35 ∧ t8-max < 19.85) ∨ (t4-max ≥ 7.75 ∧ t4-max ≥ 8.35)
0.95 | 1923 | 0 | (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew < 0.5 ∧ Harbor.Seal ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew ≥ 0.5) ∨ (Moose < 0.5) ←→ (t2-max < 0.45 ∧ t6-max ≥ 11.45 ∧ p5 ≥ 63.15) ∨ (t2-max < 0.45 ∧ t6-max < 11.45) ∨ (t2-max ≥ 0.45 ∧ t3-max < 4.85 ∧ t6-max < 19.05) ∨ (t2-max ≥ 0.45 ∧ t3-max ≥ 4.85)
0.947 | 36 | 0 | (Polar.bear ≥ 0.5) ←→ (t3-max < −7.05)
0.945 | 1803 | 0 | (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew ≥ 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink < 0.5) ∨ (Moose < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t3-max < 5.15 ∧ p6 ≥ 34.35 ∧ t12-min ≥ −0.95) ∨ (t3-max < 5.15 ∧ p6 < 34.35) ∨ (t3-max ≥ 5.15)
0.944 | 1996 | 0 | (Moose ≥ 0.5 ∧ Wood.mouse ≥ 0.5) ∨ (Moose < 0.5 ∧ Polar.bear < 0.5) ←→ (t2-max < −1.15 ∧ t2-max ≥ −2.05 ∧ t12-min < −5.25) ∨ (t2-max ≥ −1.15 ∧ t4-max < 7.55 ∧ t7-max < 14.05) ∨ (t2-max ≥ −1.15 ∧ t4-max ≥ 7.55)
0.94 | 2232 | 0 | (European.Hamster ≥ 0.5 ∧ House.mouse.1 < 0.5) ∨ (European.Hamster < 0.5 ∧ European.ground.squirrel ≥ 0.5 ∧ Southern.Vole ≥ 0.5) ∨ (European.Hamster < 0.5 ∧ European.ground.squirrel < 0.5) ←→ (p10 < 45.15 ∧ p6 ≥ 61.85 ∧ p6 ≥ 92.15) ∨ (p10 < 45.15 ∧ p6 < 61.85) ∨ (p10 ≥ 45.15)
0.934 | 1926 | 0 | (Wild.boar < 0.5 ∧ Mountain.Hare ≥ 0.5 ∧ Yellow.necked.Mouse ≥ 0.5) ∨ (Wild.boar < 0.5 ∧ Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5) ∨ (Wild.boar ≥ 0.5) ←→ (t8-max < 18.85 ∧ t10-max < 7.15 ∧ p8 < 36.85) ∨ (t8-max < 18.85 ∧ t10-max ≥ 7.15 ∧ p5 ≥ 101.5) ∨ (t8-max ≥ 18.85 ∧ t8-max < 19.35 ∧ t6-avg ≥ 13.65) ∨ (t8-max ≥ 18.85 ∧ t8-max ≥ 19.35)
0.923 | 1722 | 0 | (Mountain.Hare ≥ 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5) ←→ (t9-max < 15.95 ∧ p8 < 36.85) ∨ (t9-max ≥ 15.95 ∧ t7-max ≥ 19.15)
0.91 | 1450 | 0 | (American.Mink ≥ 0.5 ∧ Grey.long.eared.bat < 0.5 ∧ Greater.White.toothed.Shrew ≥ 0.5) ∨ (American.Mink ≥ 0.5 ∧ Grey.long.eared.bat ≥ 0.5) ∨ (American.Mink < 0.5 ∧ Gray.Seal < 0.5 ∧ Mountain.Hare < 0.5) ←→ (t9-max < 17.95 ∧ t5-max < 4.5) ∨ (t9-max ≥ 17.95 ∧ t4-max < 13.05 ∧ p7 ≥ 56.9) ∨ (t9-max ≥ 17.95 ∧ t4-max ≥ 13.05)
0.91 | 1906 | 0 | (Roe.Deer < 0.5 ∧ Eurasian.Pygmy.Shrew < 0.5 ∧ Eurasian.Lynx ≥ 0.5) ∨ (Roe.Deer < 0.5 ∧ Eurasian.Pygmy.Shrew ≥ 0.5) ∨ (Roe.Deer ≥ 0.5 ∧ Granada.Hare ≥ 0.5 ∧ Mediterranean.Water.Shrew ≥ 0.5) ∨ (Roe.Deer ≥ 0.5 ∧ Granada.Hare < 0.5) ←→ (p6 < 39.65 ∧ p7 ≥ 50.1) ∨ (p6 ≥ 39.65 ∧ t7-max < 14.65 ∧ t1-max < −1.9) ∨ (p6 ≥ 39.65 ∧ t7-max ≥ 14.65 ∧ t9-avg < 19.75)
0.908 | 1798 | 0 | (Mountain.Hare ≥ 0.5 ∧ Wild.boar ≥ 0.5 ∧ European.Snow.Vole < 0.5) ∨ (Mountain.Hare < 0.5) ←→ (t5-avg < 10.35 ∧ t7-max < 13.45) ∨ (t5-avg ≥ 10.35 ∧ t6-avg ≥ 13.55)
0.906 | 1597 | 0 | (Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew < 0.5) ←→ (t3-max ≥ 11.15 ∧ t1-avg < 1.045 ∧ t10-max < 18.4) ∨ (t3-max < 11.15 ∧ t8-max < 11.95 ∧ t9-avg ≥ 4.705) ∨ (t3-max < 11.15 ∧ t8-max ≥ 11.95)
0.905 | 1942 | 0 | (Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5 ∧ Common.Vole ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse < 0.5 ∧ Mediterranean.Monk.Seal < 0.5) ∨ (Common.Shrew ≥ 0.5) ←→ (t12-avg < 5.69 ∧ t1-avg ≥ 2.715 ∧ t6-avg ≥ 13.35) ∨ (t12-avg < 5.69 ∧ t1-avg < 2.715)
0.9 | 1525 | 0 | (Mountain.Hare ≥ 0.5 ∧ Chamois ≥ 0.5 ∧ Alpine.Field.Mouse < 0.5) ∨ (Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5 ∧ Gray.Seal < 0.5) ←→ (t9-max < 17.55 ∧ t9-max < 15.65 ∧ p8 < 36.85) ∨ (t9-max ≥ 17.55 ∧ t5-max ≥ 15.85)
0.899 | 1644 | 0 | (Mountain.Hare ≥ 0.5 ∧ Beech.Marten ≥ 0.5 ∧ Alpine.Field.Mouse < 0.5) ∨ (Mountain.Hare < 0.5 ∧ Gray.Seal ≥ 0.5 ∧ Serotine.bat ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ Gray.Seal < 0.5) ←→ (t9-max < 17.15 ∧ t8-max < 12.35) ∨ (t9-max ≥ 17.15 ∧ t8-max < 20.85 ∧ t3-min < 1.9) ∨ (t9-max ≥ 17.15 ∧ t8-max ≥ 20.85)
0.898 | 1929 | 0 | (Stoat < 0.5 ∧ House.mouse ≥ 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Stoat < 0.5 ∧ House.mouse < 0.5) ∨ (Stoat ≥ 0.5 ∧ Algerian.Mouse < 0.5) ←→ (t11-max ≥ 11.25 ∧ t9-max ≥ 21.9 ∧ p10 < 54.05) ∨ (t11-max ≥ 11.25 ∧ t9-max < 21.9) ∨ (t11-max < 11.25 ∧ t11-max ≥ 10.45 ∧ t11-min ≥ 0.95) ∨ (t11-max < 11.25 ∧ t11-max < 10.45)
0.897 | 1679 | 0 | (Mountain.Hare ≥ 0.5 ∧ Wild.boar ≥ 0.5 ∧ Gray.Seal < 0.5) ∨ (Mountain.Hare < 0.5 ∧ Gray.Seal < 0.5) ←→ (t5-max < 15.85 ∧ t8-max < 12.35) ∨ (t5-max ≥ 15.85)
0.896 | 1399 | 0 | (Mountain.Hare ≥ 0.5 ∧ Chamois ≥ 0.5 ∧ Alpine.Field.Mouse < 0.5) ∨ (Mountain.Hare < 0.5 ∧ American.Mink ≥ 0.5 ∧ Greater.White.toothed.Shrew ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ American.Mink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t9-max < 17.95 ∧ t5-max < 4.5) ∨ (t9-max ≥ 17.95 ∧ t4-max < 13.75 ∧ p5 ≥ 56.8) ∨ (t9-max ≥ 17.95 ∧ t4-max ≥ 13.75)
0.894 | 693 | 0 | (Mountain.Hare < 0.5 ∧ Arctic.Fox < 0.5 ∧ Polar.bear ≥ 0.5) ∨ (Mountain.Hare < 0.5 ∧ Arctic.Fox ≥ 0.5) ∨ (Mountain.Hare ≥ 0.5 ∧ Moose < 0.5 ∧ European.Badger < 0.5) ∨ (Mountain.Hare ≥ 0.5 ∧ Moose ≥ 0.5) ←→ (t10-max < 11.55 ∧ p5 < 109.5 ∧ t3-max < 6.05)
0.894 | 1735 | 0 | (Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse < 0.5 ∧ Mediterranean.Monk.Seal < 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Savi.s.Pine.Vole < 0.5) ←→ (t3-max ≥ 11.25 ∧ t1-avg < 0.2825) ∨ (t3-max < 11.25)
0.892 | 1450 | 0 | (House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ Northern.Bat ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5 ∧ Yellow.necked.Mouse ≥ 0.5) ∨ (House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ Mediterranean.Monk.Seal < 0.5) ←→ (t1-max ≥ 4.35 ∧ t1-min < −1.95 ∧ t11-min ≥ 2.45) ∨ (t1-max < 4.35)
0.89 | 1354 | 0 | (American.Mink ≥ 0.5 ∧ Grey.long.eared.bat ≥ 0.5 ∧ Common.Vole ≥ 0.5) ∨ (American.Mink < 0.5 ∧ Gray.Seal < 0.5 ∧ Mountain.Hare < 0.5) ←→ (t8-max < 22.05 ∧ t5-max < 4.5) ∨ (t8-max ≥ 22.05 ∧ t10-max ≥ 12.05 ∧ p6 < 125)
0.887 | 1450 | 0 | (House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ Northern.Bat ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5) ∨ (House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ Mediterranean.Monk.Seal < 0.5) ←→ (t1-max < 4.45)
0.877 | 1761 | 0 | (Stoat < 0.5 ∧ Bank.Vole < 0.5 ∧ Common.Shrew ≥ 0.5) ∨ (Stoat < 0.5 ∧ Bank.Vole ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Granada.Hare < 0.5) ←→ (p7 ≥ 39.6 ∧ t7-max ≥ 13.45)
0.874 | 1694 | 0 | (Muskrat < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (Muskrat < 0.5 ∧ Common.Shrew ≥ 0.5) ∨ (Muskrat ≥ 0.5) ←→ (p7 ≥ 42.35 ∧ p1 ≥ 100.5 ∧ t2-min < −0.25) ∨ (p7 ≥ 42.35 ∧ p1 < 100.5)
0.873 | 982 | 0 | (American.Mink < 0.5 ∧ Gray.Seal < 0.5 ∧ Mountain.Hare ≥ 0.5) ∨ (American.Mink < 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (American.Mink ≥ 0.5 ∧ Grey.long.eared.bat < 0.5 ∧ Greater.White.toothed.Shrew < 0.5) ←→ (t9-max ≥ 17.95 ∧ t4-max < 13.05 ∧ p7 < 56.9) ∨ (t9-max < 17.95 ∧ t5-max ≥ 4.5)
0.869 | 1540 | 0 | (Stoat < 0.5 ∧ American.Mink < 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.Mink ≥ 0.5 ∧ Beech.Marten < 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5 ∧ Algerian.Mouse < 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Genet < 0.5 ∧ Steppe.Mouse < 0.5) ←→ (t8-max ≥ 24.75 ∧ p8 ≥ 65.45 ∧ t9-avg < 17.35) ∨ (t8-max < 24.75 ∧ t5-max ≥ 4.5)
0.868 | 1607 | 0 | (Common.Shrew < 0.5 ∧ Beech.Marten < 0.5 ∧ Black.rat < 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Stoat < 0.5 ∧ European.ground.squirrel < 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Stoat ≥ 0.5) ←→ (t3-max ≥ 10.55 ∧ p7 ≥ 64.95) ∨ (t3-max < 10.55 ∧ t9-max ≥ 22.35 ∧ p11 < 45.9) ∨ (t3-max < 10.55 ∧ t9-max < 22.35)
0.866 | 1653 | 0 | (House.mouse ≥ 0.5 ∧ Common.Shrew ≥ 0.5) ∨ (House.mouse < 0.5 ∧ European.Free.tailed.Bat < 0.5) ←→ (t2-max ≥ 6.75 ∧ t3-avg < 6.375 ∧ t6-min ≥ 9.65) ∨ (t2-max < 6.75 ∧ t2-avg ≥ 1.985 ∧ t9-max ≥ 13.95) ∨ (t2-max < 6.75 ∧ t2-avg < 1.985)
0.866 | 1536 | 0 | (Bank.Vole < 0.5 ∧ House.mouse.1 ≥ 0.5 ∧ Arctic.Fox < 0.5) ∨ (Bank.Vole ≥ 0.5 ∧ Norway.lemming ≥ 0.5 ∧ Raccoon.Dog ≥ 0.5) ∨ (Bank.Vole ≥ 0.5 ∧ Norway.lemming < 0.5 ∧ Roman.Mole < 0.5) ←→ (p7 < 36.65 ∧ p8 ≥ 47.5) ∨ (p7 ≥ 36.65 ∧ t7-max ≥ 17.95 ∧ t10-avg < 13.95)
0.864 | 667 | 0 | (Moose < 0.5 ∧ Arctic.Fox ≥ 0.5 ∧ American.Mink ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew < 0.5) ←→ (t3-max < 5.15 ∧ p6 ≥ 34.35 ∧ t12-min < −0.95)
0.845 | 551 | 0 | (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew < 0.5 ∧ Harbor.Seal < 0.5) ←→ (t2-max ≥ 0.45 ∧ t3-max < 4.85 ∧ t6-max ≥ 19.05) ∨ (t2-max < 0.45 ∧ t6-max ≥ 11.45 ∧ p5 < 63.15)
0.844 | 949 | 0 | (House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ Mediterranean.Monk.Seal ≥ 0.5) ∨ (House.mouse < 0.5 ∧ Granada.Hare ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5 ∧ Yellow.necked.Mouse < 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ Northern.Bat < 0.5) ←→ (t1-max ≥ 4.35 ∧ t1-min < −1.95 ∧ t11-min < 2.45) ∨ (t1-max ≥ 4.35 ∧ t1-min ≥ −1.95)
0.842 | 1195 | 0 | (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5 ∧ Yellow.necked.Mouse ≥ 0.5) ∨ (House.mouse < 0.5 ∧ House.mouse.1 < 0.5 ∧ Moose ≥ 0.5) ∨ (House.mouse < 0.5 ∧ House.mouse.1 ≥ 0.5) ←→ (t1-max < 3.85 ∧ t6-max < 10.7 ∧ t10-avg ≥ 1.35) ∨ (t1-max < 3.85 ∧ t6-max ≥ 10.7)
0.838 | 335 | 0 | (Grey.Red.Backed.Vole < 0.5 ∧ Siberian.Flying.Squirrel < 0.5 ∧ Wolverine ≥ 0.5) ∨ (Grey.Red.Backed.Vole < 0.5 ∧ Siberian.Flying.Squirrel ≥ 0.5) ∨ (Grey.Red.Backed.Vole ≥ 0.5) ←→ (t2-avg ≥ −6.82 ∧ t2-avg < −5.515 ∧ p5 < 41.35) ∨ (t2-avg < −6.82 ∧ p8 ≥ 49.8)
0.837 | 589 | 0 | (Moose < 0.5 ∧ Polar.bear ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Lesser.White.toothed.Shrew < 0.5 ∧ Harbor.Seal < 0.5) ←→ (t2-max ≥ 0.45 ∧ t3-max < 4.85 ∧ t6-max ≥ 19.05) ∨ (t2-max < 0.45 ∧ p5 < 63.15)
0.836 | 940 | 0 | (House.mouse < 0.5 ∧ Granada.Hare < 0.5 ∧ Mediterranean.Monk.Seal ≥ 0.5) ∨ (House.mouse < 0.5 ∧ Granada.Hare ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Raccoon < 0.5 ∧ Northern.Bat < 0.5) ←→ (t1-max ≥ 4.45)
0.831 | 813 | 0 | (Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ Mountain.Hare < 0.5 ∧ House.mouse.1 < 0.5) ←→ (t3-max < 11.15 ∧ t8-max < 11.95 ∧ t9-avg < 4.705) ∨ (t3-max ≥ 11.15 ∧ t1-avg < 1.045 ∧ t10-max ≥ 18.4) ∨ (t3-max ≥ 11.15 ∧ t1-avg ≥ 1.045)
0.831 | 813 | 0 | (Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5 ∧ Mountain.Hare < 0.5) ←→ (t3-max < 11.15 ∧ t8-max < 11.95 ∧ t9-avg < 4.705) ∨ (t3-max ≥ 11.15 ∧ t1-avg < 1.045 ∧ t10-max ≥ 18.4) ∨ (t3-max ≥ 11.15 ∧ t1-avg ≥ 1.045)
0.828 | 1080 | 0 | (House.mouse < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5) ∨ (House.mouse ≥ 0.5 ∧ Northern.Bat < 0.5 ∧ Striped.Field.Mouse < 0.5) ←→ (t1-avg < 0.684 ∧ t8-max < 11.95 ∧ t9-avg < 4.705) ∨ (t1-avg ≥ 0.684 ∧ p6 < 98.65)
0.813 | 1206 | 0 | (Common.Vole < 0.5 ∧ European.Mole < 0.5 ∧ European.Pine.Vole ≥ 0.5) ∨ (Common.Vole < 0.5 ∧ European.Mole ≥ 0.5 ∧ Moose < 0.5) ∨ (Common.Vole ≥ 0.5 ∧ Moose ≥ 0.5 ∧ Mountain.Hare < 0.5) ∨ (Common.Vole ≥ 0.5 ∧ Moose < 0.5) ←→ (t3-max ≥ 3.95 ∧ t12-max ≥ 10.25 ∧ p6 ≥ 63.4) ∨ (t3-max ≥ 3.95 ∧ t12-max < 10.25)
0.811 | 877 | 0 | (Common.Shrew ≥ 0.5 ∧ Marbled.Polecat < 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew ≥ 0.5 ∧ Marbled.Polecat ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ American.Mink ≥ 0.5 ∧ Greater.White.toothed.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ American.Mink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t10-max < 15.25 ∧ t5-max < 4.5) ∨ (t10-max ≥ 15.25 ∧ t1-avg < 0.6415 ∧ p8 < 49.85) ∨ (t10-max ≥ 15.25 ∧ t1-avg ≥ 0.6415)
0.803 | 293 | 0 | (Grey.Red.Backed.Vole < 0.5 ∧ North.American.Beaver < 0.5 ∧ Laxmann.s.Shrew ≥ 0.5) ∨ (Grey.Red.Backed.Vole < 0.5 ∧ North.American.Beaver ≥ 0.5) ∨ (Grey.Red.Backed.Vole ≥ 0.5) ←→ (t2-avg ≥ −6.82 ∧ t2-avg < −5.865 ∧ p5 < 42.5) ∨ (t2-avg < −6.82 ∧ p8 ≥ 53.2 ∧ t11-max < 1.65)
0.802 | 449 | 0 | (Moose ≥ 0.5 ∧ Wood.mouse < 0.5) ←→ (t2-max ≥ −1.15 ∧ t4-max < 7.55 ∧ t7-max ≥ 14.05) ∨ (t2-max < −1.15 ∧ p8 ≥ 58.85)
0.802 | 937 | 0 | (Stoat ≥ 0.5 ∧ Common.Bent.wing.Bat < 0.5 ∧ Steppe.Mouse ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Bent.wing.Bat ≥ 0.5) ∨ (Stoat < 0.5 ∧ Arctic.Fox < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8-max < 24.75 ∧ t8-max < 23.85 ∧ p7 < 29.4) ∨ (t8-max < 24.75 ∧ t8-max ≥ 23.85 ∧ p4 ≥ 53.5) ∨ (t8-max ≥ 24.75)
0.797 | 839 | 0 | (Common.Shrew ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5 ∧ Moose < 0.5) ←→ (t2-max < 7.25 ∧ t7-max < 12.35 ∧ t9-avg < 4.705) ∨ (t2-max ≥ 7.25 ∧ t3-avg < 6.37 ∧ t5-min < 6.35) ∨ (t2-max ≥ 7.25 ∧ t3-avg ≥ 6.37)
0.794 | 460 | 0 | (Moose < 0.5 ∧ Polar.bear ≥ 0.5) ∨ (Moose ≥ 0.5 ∧ Wood.mouse < 0.5) ←→ (t2-max ≥ −1.15 ∧ t4-max < 7.55 ∧ t7-max ≥ 14.05) ∨ (t2-max < −1.15 ∧ t2-max ≥ −2.05 ∧ t12-min ≥ −5.25) ∨ (t2-max < −1.15 ∧ t2-max < −2.05)
0.792 | 880 | 0 | (Stoat ≥ 0.5 ∧ Algerian.Mouse < 0.5 ∧ Steppe.Mouse ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Algerian.Mouse ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.Mink ≥ 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.Mink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8-max < 24.75 ∧ t5-max < 4.5) ∨ (t8-max ≥ 24.75)
0.787 | 111 | 0 | (Arctic.Fox < 0.5 ∧ Polar.bear ≥ 0.5) ∨ (Arctic.Fox ≥ 0.5 ∧ Roe.Deer < 0.5) ←→ (t8-max < 13.35 ∧ t7-avg < 9.72)
0.786 | 704 | 0 | (Mountain.Hare < 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Mountain.Hare ≥ 0.5 ∧ Wild.boar ≥ 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Mountain.Hare ≥ 0.5 ∧ Wild.boar < 0.5) ←→ (t5-max < 15.85 ∧ t8-max ≥ 12.35)
0.775 | 802 | 0 | (Stoat ≥ 0.5 ∧ Common.Genet < 0.5 ∧ Steppe.Mouse ≥ 0.5) ∨ (Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5 ∧ Algerian.Mouse ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.Mink ≥ 0.5 ∧ Beech.Marten ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.Mink < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8-max < 24.75 ∧ t5-max < 4.5) ∨ (t8-max ≥ 24.75 ∧ p8 ≥ 65.45 ∧ t9-avg ≥ 17.35) ∨ (t8-max ≥ 24.75 ∧ p8 < 65.45)
0.772 | 922 | 0 | (Wild.boar ≥ 0.5 ∧ Gray.Seal ≥ 0.5) ∨ (Wild.boar < 0.5 ∧ Mountain.Hare < 0.5 ∧ Black.rat < 0.5) ∨ (Wild.boar < 0.5 ∧ Mountain.Hare ≥ 0.5) ←→ (t5-max ≥ 15.95 ∧ t6-max < 20.15 ∧ p9 < 69.6) ∨ (t5-max < 15.95 ∧ p5 < 112)
0.758 | 681 | 0 | (Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5 ∧ European.Pine.Vole < 0.5) ∨ (Stoat < 0.5 ∧ House.mouse.1 < 0.5 ∧ Gray.Seal < 0.5) ←→ (t11-max < 11.85 ∧ p7 < 41.9) ∨ (t11-max ≥ 11.85 ∧ t10-max ≥ 17.55)
0.754 | 898 | 0 | (Common.Vole < 0.5 ∧ European.ground.squirrel ≥ 0.5) ∨ (Common.Vole ≥ 0.5 ∧ Alpine.Pine.Vole < 0.5) ←→ (t8-max < 20.45 ∧ t6-min < 9.85 ∧ p5 ≥ 112) ∨ (t8-max < 20.45 ∧ t6-min ≥ 9.85 ∧ t4-avg < 4.22) ∨ (t8-max ≥ 20.45 ∧ t12-max < 10.25 ∧ p10 < 86.05)
0.735 | 613 | 0 | (Stoat ≥ 0.5 ∧ Spanish.Mole ≥ 0.5) ∨ (Stoat < 0.5 ∧ Common.Shrew ≥ 0.5 ∧ European.Free.tailed.Bat ≥ 0.5) ∨ (Stoat < 0.5 ∧ Common.Shrew < 0.5 ∧ European.Mole < 0.5) ←→ (p7 ≥ 42.85 ∧ t7-max < 13.45) ∨ (p7 < 42.85)
0.733 | 55 | 0 | (Egyptian.Mongoose ≥ 0.5) ←→ (p8 < 7.47 ∧ p5 ≥ 27.5)
0.722 | 666 | 0 | (House.mouse < 0.5 ∧ European.Free.tailed.Bat ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Common.Shrew < 0.5) ←→ (t2-max < 6.75 ∧ t2-avg ≥ 1.985 ∧ t9-max < 13.95) ∨ (t2-max ≥ 6.75 ∧ t3-avg < 6.375 ∧ t6-min < 9.65) ∨ (t2-max ≥ 6.75 ∧ t3-avg ≥ 6.375)
J 0.722 E1,1 636 p-val.
0 0.721 536 0 0.718 140 0 0.715 178 0 0.708 600 0 0.707 188 0 0.684 0.681 117 552 0 0 0.657 142 0 0.642 70 0 0.638 347 0 0.605 130 0 83 Redescription (M uskrat < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 < 0.5) ←→ (p7 ≥ 42.35 ∧ p1 ≥ 100.5 ∧ t2 − min ≥ −0.25) ∨ (p7 < 42.35) (Stoat ≥ 0.5 ∧ Common.Genet ≥ 0.5) ∨ (Stoat < 0.5 ∧ Black.rat < 0.5 ∧ Kuhl.s.P ipistrelle ≥ 0.5) ∨ (Stoat < 0.5 ∧ Black.rat ≥ 0.5∧House.mouse.1 < 0.5) ←→ (t11−max < 10.95∧t3−max ≥ 11.35∧t2−max ≥ 7.65)∨(t11−max ≥ 10.95 ∧ t8 − max ≥ 22.7 ∧ p11 ≥ 49.95) (Grey.Red.Backed.V ole < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5 ∧ Laxmann.s.Shrew < 0.5 ∧ Eurasian.W ater.Shrew < 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5 ∧ Laxmann.s.Shrew ≥ 0.5) ←→ (t11 − min < −8.45 ∧ t2 − max ≥ −6.85 ∧ p4 < 42.6) ∨ (t11 − min < −8.45 ∧ t2 − max < −6.85) (Arctic.F ox < 0.5 ∧ N orway.lemming ≥ 0.5 ∧ W olverine ≥ 0.5) ∨ (Arctic.F ox ≥ 0.5) ←→ (t9 − max < 10.85 ∧ p8 ≥ 36.85 ∧ t6 − avg < 10.35) (House.mouse < 0.5 ∧ M editerranean.M onk.Seal ≥ 0.5) ∨ (House.mouse ≥ 0.5 ∧ Y ellow.necked.M ouse ≥ 0.5 ∧ Common.V ole < 0.5) ∨ (House.mouse ≥ 0.5 ∧ Y ellow.necked.M ouse < 0.5 ∧ Common.Shrew < 0.5) ←→ (t1 − avg ≥ 2.575 ∧ t2 − min < −0.05 ∧ t7 − min ≥ 12.95) ∨ (t1 − avg ≥ 2.575 ∧ t2 − min ≥ −0.05) (Grey.Red.Backed.V ole < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (Grey.Red.Backed.V ole ≥ 0.5 ∧ European.Hedgehog < 0.5) ←→ (t11 − min ≥ −8.3 ∧ t11 − avg < −2.175 ∧ p7 ≥ 80.95) ∨ (t11 − min < −8.3 ∧ t11 − max < −0.65) (P olar.bear < 0.5∧Laxmann.s.Shrew ≥ 0.5∧European.P olecat < 0.5)∨(P olar.bear ≥ 0.5) ←→ (t3−min < −12.55) (Kuhl.s.P ipistrelle < 0.5 ∧ Southwestern.W ater.V ole < 0.5 ∧ Savi.s.P ine.V ole ≥ 0.5) ∨ (Kuhl.s.P ipistrelle < 0.5 ∧ Southwestern.W ater.V ole ≥ 0.5) ∨ (Kuhl.s.P ipistrelle ≥ 0.5 ∧ P arti.coloured.bat < 0.5 ∧ Alpine.marmot < 0.5) ←→ (t3 − max ≥ 11.15 ∧ t1 − avg ≥ 0.483) (Laxmann.s.Shrew < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (Laxmann.s.Shrew ≥ 0.5 ∧ Y ellow.necked.M ouse < 0.5) ←→ (t2 − max < −5.55 ∧ t3 − min ≥ −12.95 ∧ t7 − max ≥ 19.95) ∨ (t2 − max < −5.55 ∧ t3 − min < −12.95) (P olar.bear < 0.5∧N orthern.Red.backed.V ole ≥ 0.5∧Laxmann.s.Shrew ≥ 0.5)∨(P olar.bear ≥ 0.5) ←→ (t3−avg < −8.635) (Raccoon.Dog ≥ 0.5 ∧ House.mouse.1 < 0.5 ∧ Siberian.F lying.Squirrel ≥ 0.5) ∨ (Raccoon.Dog ≥ 0.5 ∧ House.mouse.1 ≥ 0.5 ∧ W ildcat < 0.5) ←→ (p2 < 34.05 ∧ p8 ≥ 55.15 ∧ t7 − max ≥ 20.15) (Alpine.marmot < 0.5 ∧ Alpine.Shrew ≥ 0.5 ∧ European.Hamster < 0.5) ∨ (Alpine.marmot ≥ 0.5) ←→ (p5 ≥ 89.45 ∧ p1 < 146.5 ∧ t2 − min < −3.35) Continued Appendix A Redescription Sets from experiments with Bio Data Set Table A.1: Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min bucket=20.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. E1,1 200 p-val. 
0 0.571 100 0 0.551 194 0 0.494 198 0 0.492 405 0 0.455 0.444 35 179 0 0 0.407 48 0 0.392 143 0 0.383 0.349 0.323 0.294 31 22 20 15 0 0 0 0 Redescription (European.Hamster < 0.5 ∧ European.ground.squirrel ≥ 0.5 ∧ Southern.V ole < 0.5) ∨ (European.Hamster ≥ 0.5 ∧ House.mouse.1 ≥ 0.5) ←→ (p10 < 45.15 ∧ p6 ≥ 61.85 ∧ p6 < 92.15) (Alpine.Shrew ≥ 0.5 ∧ Alpine.marmot < 0.5 ∧ European.Hamster < 0.5) ∨ (Alpine.Shrew ≥ 0.5 ∧ Alpine.marmot ≥ 0.5) ←→ (p5 ≥ 90.15 ∧ t1 − avg < −0.286 ∧ p10 < 152.5) (European.Hamster < 0.5 ∧ P olar.bear ≥ 0.5) ∨ (European.Hamster ≥ 0.5 ∧ T undra.V ole < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (European.Hamster ≥ 0.5 ∧ T undra.V ole ≥ 0.5) ←→ (p10 ≥ 42.75 ∧ p10 < 45.15 ∧ p6 ≥ 73.75) ∨ (p10 < 42.75 ∧ t11 − max < 9.55 ∧ p4 < 49.6) (Etruscan.Shrew < 0.5 ∧ Common.Shrew < 0.5 ∧ Savi.s.P ine.V ole ≥ 0.5) ∨ (Etruscan.Shrew ≥ 0.5 ∧ P yrenean.Desman < 0.5) ←→ (t10 − avg ≥ 12.75 ∧ p10 ≥ 54.25 ∧ p1 < 104.5) (Edible.dormouse < 0.5 ∧ M editerranean.W ater.Shrew ≥ 0.5 ∧ P yrenean.Desman ≥ 0.5) ∨ (Edible.dormouse ≥ 0.5 ∧ M editerranean.W ater.Shrew < 0.5 ∧ American.M ink < 0.5) ∨ (Edible.dormouse ≥ 0.5 ∧ M editerranean.W ater.Shrew ≥ 0.5) ←→ (p5 ≥ 58.55 ∧ t7 − max < 21.15 ∧ p5 ≥ 107.5) ∨ (p5 ≥ 58.55 ∧ t7 − max ≥ 21.15 ∧ t5 − min < 7.45) (Alpine.Ibex ≥ 0.5) ←→ (p5 ≥ 107.5) (Common.Genet < 0.5 ∧ Kuhl.s.P ipistrelle ≥ 0.5 ∧ Coypu ≥ 0.5) ∨ (Common.Genet ≥ 0.5 ∧ Lesser.W hite.toothed.Shrew < 0.5 ∧ Egyptian.M ongoose < 0.5) ∨ (Common.Genet ≥ 0.5 ∧ Lesser.W hite.toothed.Shrew ≥ 0.5) ←→ (t3 − max < 11.95 ∧ t3 − max ≥ 11.05 ∧ t5 − min < 5.65) ∨ (t3 − max ≥ 11.95 ∧ t10 − max ≥ 20.25 ∧ p12 < 59.45) ∨ (t3 − max ≥ 11.95 ∧ t10 − max < 20.25 ∧ t3 − max ≥ 13.75) (Alpine.marmot < 0.5 ∧ Alpine.Shrew ≥ 0.5 ∧ Brown.rat < 0.5) ∨ (Alpine.marmot ≥ 0.5 ∧ Common.Shrew ≥ 0.5) ←→ (p5 ≥ 103.5 ∧ t2 − min < −6.35) (Chamois < 0.5 ∧ Gerbe.s.V ole < 0.5 ∧ Alpine.Shrew ≥ 0.5) ∨ (Chamois < 0.5 ∧ Gerbe.s.V ole ≥ 0.5) ∨ (Chamois ≥ 0.5) ←→ (p5 ≥ 75.95 ∧ p1 < 132.5 ∧ p5 ≥ 90.9) (Egyptian.M ongoose ≥ 0.5 ∧ Etruscan.Shrew ≥ 0.5) ←→ (t3 − max ≥ 13.25 ∧ t3 − max ≥ 17.25) (Stoat < 0.5 ∧ M editerranean.M onk.Seal ≥ 0.5) ←→ (p9 < 15.5) (Alpine.marmot ≥ 0.5 ∧ Alpine.F ield.M ouse ≥ 0.5) ←→ (p5 ≥ 106.5) (Granada.Hare ≥ 0.5 ∧ Iberian.Lynx ≥ 0.5) ←→ (t7 − max ≥ 34.25) Continued Appendix A Redescription Sets from experiments with Bio Data Set J 0.583 84 Table A.1: Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min bucket=20.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. J E1,1 p-val. Redescription Appendix A Redescription Sets from experiments with Bio Data Set Table A.1: Redescriptions mined by Algorithm 1 from Bio data set (with Gini impurity measure and min bucket=20.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. 85 E1,1 2347 2262 2159 p-val. 
0.000 0.000 0.000 0.949 0.947 0.947 0.947 2372 2286 2267 2178 0.000 0.000 0.000 0.000 0.938 0.938 1905 1905 0.000 0.000 0.932 2072 0.000 0.907 0.906 1789 2079 0.000 0.000 0.900 1816 0.000 0.891 1750 0.000 0.883 1968 0.000 Redescription (Arctic.F ox < 0.5) ←→ (t6 − avg < 10.25 ∧ t9 − max ≥ 10.75) ∨ (t6 − avg ≥ 10.25) (Arctic.F ox < 0.5 ∧ W olverine < 0.5) ←→ (t9 − max < 12.15 ∧ t9 − max ≥ 10.85) ∨ (t9 − max ≥ 12.15) (W ood.Lemming < 0.5∧M oose ≥ 0.5∧Brown.Bear < 0.5)∨(W ood.Lemming < 0.5∧M oose < 0.5) ←→ (t2−avg < −4.99 ∧ t7 − avg < 10.85) ∨ (t2 − avg ≥ −4.99) (Stoat < 0.5 ∧ Granada.Hare < 0.5) ∨ (Stoat ≥ 0.5) ←→ (p8 < 40.15 ∧ p11 ≥ 54) ∨ (p8 ≥ 40.15) (N orway.lemming < 0.5) ←→ (t8 − avg < 12.55 ∧ t7 − max < 13.95) ∨ (t8 − avg ≥ 12.55) (Grey.Red.Backed.V ole < 0.5) ←→ (t1 − min < −11.45 ∧ t7 − max ≥ 20.05) ∨ (t1 − min ≥ −11.45) (W ood.mouse < 0.5∧M ountain.Hare ≥ 0.5∧Striped.F ield.M ouse ≥ 0.5)∨(W ood.mouse < 0.5∧M ountain.Hare < 0.5)∨(W ood.mouse ≥ 0.5) ←→ (t4−max < 7.85∧t7−max ≥ 14.05∧t10−max ≥ 7.15)∨(t4−max < 7.85∧t7−max < 14.05) ∨ (t4 − max ≥ 7.85) (M oose ≥ 0.5∧M ountain.Hare < 0.5)∨(M oose < 0.5) ←→ (t3−max < 4.65∧t7−max < 13.45)∨(t3−max ≥ 4.65) (M ountain.Hare ≥ 0.5 ∧ M oose < 0.5) ∨ (M ountain.Hare < 0.5) ←→ (t3 − max < 4.65 ∧ t7 − max < 13.45) ∨ (t3 − max ≥ 4.65) (W ood.mouse < 0.5 ∧ M ountain.Hare < 0.5) ∨ (W ood.mouse ≥ 0.5) ←→ (t10 − max < 10.85 ∧ t2 − max < −1.45 ∧ t7 − avg < 10.65) ∨ (t10 − max < 10.85 ∧ t2 − max ≥ −1.45) ∨ (t10 − max ≥ 10.85) (M oose < 0.5) ←→ (t2 − max < 1.55 ∧ t6 − max < 12.05) ∨ (t2 − max ≥ 1.55) (Stoat < 0.5 ∧ House.mouse ≥ 0.5 ∧ Common.V ole ≥ 0.5) ∨ (Stoat < 0.5 ∧ House.mouse < 0.5) ∨ (Stoat ≥ 0.5) ←→ (t10 − max ≥ 18.65 ∧ t11 − avg < 9.73) ∨ (t10 − max < 18.65) (M ountain.Hare ≥ 0.5 ∧ W ild.boar ≥ 0.5) ∨ (M ountain.Hare < 0.5 ∧ W ild.boar < 0.5 ∧ Gray.Seal < 0.5) ∨ (M ountain.Hare < 0.5∧W ild.boar ≥ 0.5) ←→ (t5−max < 16.05∧t7−max ≥ 13.45∧t8−max ≥ 19.55)∨(t5−max < 16.05 ∧ t7 − max < 13.45) ∨ (t5 − max ≥ 16.05) (M ountain.Hare < 0.5) ←→ (t9−avg < 13.05∧t7−max ≥ 13.45∧t10−max ≥ 11.45)∨(t9−avg < 13.05∧t7−max < 13.45) ∨ (t9 − avg ≥ 13.05) (Stoat < 0.5 ∧ House.mouse < 0.5) ∨ (Stoat ≥ 0.5) ←→ (t11 − max ≥ 11.05 ∧ t8 − avg ≥ 19.55 ∧ p10 < 55.45) ∨ (t11 − max ≥ 11.05 ∧ t8 − avg < 19.55) ∨ (t11 − max < 11.05) Continued Appendix A Redescription Sets from experiments with Bio Data Set J 0.966 0.958 0.956 86 Table A.2: Redescriptions mined by Algorithm 1 from Bio data set (with IG - impurity measure and min bucket=100.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. J 0.877 E1,1 1613 p-val. 
0.000 0.870 1667 0.000 0.849 1625 0.000 0.842 1691 0.000 0.835 1701 0.000 0.829 0.823 0.808 0.804 0.802 1414 1599 1325 1436 1167 0.000 0.000 0.000 0.000 0.000 0.781 0.767 0.749 0.748 0.745 1013 603 870 935 825 0.000 0.000 0.000 0.000 0.000 0.741 0.738 0.714 0.702 611 637 282 353 0.000 0.000 0.000 0.000 87 Redescription (M ountain.Hare ≥ 0.5 ∧ Beech.M arten ≥ 0.5) ∨ (M ountain.Hare < 0.5 ∧ Gray.Seal < 0.5) ←→ (t8 − max < 21.15 ∧ t7 − max < 13.45) ∨ (t8 − max ≥ 21.15) (M uskrat < 0.5 ∧ Common.Shrew < 0.5 ∧ House.mouse.1 ≥ 0.5) ∨ (M uskrat < 0.5 ∧ Common.Shrew ≥ 0.5) ∨ (M uskrat ≥ 0.5) ←→ (p7 ≥ 39.85 ∧ p10 ≥ 83.35 ∧ t1 − avg < 2.745) ∨ (p7 ≥ 39.85 ∧ p10 < 83.35) (House.mouse ≥ 0.5 ∧ Y ellow.necked.M ouse ≥ 0.5 ∧ Common.V ole ≥ 0.5) ∨ (House.mouse < 0.5) ←→ (t1 − avg ≥ 0.95 ∧ t1 − min < −0.35 ∧ t8 − max ≥ 22.65) ∨ (t1 − avg < 0.95) (Common.Shrew < 0.5 ∧ House.mouse < 0.5) ∨ (Common.Shrew ≥ 0.5) ←→ (t1 − max ≥ 4.35 ∧ t3 − avg < 6.235) ∨ (t1 − max < 4.35) (Stoat < 0.5 ∧ American.M ink < 0.5 ∧ Eurasian.W ater.Shrew ≥ 0.5) ∨ (Stoat < 0.5 ∧ American.M ink ≥ 0.5) ∨ (Stoat ≥ 0.5) ←→ (p8 ≥ 49.55) (House.mouse ≥ 0.5 ∧ Raccoon ≥ 0.5) ∨ (House.mouse < 0.5) ←→ (t1 − max < 4.35) (Stoat < 0.5 ∧ European.M ole ≥ 0.5) ∨ (Stoat ≥ 0.5) ←→ (p7 ≥ 42 ∧ t7 − max ≥ 16.75) (American.M ink < 0.5 ∧ M ountain.Hare < 0.5) ←→ (t8 − max < 22.05 ∧ p8 < 57.65) ∨ (t8 − max ≥ 22.05) (Stoat < 0.5 ∧ Black.rat < 0.5 ∧ American.M ink ≥ 0.5) ∨ (Stoat ≥ 0.5) ←→ (t9 − max < 22.15 ∧ p8 ≥ 51.75) (House.mouse ≥ 0.5∧Raccoon ≥ 0.5)∨(House.mouse < 0.5∧M oose < 0.5∧House.mouse.1 ≥ 0.5)∨(House.mouse < 0.5 ∧ M oose ≥ 0.5) ←→ (t1 − max < 5.05 ∧ p2 ≥ 40.45 ∧ t1 − max < 3.15) ∨ (t1 − max < 5.05 ∧ p2 < 40.45) (House.mouse.1 < 0.5 ∧ Raccoon.Dog ≥ 0.5) ∨ (House.mouse.1 ≥ 0.5) ←→ (t1 − max < 4.35 ∧ t7 − avg ≥ 12.55) (M oose ≥ 0.5) ←→ (t2 − max < 1.55 ∧ t6 − max ≥ 12.05) (House.mouse ≥ 0.5 ∧ Raccoon < 0.5) ←→ (t1 − max ≥ 4.35) (American.M ink < 0.5 ∧ M ountain.Hare ≥ 0.5) ∨ (American.M ink ≥ 0.5) ←→ (t8 − max < 22.05 ∧ p8 ≥ 57.65) (American.M ink < 0.5 ∧ M ountain.Hare ≥ 0.5) ∨ (American.M ink ≥ 0.5 ∧ Beech.M arten < 0.5) ←→ (t8 − max < 21.15) (M ountain.Hare ≥ 0.5) ←→ (t9 − avg < 13.05 ∧ t7 − max ≥ 13.45 ∧ t10 − max < 11.45) (M oose < 0.5∧Red.F ox < 0.5∧House.mouse < 0.5)∨(M oose ≥ 0.5∧Beech.M arten < 0.5) ←→ (t10−max < 10.45) (M oose ≥ 0.5 ∧ Brown.Bear ≥ 0.5) ←→ (t2 − max < −1.75 ∧ t7 − avg ≥ 10.85) (W ood.mouse < 0.5 ∧ M ountain.Hare ≥ 0.5) ←→ (t10 − max < 10.85 ∧ t2 − max < −1.45 ∧ t7 − avg ≥ 10.65) Continued Appendix A Redescription Sets from experiments with Bio Data Set Table A.2: Redescriptions mined by Algorithm 1 from Bio data set (with IG - impurity measure and min bucket=100.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. E1,1 604 p-val. 
0.000 0.693 789 0.000 0.663 0.655 0.641 0.634 212 182 567 772 0.000 0.000 0.000 0.000 0.591 0.484 0.442 0.418 182 151 76 107 0.000 0.000 0.000 0.000 Redescription (Common.Shrew < 0.5 ∧ Greater.W hite.toothed.Shrew < 0.5 ∧ Black.rat ≥ 0.5) ∨ (Common.Shrew < 0.5 ∧ Greater.W hite.toothed.Shrew ≥ 0.5) ←→ (t3 − max ≥ 10.55 ∧ t1 − avg ≥ 0.657) (Stoat < 0.5 ∧ Black.rat < 0.5 ∧ American.M ink < 0.5) ∨ (Stoat < 0.5 ∧ Black.rat ≥ 0.5) ←→ (t9 − max < 22.15 ∧ p8 < 51.75) ∨ (t9 − max ≥ 22.15) (Arctic.F ox < 0.5 ∧ N orway.lemming ≥ 0.5) ∨ (Arctic.F ox ≥ 0.5) ←→ (t8 − avg < 12.55 ∧ t9 − avg < 6.535) (W ood.Lemming ≥ 0.5) ←→ (t12 − max < −0.65 ∧ t5 − avg ≥ 3.195 ∧ t12 − avg < −6.125) (Common.Shrew < 0.5 ∧ House.mouse ≥ 0.5) ←→ (t1 − max ≥ 4.35 ∧ t3 − avg ≥ 6.235) (M oose < 0.5 ∧ Greater.Horseshoe.Bat < 0.5 ∧ Gray.W olf ≥ 0.5) ∨ (M oose < 0.5 ∧ Greater.Horseshoe.Bat ≥ 0.5) ←→ (t10 − max ≥ 14.25 ∧ t9 − min < 16.25) (Grey.Red.Backed.V ole ≥ 0.5) ←→ (t1 − min < −11.45 ∧ t7 − max < 20.05) (European.Hamster ≥ 0.5) ←→ (p10 < 45.15 ∧ p6 ≥ 61.85 ∧ p4 < 48.25) (Alpine.marmot ≥ 0.5) ←→ (p4 ≥ 51.75 ∧ p5 ≥ 97.35) (Alpine.Shrew ≥ 0.5) ←→ (p6 ≥ 86.85 ∧ p5 ≥ 90.15) Appendix A Redescription Sets from experiments with Bio Data Set J 0.701 88 Table A.2: Redescriptions mined by Algorithm 1 from Bio data set (with IG - impurity measure and min bucket=100.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. J 0.912 E1,1 1406 p-val. 0.000 0.876 1590 0.000 0.841 1671 0.000 0.823 1293 0.000 0.811 1368 0.000 0.803 1237 0.000 0.802 759 0.000 0.801 1180 0.000 Redescription (M editerranean.W ater.Shrew ≥ 0.5) ∧ (Alpine.Shrew ≥ 0.5) ∨ (M editerranean.W ater.Shrew < 0.5) ∧ (M oose < 0.5) ∧ (Arctic.F ox < 0.5) ←→ (p5 ≥ 58.65) ∧ (p6 ≥ 86.85) ∨ (p5 < 58.65) ∧ (t11 − max ≥ 6.85) ∧ (t9 − max ≥ 10.75) (Brown.rat < 0.5) ∧ (Common.Shrew ≥ 0.5) ∨ (Brown.rat ≥ 0.5) ∧ (Eurasian.W ater.Shrew ≥ 0.5) ∨ (Common.Shrew < 0.5) ∧ (American.M ink ≥ 0.5) ∨ (Eurasian.W ater.Shrew < 0.5) ∧ (M ountain.Hare ≥ 0.5) ←→ (p8 < 38.25) ∧ (p7 ≥ 46.1) ∨ (p8 ≥ 38.25) ∧ (t7 − avg < 20.25) ∨ (t7 − avg ≥ 20.25) ∧ (t9 − max < 17.35) (Eurasian.P ygmy.Shrew < 0.5) ∧ (American.M ink ≥ 0.5) ∨ (American.M ink < 0.5) ∧ (W ood.mouse < 0.5) ∨ (Eurasian.P ygmy.Shrew ≥ 0.5) ∧ (Lusitanian.P ine.V ole < 0.5) ∧ (Etruscan.Shrew < 0.5) ←→ (p8 < 49.55) ∧ (t3 − max < 5.45) ∨ (t3 − max ≥ 5.45) ∧ (t1 − max < 1.65) ∨ (p8 ≥ 49.55) ∧ (t2 − max < 8.85) ∧ (t11 − max < 10.95) (European.P ine.M arten ≥ 0.5) ∧ (Common.Shrew ≥ 0.5) ∨ (European.P ine.M arten < 0.5) ∧ (House.mouse < 0.5) ∧ (House.mouse.1 ≥ 0.5) ∨ (Common.Shrew < 0.5) ∧ (W ood.mouse < 0.5) ←→ (t1 − avg < 2.365) ∧ (t2 − max < 6.25) ∨ (t1 − avg ≥ 2.365) ∧ (t1 − max < 4.05) ∧ (p7 ≥ 36.75) ∨ (t2 − max ≥ 6.25) ∧ (t12 − max < 4.45) (Garden.dormouse ≥ 0.5) ∧ (Common.Shrew < 0.5) ∨ (Garden.dormouse < 0.5) ∧ (American.M ink ≥ 0.5) ∧ (Beech.M arten ≥ 0.5) ∨ (Garden.dormouse < 0.5) ∧ (American.M ink < 0.5) ∧ (Gray.Seal < 0.5) ←→ (t9 − max ≥ 18.95) ∧ (p7 < 61.85) ∨ (t9 − max < 18.95) ∧ (t8 − max < 22.25) ∧ (t5 − min ≥ 5.95) ∨ (t9 − max < 18.95) ∧ (t8 − max ≥ 22.25) ∧ (t5 − max ≥ 15.9) (Common.Shrew < 0.5) ∧ (House.mouse < 0.5) ∧ (House.mouse.1 ≥ 0.5) ∨ (Common.Shrew ≥ 0.5) ∧ (Kuhl.s.P ipistrelle < 0.5) ∧ (Chamois < 0.5) ←→ (t2 − max ≥ 7.25) ∧ (t1 − max < 3.55) ∧ (p7 ≥ 36.85) ∨ (t2 − max < 7.25) ∧ (p4 < 59.55) ∧ (p5 < 76.05) (Kuhl.s.P ipistrelle ≥ 0.5) ∧ (Alpine.marmot < 0.5) ∨ (Kuhl.s.P 
ipistrelle < 0.5) ∧ (Common.Shrew < 0.5) ∧ (House.mouse ≥ 0.5) ←→ (t3 − max ≥ 11.05) ∧ (t2 − min ≥ −3.95) ∨ (t3 − max < 11.05) ∧ (t3 − avg ≥ 6.375) ∧ (t1 − max ≥ 3.55) (European.W ater.V ole ≥ 0.5) ∧ (Common.Shrew ≥ 0.5) ∨ (European.W ater.V ole < 0.5) ∧ (House.mouse < 0.5) ∧ (M oose ≥ 0.5) ←→ (t1−max < 7.25)∧(t3−max < 10.55)∨(t1−max ≥ 7.25)∧(t12−max < 5.65)∧(t2−max < 0.45) Appendix A Redescription Sets from experiments with Bio Data Set Table A.3: Redescriptions mined by Algorithm 2 from Bio data set (with IG impurity measure and min bucket=50.) J - Jaccard similarity E1,1 support ; tn − {max; min; avg} stand for minimum, maximum, and average temperature of month n in degrees Celsius, and pn stands for average precipitation of month n in millimeters. 89 Appendix B Redescription Sets from experiments with DBLP data Set 91 E1,1 2335 2334 2334 2331 2270 4 4 4 5 3 3 5 4 3 10 p-val. 0.119 0.127 0.261 0.151 0.155 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 Redescription COLT ≥ 9.5 ∧ U AI < 0.5 ∨ COLT < 9.5 ←→ Y oavF reund < 1.5 COLT ≥ 10.5 ∧ ICM L < 3.5 ∨ COLT < 10.5 ←→ RobertE.Schapire < 3.5 V LDB ≥ 17.5 ∧ P ODS < 3.5 ∨ V LDB < 17.5 ←→ S.Sudarshan < 6 V LDB ≥ 18.5 ∧ SIGM ODConf erence ≥ 26.5 ∨ V LDB < 18.5 ←→ ShaulDar < 0.5 ST OC ≥ 8.5 ∧ SODA ≥ 5.5 ∨ ST OC < 8.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali < 0.5 ∨ AviW igderson < 0.5 ECM L ≥ 2.5 ∧ U AI ≥ 1.5 ←→ P eterGrunwald ≥ 0.5 W W W ≥ 3.5 ∧ F OCS ≥ 1.5 ←→ RaviKumar ≥ 7.5 U AI ≥ 2.5 ∧ KDD ≥ 2.5 ←→ T omiSilander ≥ 0.5 ICDE ≥ 12.5 ∧ EDBT < 3.5 ←→ AnthonyK.H.T ung ≥ 1.5 ∧ Jef f reyXuY u ≥ 0.5 KDD ≥ 6.5 ∧ ICDM ≥ 6.5 ←→ JiongY ang ≥ 3.5 F OCS ≥ 13.5 ∧ COLT ≥ 0.5 ←→ RichardM.Karp ≥ 1.5 ∧ AviW igderson ≥ 0.5 P ODS ≥ 8.5 ∧ SIGM ODConf erence ≥ 3.5 ←→ CatrielBeeri ≥ 1.5 ∧ HectorGarcia − M olina ≥ 0.5 ICDE ≥ 13.5 ∧ W W W < 1.5 ←→ AnthonyK.H.T ung ≥ 1.5 ∧ RaymondT.N g ≥ 0.5 V LDB ≥ 17.5 ∧ P ODS ≥ 3.5) ←→ S.Sudarshan ≥ 6 ST OC ≥ 8.5 ∧ SODA < 5.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali ≥ 0.5 Appendix B Redescription Sets from experiments with DBLP data Set J 0.998 0.997 0.997 0.996 0.972 0.571 0.500 0.500 0.500 0.429 0.333 0.313 0.308 0.273 0.133 92 Table B.1: Redescriptions mined by Algorithm 1 from DBLP data set (with DBSCAN binarization routine; Gini-impurity measure; min bucket = 5) LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support ; CON F [a−b] - author submitted from a to b papers for conference CONF . J 0.996 0.972 0.667 0.500 0.417 0.357 0.333 0.133 E1,1 2330 2270 4 4 5 5 4 10 p-val. 
0.158 0.155 0.000 0.000 0.000 0.000 0.000 0.000 Redescription KDD ≥ 6.5 ∧ SIGM ODConf erence < 2.5 ∨ KDD < 6.5 ←→ JianyongW ang < 1.5 ST OC ≥ 8.5 ∧ SODA ≥ 5.5) ∨ (ST OC < 8.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali < 0.5 ∨ AviW igderson < 0.5 U AI ≥ 23 ←→ DavidM axwellChickering ≥ 0.5 U AI ≥ 2.5 ∧ KDD ≥ 2.5) ←→ (T omiSilander ≥ 0.5 P ODS ≥ 11.5 ∧ ST OC < 0.5 ←→ F rankN even ≥ 0.5 V LDB ≥ 18.5 ∧ SIGM ODConf erence < 26.5 ←→ ShaulDar ≥ 0.5 F OCS ≥ 20.5) ←→ SilvioM icali ≥ 1.5 ∧ RonaldL.Rivest ≥ 0.5 ST OC ≥ 8.5 ∧ SODA < 5.5 ←→ AviW igderson ≥ 0.5 ∧ SilvioM icali >= 0.5 Appendix B Redescription Sets from experiments with DBLP data Set Table B.2: Redescriptions mined by Algorithm 1 from DBLP data set (with k-means (5 clusters) binarization routine; Gini-impurity measure; min bucket = 5 LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support ; CON F [a−b] - author submitted from a tob papers for conference CONF 93 E1,1 2 3 p-val. 0 0 1 1 1 1 1 1 1 1 1 1 3 1 2 3 1 5 0 0 0 0 0 0 0 0 1 1 1 0.667 1 1 5 2 0 0 0 0 Redescription ICDT ≥ 2.5 ∧ COLT ≥ 0.5 ←→ F otoN.Af rati ≥ 1.5 ∧ GeorgGottlob ≥ 0.5 ICDT ≥ 4.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 2.5 ←→ GostaGrahne < 2.5 ∧ KotagiriRamamohanarao ≥ 13 ∨ GostaGrahne ≥ 2.5 ∧ JigneshM.P atel < 0.5 ST ACS ≥ 14.5 ←→ LeenT orenvliet ≥ 7.5 P ODS ≥ 11.5 ∧ W W W ≥ 1.5 ←→ ArnaudSahuguet ≥ 6.5 W W W ≥ 3.5 ∧ F OCS ≥ 4 ←→ AndrewT omkins ≥ 8.5 P KDD ≥ 4.5 ∧ ECM L ≥ 2.5 ←→ LucDeRaedt ≥ 6.5 W W W ≥ 4.5 ∧ ICDM ≥ 3.5 ←→ BenyuZhang ≥ 12 ICDM ≥ 8.5 ∧ ICM L < 0.5 ←→ JiongY ang ≥ 2.5 ∧ P hilipS.Y u ≥ 8 ICDM ≥ 15.5 ←→ Kun − LungW u ≥ 24 U AI ≥ 23 ←→ DavidM axwellChickering < 0.5 ∧ BruceD0 Ambrosio ≥ 0.5 ∨ DavidM axwellChickering ≥ 0.5 ∧ P eterSpirtes < 1 V LDB ≥ 29.5 ←→ JigneshM.P atel ≥ 4.5 F OCS ≥ 28.5 ←→ BennySudakov ≥ 4.5 ECM L ≥ 2.5 ∧ U AI ≥ 1.5 ←→ P eterGrunwald < 1.5 ∧ StephenD.Bay ≥ 3.5 ∨ P eterGrunwald ≥ 1.5 ICDM ≥ 10 ←→ W eiF an ≥ 11 Appendix B Redescription Sets from experiments with DBLP data Set J 1 1 94 Table B.3: Redescriptions mined by Algorithm 1 from DBLP data set (with hierarchical clustering binarization routine; IG-impurity measure; Li min bucket = 100 ) LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 support ; CON F [a−b] - author submitted from a to b papers for conference CONF . J 0.985 E1,1 2301 p-val. 
0.125 0.978 2287 0.255 0.971 2272 0.279 0.959 745 0.000 0.959 1557 0.000 0.957 707 0.000 0.954 700 0.000 0.949 636 0.000 0.948 529 0.000 Redescription COLT < 0.5 ∧ SIGM ODConf erence < 0.5 ∨ COLT ≥ 0.5 ∧ F OCS ≥ 1 ∨ SIGM ODConf erence ≥ 0.5 ∧ V LDB < 21.5 ∨ F OCS < 1 ∧ ECM L < 1.5 ←→ RobertE.Schapire < 0.5 ∧ HamidP irahesh < 0.5 ∨ RobertE.Schapire ≥ 0.5 ∧ M ichaelJ.Kearns ≥ 1.5 ∨ HamidP irahesh ≥ 0.5 ∧ JohannesGehrke < 0.5 ∨ M ichaelJ.Kearns < 1.5 ∧ Stef anKramer < 0.5 COLT ≥ 0.5 ∨ COLT < 0.5 ∧ SIGM ODConf erence < 0.5 ∨ SIGM ODConf erence ≥ 0.5 ∧ V LDB < 21.5 ←→ RonittRubinf eld ≥ 2.5 ∨ RonittRubinf eld < 2.5 ∧ HamidP irahesh < 0.5 ∨ HamidP irahesh ≥ 0.5 ∧ JohannesGehrke < 0.5 ST ACS ≥ 1.5 ∨ ST ACS < 1.5 ∧ SIGM ODConf erence < 0.5 ∨ SIGM ODConf erence ≥ 0.5 ∧ V LDB < 21.5 ←→ LeszekGasieniec ≥ 2.5 ∨ LeszekGasieniec < 2.5 ∧ HamidP irahesh < 0.5 ∨ HamidP irahesh ≥ 0.5 ∧ JohannesGehrke < 0.5 SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ SODA < 1.5 ∨ SIGM ODConf erence ≥ 1.5 ∧ ICDE ≥ 3.5 ∧ KDD ≥ 3 ←→ H.V.Jagadish < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ M osesCharikar < 1.5 ∨ H.V.Jagadish ≥ 0.5 ∧ JiaweiHan ≥ 2.5 ∧ BengChinOoi ≥ 0.5 ST OC ≥ 1.5 ∧ F OCS < 0.5 ∨ ST OC < 1.5 ∧ SODA < 0.5 ∨ F OCS ≥ 0.5 ∧ SIGM ODConf erence ≥ 0.5 ∨ SODA ≥ 0.5 ∧ P ODS ≥ 1.5 ←→ F rankT homsonLeighton ≥ 0.5 ∧ AviW igderson < 0.5 ∨ F rankT homsonLeighton < 0.5 ∧ SergeA.P lotkin < 0.5 ∨ AviW igderson ≥ 0.5 ∧ CatrielBeeri ≥ 0.5 ∨ SergeA.P lotkin ≥ 0.5 ∧ M ayurDatar ≥ 0.5 ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 4.5 ∧ SIGM ODConf erence ≥ 1 ←→ F riedhelmM eyerauf derHeide < 1.5 ∧ AviW igderson ≥ 0.5 ∧ AlbertoO.M endelzon < 0.5 ∨ F riedhelmM eyerauf derHeide ≥ 1.5 ∧ SantoshV empala ≥ 0.5 ∧ M ayurDatar ≥ 2 ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 4.5 ∧ ST ACS < 0.5 ←→ V ijayV.V azirani < 0.5∧AviW igderson ≥ 0.5∧AlbertoO.M endelzon < 0.5∨V ijayV.V azirani ≥ 0.5∧P iotrIndyk ≥ 0.5∧T etsuoAsano < 0.5 ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ V LDB < 1.5 ∨ ST OC ≥ 1.5 ∧ SODA ≥ 4.5 ∧ COLT ≥ 2.5 ←→ M ichaelE.Saks < 0.5 ∧ AviW igderson ≥ 0.5 ∧ AlbertoO.M endelzon < 0.5 ∨ M ichaelE.Saks ≥ 0.5 ∧ AmosF iat ≥ 0.5 ∧ N aderH.Bshouty ≥ 0.5 F OCS < 2.5 ∧ ST OC ≥ 1.5 ∧ V LDB < 1.5 ∨ F OCS ≥ 2.5 ∧ SODA ≥ 6 ∧ ICDT ≥ 0.5 ←→ RobertW.F loyd < 0.5 ∧ AviW igderson ≥ 0.5∧RonaldF agin < 0.5∨RobertW.F loyd ≥ 0.5∧M ichaelA.Bender ≥ 0.5∧SanjeevKhanna ≥ 1.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support 95 E1,1 775 p-val. 
0.000 0.919 711 0.000 0.917 2061 0.000 0.904 2094 0.091 0.901 182 0.000 0.883 2032 0.022 0.879 2033 0.092 0.877 2028 0.099 0.876 2024 0.074 Redescription V LDB < 0.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ SDM < 0.5 ∨ V LDB ≥ 0.5 ∧ ICDE ≥ 7.5 ∧ ICDT < 0.5 ←→ ChristosF aloutsos < 0.5∧HamidP irahesh ≥ 0.5∧P hilipS.Y u < 2.5∨ChristosF aloutsos ≥ 0.5∧Kian−LeeT an ≥ 0.5 ∧ RaymondT.N g < 4 ST OC < 0.5 ∧ F OCS ≥ 0.5 ∧ SIGM ODConf erence < 0.5 ∨ ST OC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5 ←→ T omasF eder < 0.5 ∧ AviW igderson ≥ 0.5 ∧ CatrielBeeri < 0.5 ∨ T omasF eder ≥ 0.5 ∧ AmosF iat ≥ 0.5 ∧ SergeA.P lotkin < 1.5 U AI < 2.5 ∧ ICDE ≥ 0.5 ∧ V LDB < 5.5 ∨ U AI < 2.5 ∧ ICDE < 0.5 ∧ SIGM ODConf erence < 2.5 ∨ U AI ≥ 2.5 ∧ ECM L ≥ 2.5 ∧ P KDD < 0.5 ←→ T omiSilander < 0.5 ∧ GioW iederhold ≥ 0.5 ∧ M ichaelJ.Carey < 1.5 ∨ T omiSilander < 0.5 ∧ GioW iederhold < 0.5 ∧ HamidP irahesh < 0.5 ∨ T omiSilander ≥ 0.5 ∧ P eterGrunwald ≥ 1 ∧ P etriKontkanen ≥ 7 W W W ≥ 0.5 ∨ W W W < 0.5 ∧ F OCS < 0.5 ∨ F OCS ≥ 0.5 ∧ ST OC < 8.5 ←→ AmelieM arian ≥ 1.5 ∨ AmelieM arian < 1.5 ∧ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ M oniN aor < 0.5 SIGM ODConf erence ≥ 0.5∧ICM L < 0.5∨SIGM ODConf erence < 0.5∧V LDB ≥ 0.5∧ICDE ≥ 13.5∨ICM L ≥ 0.5 ∧ W W W < 0.5 ←→ W en − SyanLi ≥ 0.5 ∧ W ei − Y ingM a < 2 ∨ W en − SyanLi < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ LuisGravano ≥ 3.5 ∨ W ei − Y ingM a ≥ 2 ∧ Ji − RongW en < 0.5 COLT ≥ 1.5 ∧ SODA < 9.5 ∨ COLT < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence < 8.5 ∨ COLT < 1.5 ∧ ICDE < 0.5 ∧ V LDB < 0.5 ∨ SODA ≥ 9.5 ∧ F OCS < 3.5 ←→ LeslieG.V aliant ≥ 0.5 ∧ AmosF iat < 0.5 ∨ LeslieG.V aliant < 0.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ LeslieG.V aliant < 0.5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey < 0.5 ∨ AmosF iat ≥ 0.5 ∧ M artinF urer ≥ 0.5 ST ACS ≥ 2.5∨ST ACS < 2.5∧V LDB ≥ 0.5∧SIGM ODConf erence < 7.5∨ST ACS < 2.5∧V LDB < 0.5∧ICDE < 0.5 ←→ HansL.Bodlaender ≥ 1.5 ∨ HansL.Bodlaender < 1.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava < 0.5 ∨ HansL.Bodlaender < 1.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel < 7.5 ST ACS ≥ 1.5 ∨ ST ACS < 1.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence < 7.5 ∨ ST ACS < 1.5 ∧ V LDB < 0.5 ∧ ICDE < 0.5 ←→ AlanL.Selman ≥ 1.5 ∨ AlanL.Selman < 1.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava < 0.5 ∨ AlanL.Selman < 1.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel < 7.5 U AI ≥ 2.5 ∨ U AI < 2.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence < 8.5 ∨ U AI < 2.5 ∧ ICDE < 0.5 ∧ V LDB < 0.5 ←→ T omiSilander ≥ 1.5 ∨ T omiSilander < 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ T omiSilander < 1.5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey < 0.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set J 0.937 96 Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support J 0.872 E1,1 2014 p-val. 
0.052 0.865 198 0.000 0.843 532 0.000 0.841 660 0.000 0.839 0.822 0.821 1929 1900 630 0.026 0.095 0.000 0.816 248 0.000 0.816 624 0.000 0.815 0.809 1876 689 0.054 0.000 0.806 797 0.000 0.802 701 0.000 Redescription ST OC ≥ 1.5 ∧ F OCS < 5.5 ∨ ST OC < 1.5 ∧ SODA < 0.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ OdedGoldreich < 0.5 SIGM ODConf erence ≥ 0.5 ∧ W W W < 1.5 ∨ SIGM ODConf erence < 0.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 13.5 ←→ StanleyB.Zdonik ≥ 0.5 ∧ M anolisKoubarakis < 0.5 ∨ StanleyB.Zdonik < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ RaymondT.N g ≥ 0.5 ST OC ≥ 2.5 ∧ ICDE < 0.5 ∨ ST OC < 2.5 ∧ F OCS ≥ 1.5 ∧ SODA < 5.5 ∨ ICDE ≥ 0.5 ∧ KDD < 2 ←→ M oniN aor ≥ 0.5 ∧ SurajitChaudhuri < 0.5 ∨ M oniN aor < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 ∨ SurajitChaudhuri ≥ 0.5 ∧ SanjayAgrawal < 0.5 ST OC ≥ 2.5∧ICDE < 1.5∨ST OC < 2.5∧F OCS ≥ 0.5∧SODA < 5.5 ←→ RonittRubinf eld ≥ 0.5∧M icahAdler < 0.5 ∨ RonittRubinf eld < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 ST OC < 1.5 ∨ ST OC ≥ 1.5 ∧ F OCS < 2.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ OdedGoldreich < 0.5 KDD < 0.5 ←→ JiaweiHan < 0.5 ST OC ≥ 2.5∨ST OC < 2.5∧F OCS ≥ 0.5∧SODA < 5.5 ←→ M oniN aor ≥ 0.5∨M oniN aor < 0.5∧AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ EDBT < 0.5 ∨ SDM ≥ 1.5 ∧ KDD ≥ 2.5 ∧ SIGM ODConf erence ≥ 4.5 ←→ P hilipS.Y u < 4.5∧V ipinKumar ≥ 0.5∧P eerKroger < 1.5∨P hilipS.Y u ≥ 4.5∧JiaweiHan ≥ 2∧AidongZhang ≥ 2 F OCS ≥ 1.5 ∨ F OCS < 1.5 ∧ ST OC ≥ 0.5 ∧ SODA < 5.5 ←→ M adhuSudan ≥ 0.5 ∨ M adhuSudan < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 P ODS < 0.5 ←→ AbrahamSilberschatz < 0.5 V LDB ≥ 1.5 ∨ V LDB < 1.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ ICDM < 0.5 ←→ RakeshAgrawal ≥ 0.5 ∨ RakeshAgrawal < 0.5 ∧ HamidP irahesh ≥ 0.5 ∧ JiaweiHan < 0.5 SIGM ODConf erence < 2.5 ∧ V LDB < 0.5 ∧ ICDE ≥ 0.5 ∨ SIGM ODConf erence < 2.5 ∧ V LDB ≥ 0.5 ∧ SODA < 1.5 ∨ SIGM ODConf erence ≥ 2.5 ∧ P ODS ≥ 0.5 ∧ KDD < 0.5 ←→ H.V.Jagadish < 1.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans−P eterKriegel ≥ 7.5∨H.V.Jagadish < 1.5∧M ichaelJ.Carey ≥ 0.5∧M osesCharikar < 1.5∨H.V.Jagadish ≥ 1.5 ∧ CatrielBeeri ≥ 0.5 ∧ JiaweiHan < 2 SIGM ODConf erence ≥ 0.5 ∧ P KDD < 2.5 ∨ SIGM ODConf erence < 0.5 ∧ V LDB ≥ 0.5 ∧ SODA < 0.5 ←→ HectorGarcia − M olina ≥ 2.5 ∨ HectorGarcia − M olina < 2.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ P iotrIndyk < 0.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support 97 E1,1 617 p-val. 
0.000 0.786 279 0.000 0.734 80 0.000 0.728 560 0.000 0.674 58 0.000 0.663 690 0.000 0.649 0.640 1488 290 0.084 0.000 0.637 1399 0.000 0.620 1412 0.125 0.618 194 0.000 Redescription ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA < 5.5 ←→ ChristianSohler ≥ 0.5 ∨ ChristianSohler < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 EDBT ≥ 0.5 ∧ F OCS < 4.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ∨ F OCS ≥ 4.5 ∧ ST OC < 11.5 ←→ P eerKroger ≥ 2 ∧ AvrimBlum < 0.5 ∨ P eerKroger < 2 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5 ∨ AvrimBlum ≥ 0.5 ∧ V enkatesanGuruswami < 0.5 ICM L ≥ 3.5 ∧ ICDE < 0.5 ∨ ICM L < 3.5 ∧ ECM L ≥ 1 ∧ P KDD ≥ 1.5 ∨ ICDE ≥ 0.5 ∧ ICDM < 0.5 ←→ DoinaP recup ≥ 1.5 ∧ JiaweiHan < 0.5 ∨ DoinaP recup < 1.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 ∨ JiaweiHan ≥ 0.5 ∧ W ei − Y ingM a < 0.5 ST OC ≥ 1.5 ∧ V LDB < 2.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA < 5.5 ←→ RanCanetti ≥ 0.5 ∨ RanCanetti < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 ECM L ≥ 2.5 ∧ V LDB < 0.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ∨ V LDB ≥ 0.5 ∧ ICDT < 3 ←→ Stef anKramer ≥ 1.5 ∧ JianP ei < 0.5 ∨ Stef anKramer < 1.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥ 0.5 ∨ JianP ei ≥ 0.5 ∧ M ichaelBenedikt < 0.5 SIGM ODConf erence ≥ 1.5∨SIGM ODConf erence < 1.5∧V LDB < 0.5∧ICDE ≥ 0.5∨SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ P ODS < 1.5 ←→ Jef f reyF.N aughton ≥ 0.5 ∨ Jef f reyF.N aughton < 0.5 ∧ M ichaelJ.Carey < 0.5 ∧ Hans − P eterKriegel ≥ 7.5 ∨ Jef f reyF.N aughton < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ CatrielBeeri < 0.5 ICDE < 0.5 ←→ SurajitChaudhuri < 0.5 ST OC ≥ 3.5 ∨ ST OC < 3.5 ∧ F OCS ≥ 2.5 ∧ SODA < 5.5 ←→ IgorShparlinski ≥ 0.5 ∨ IgorShparlinski < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 F OCS < 10.5 ∧ ST OC < 6.5 ∧ SIGM ODConf erence < 0.5 ∨ F OCS < 10.5 ∧ ST OC ≥ 6.5 ∧ ICDT < 0.5 ∨ F OCS ≥ 10.5 ∧ SODA ≥ 2.5 ∧ V LDB ≥ 3 ←→ RichardM.Karp < 1.5 ∧ AviW igderson < 0.5 ∧ HamidP irahesh < 0.5 ∨ RichardM.Karp < 1.5 ∧ AviW igderson ≥ 0.5 ∧ HectorGarcia − M olina < 0.5 ∨ RichardM.Karp ≥ 1.5 ∧ S.M uthukrishnan ≥ 0.5 ∧ JosephN aor ≥ 3.5 F OCS ≥ 5.5 ∨ F OCS < 5.5 ∧ ST OC < 5.5 ∧ SIGM ODConf erence < 0.5 ∨ F OCS < 5.5 ∧ ST OC ≥ 5.5 ∧ SODA < 5.5 ←→ RyanO0 Donnell ≥ 1.5 ∨ RyanO0 Donnell < 1.5 ∧ AviW igderson < 0.5 ∧ HamidP irahesh < 0.5 ∨ RyanO0 Donnell < 1.5 ∧ AviW igderson ≥ 0.5 ∧ M onikaRauchHenzinger < 0.5 SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ←→ P hilipS.Y u ≥ 4.5 ∨ P hilipS.Y u < 4.5 ∧ V ipinKumar ≥ 0.5 ∧ SunilP rabhakar < 1.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set J 0.797 98 Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support J 0.616 E1,1 1397 p-val. 
0.105 0.606 43 0.000 0.576 182 0.000 0.564 211 0.000 0.559 170 0.000 0.541 199 0.000 0.537 29 0.000 0.451 32 0.000 0.411 92 0.000 0.400 6 0.000 0.389 7 0.000 Redescription ST OC ≥ 5.5 ∨ ST OC < 5.5 ∧ F OCS < 4.5 ∧ SIGM ODConf erence < 0.5 ∨ ST OC < 5.5 ∧ F OCS ≥ 4.5 ∧ SODA < 5.5 ←→ JamesR.Lee ≥ 3 ∨ JamesR.Lee < 3 ∧ AviW igderson < 0.5 ∧ HamidP irahesh < 0.5 ∨ JamesR.Lee < 3 ∧ AviW igderson ≥ 0.5 ∧ M osesCharikar < 0.5 ECM L ≥ 0.5 ∧ EDBT < 0.5 ∨ ECM L < 0.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ∨ EDBT ≥ 0.5 ∧ SIGM ODConf erence < 3 ←→ M icheleSebag ≥ 0.5 ∧ H.V.Jagadish < 0.5 ∨ M icheleSebag < 0.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥ 0.5 ∨ H.V.Jagadish ≥ 0.5 ∧ M ong − LiLee < 2 SODA ≥ 1.5 ∧ COLT < 5 ∨ SODA < 1.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ Y airBartal ≥ 0.5 ∧ LeonardP itt < 0.5 ∨ Y airBartal < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5 SODA ≥ 0.5 ∧ ST ACS < 2.5 ∨ SODA < 0.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ N icoleImmorlica ≥ 0.5 ∧ HarryBuhrman < 0.5 ∨ N icoleImmorlica < 0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5 SODA ≥ 0.5∧COLT < 5∨SODA < 0.5∧ST OC ≥ 0.5∧F OCS ≥ 6.5 ←→ BorisAronov ≥ 0.5∧RobertE.Schapire < 0.5 ∨ BorisAronov < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5 SODA ≥ 0.5 ∧ COLT < 0.5 ∨ SODA < 0.5 ∧ ST OC ≥ 1 ∧ F OCS ≥ 6.5 ←→ M ichielH.M.Smid ≥ 1.5 ∧ M anf redK.W armuth < 1.5 ∨ M ichielH.M.Smid < 1.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5 ICM L ≥ 0.5 ∧ V LDB < 0.5 ∨ ICM L < 0.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ∨ V LDB ≥ 0.5 ∧ W W W < 3 ←→ P eterA.F lach ≥ 0.5 ∧ KrithiRamamritham < 1 ∨ P eterA.F lach < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 ∨ KrithiRamamritham ≥ 1 ∧ ArvindHulgeri < 1.5 ICM L ≥ 0.5 ∧ F OCS < 0.5 ∨ ICM L < 0.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ F ernandoC.N.P ereira ≥ 0.5 ∧ EricAllender < 0.5 ∨ F ernandoC.N.P ereira < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 SDM ≥ 0.5 ∧ SIGM ODConf erence < 4 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ∨ SIGM ODConf erence ≥ 4 ∧ V LDB < 2.5 ←→ ArindamBanerjee ≥ 2 ∧ M artinL.Kersten < 0.5 ∨ ArindamBanerjee < 2 ∧ V ipinKumar ≥ 0.5 ∧ P hilipS.Y u ≥ 0.5 ∨ M artinL.Kersten ≥ 0.5 ∧ JiaweiHan ≥ 2.5 P ODS < 0.5 ∧ ICDT ≥ 0.5 ∧ V LDB ≥ 20.5 ∨ P ODS ≥ 0.5 ∧ ICDE < 0.5 ∧ W W W ≥ 3.5 ←→ RichardA.DeM illo < 1.5∧Y ehoshuaSagiv ≥ 0.5∧SunitaSarawagi ≥ 0.5∨RichardA.DeM illo ≥ 1.5∧RaviKumar ≥ 0.5∧D.Sivakumar ≥ 1.5 KDD ≥ 8.5 ∧ SDM ≥ 0.5 ∧ SIGM ODConf erence ≥ 2.5 ←→ KeW ang ≥ 1.5 ∧ P hilipS.Y u ≥ 2.5 ∧ HaixunW ang ≥ 0.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support 99 E1,1 50 p-val. 
0.000 0.318 7 0.000 0.300 3 0.000 0.299 38 0.000 0.261 0.260 6 54 0.000 0.000 0.250 22 0.000 0.226 0.226 19 40 0.000 0.000 0.207 35 0.000 0.207 25 0.000 0.205 40 0.000 0.194 7 0.000 Redescription ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS < 0.5 ∧ ICDT ≥ 3.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA ≥ 11.5 ←→ AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon < 0.5 ∧ EmmanuelW aller ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon ≥ 0.5 ∧ EstherM.Arkin ≥ 0.5 W W W ≥ 2.5 ∨ W W W < 2.5 ∧ ICDE ≥ 21.5 ∧ P KDD ≥ 1.5 ←→ D.Sivakumar ≥ 1.5 ∨ D.Sivakumar < 1.5 ∧ Xif engY an ≥ 4.5 ∧ W eiW ang ≥ 4 ICDE ≥ 12.5 ∧ EDBT ≥ 4.5 ∧ SIGM ODConf erence ≥ 19.5 ←→ AnthonyK.H.T ung ≥ 1.5 ∧ ShaulDar ≥ 1.5 ∧ AlonY.Levy ≥ 0.5 ST OC ≥ 1.5 ∨ ST OC < 1.5 ∧ F OCS < 0.5 ∧ ICDT ≥ 3.5 ∨ ST OC < 1.5 ∧ F OCS ≥ 0.5 ∧ SODA ≥ 11.5 ←→ AviW igderson ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon < 0.5 ∧ EmmanuelW aller ≥ 0.5 ∨ AviW igderson < 0.5 ∧ N ogaAlon ≥ 0.5 ∧ M artinF arach ≥ 2.5 ICDT ≥ 3.5 ←→ EmmanuelW aller ≥ 0.5 SODA ≥ 5.5 ∨ SODA < 5.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 7.5 ←→ M osesCharikar ≥ 0.5 ∨ M osesCharikar < 0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5 P ODS ≥ 2.5 ∨ P ODS < 2.5 ∧ ICDT ≥ 0.5 ∧ ST OC ≥ 10.5 ←→ CatrielBeeri ≥ 0.5 ∨ CatrielBeeri < 0.5 ∧ W angChiewT an ≥ 0.5 ∧ DonCoppersmith ≥ 0.5 COLT ≥ 3.5 ←→ M anf redK.W armuth ≥ 0.5 SODA < 3.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ∨ SODA ≥ 3.5 ∧ P ODS ≥ 0.5 ∧ ICDT < 0.5 ←→ StevenSkiena < 0.5 ∧ AviW igderson ≥ 0.5∧OdedGoldreich ≥ 0.5∨StevenSkiena ≥ 0.5∧EstherM.Arkin ≥ 0.5∧RajmohanRajaraman < 1.5 ST ACS ≥ 2.5 ∨ ST ACS < 2.5 ∧ F OCS ≥ 0.5 ∧ ST OC ≥ 8.5 ←→ U weSchoning ≥ 0.5 ∨ U weSchoning < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M oniN aor ≥ 0.5 ECM L ≥ 2.5 ∨ ECM L < 2.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→ Stef anKramer ≥ 0.5 ∨ Stef anKramer < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ Stef anKramer < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5 SODA ≥ 0.5 ∨ SODA < 0.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ ClaireKenyon ≥ 0.5 ∨ ClaireKenyon < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5 ICDM ≥ 3.5 ∧ ICDE ≥ 4.5 ∨ ICDM < 3.5 ∧ SDM ≥ 0.5 ∧ SIGM ODConf erence ≥ 2.5 ←→ ShengM a ≥ 0.5 ∧ RaymondT.N g ≥ 0.5 ∨ ShengM a < 0.5 ∧ P hilipS.Y u ≥ 2.5 ∧ Jef f reyXuY u ≥ 1.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set J 0.318 100 Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support J 0.194 E1,1 38 p-val. 
0.000 0.191 43 0.000 0.190 30 0.000 0.189 165 0.000 0.183 11 0.000 0.182 68 0.000 0.180 37 0.000 0.162 23 0.000 0.158 40 0.000 101 Redescription SODA ≥ 4.5 ∨ SODA < 4.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ HaimKaplan ≥ 0.5 ∨ HaimKaplan < 0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5 P ODS ≥ 0.5 ∨ P ODS < 0.5 ∧ ICDT < 0.5 ∧ V LDB ≥ 5.5 ∨ P ODS < 0.5 ∧ ICDT ≥ 0.5 ∧ SIGM ODConf erence ≥ 6.5 ←→ AbrahamSilberschatz ≥ 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ Y ehoshuaSagiv < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∧ DiveshSrivastava ≥ 0.5 U AI ≥ 0.5 ∨ U AI < 0.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ U AI < 0.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→ CraigBoutilier ≥ 0.5 ∨ CraigBoutilier < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ CraigBoutilier < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5 SIGM ODConf erence ≥ 4.5∨SIGM ODConf erence < 4.5∧V LDB < 5.5∧ICDE ≥ 0.5∨SIGM ODConf erence < 4.5 ∧ V LDB ≥ 5.5 ∧ W W W < 1.5 ←→ RajeevRastogi ≥ 1.5 ∨ RajeevRastogi < 1.5 ∧ P hilipA.Bernstein < 0.5 ∧ JiaweiHan ≥ 0.5 ∨ RajeevRastogi < 1.5 ∧ P hilipA.Bernstein ≥ 0.5 ∧ KevinChen − ChuanChang < 1 ECM L ≥ 2.5 ∨ ECM L < 2.5 ∧ ICM L ≥ 0.5 ∧ U AI ≥ 3.5 ←→ P eterGrunwald ≥ 1.5 ∨ P eterGrunwald < 1.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥ 0.5 P ODS ≥ 0.5 ∧ SIGM ODConf erence ≥ 1.5 ∨ P ODS < 0.5 ∧ ICDT < 0.5 ∧ V LDB ≥ 2.5 ∨ P ODS < 0.5 ∧ ICDT ≥ 0.5 ∧ F OCS ≥ 5.5 ∨ SIGM ODConf erence < 1.5 ∧ ICDE < 1.5 ←→ AbrahamSilberschatz ≥ 0.5 ∧ AlbrechtSchmidt < 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ SergeAbiteboul < 0.5 ∧ RakeshAgrawal ≥ 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ SergeAbiteboul ≥ 0.5 ∧ M ihalisY annakakis ≥ 1.5 ∨ AlbrechtSchmidt ≥ 0.5 ∧ GioW iederhold < 0.5 SODA ≥ 2.5 ∨ SODA < 2.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ RichardM.Karp ≥ 2.5 ∨ RichardM.Karp < 2.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5 ECM L ≥ 1.5 ∨ ECM L < 1.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ ECM L < 1.5 ∧ ICM L ≥ 0.5 ∧ F OCS ≥ 0.5 ←→ JuhoRousu ≥ 0.5 ∨ JuhoRousu < 0.5 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ JuhoRousu < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5 SIGM ODConf erence ≥ 1.5 ∨ SIGM ODConf erence < 1.5 ∧ V LDB < 0.5 ∧ U AI ≥ 2.5 ∨ SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 8.5 ←→ HectorGarcia − M olina ≥ 0.5 ∨ HectorGarcia − M olina < 0.5 ∧ M ichaelJ.Carey < 0.5 ∧ CraigBoutilier ≥ 0.5 ∨ HectorGarcia − M olina < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ BengChinOoi ≥ 0.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support E1,1 37 p-val. 
0.000 0.157 22 0.000 0.152 20 0.000 0.146 37 0.000 0.143 0.142 2 26 0.000 0.000 0.141 9 0.000 0.125 35 0.000 0.124 11 0.000 0.117 49 0.000 0.115 18 0.000 Redescription SODA ≥ 1.5 ∨ SODA < 1.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ GerthStoltingBrodal ≥ 0.5 ∨ GerthStoltingBrodal < 0.5 ∧ AviW igderson ≥ 0.5 ∧ M adhuSudan ≥ 0.5 U AI ≥ 0.5 ∨ U AI < 0.5 ∧ ICM L < 0.5 ∧ COLT ≥ 3.5 ∨ U AI < 0.5 ∧ ICM L ≥ 0.5 ∧ ST OC ≥ 0.5 ←→ W olf − T iloBalke ≥ 3 ∨ W olf − T iloBalke < 3 ∧ RobertE.Schapire < 0.5 ∧ Rolf W iehagen ≥ 0.5 ∨ W olf − T iloBalke < 3 ∧ RobertE.Schapire ≥ 0.5 ∧ M anf redK.W armuth ≥ 0.5 ICDE ≥ 3.5 ∨ ICDE < 3.5 ∧ V LDB < 0.5 ∧ ST ACS ≥ 8.5 ∨ ICDE < 3.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence ≥ 13.5 ←→ N arainH.Gehani ≥ 0.5 ∨ N arainH.Gehani < 0.5 ∧ DavidJ.DeW itt < 0.5 ∧ StephenA.F enner ≥ 0.5 ∨ N arainH.Gehani < 0.5 ∧ DavidJ.DeW itt ≥ 0.5 ∧ Jef f reyF.N aughton ≥ 0.5 SODA ≥ 2.5 ∨ SODA < 2.5 ∧ ST OC ≥ 0.5 ∧ F OCS ≥ 6.5 ←→ V ahabS.M irrokni ≥ 0.5 ∨ V ahabS.M irrokni < 0.5 ∧ AviW igderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5 P KDD ≥ 4.5 ←→ ArnoJ.Knobbe ≥ 0.5 W W W ≥ 2.5 ∨ W W W < 2.5 ∧ V LDB < 0.5 ∧ SIGM ODConf erence ≥ 1.5 ∨ W W W < 2.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 8.5 ←→ DanielGruhl ≥ 0.5 ∨ DanielGruhl < 0.5 ∧ M ichaelJ.Carey < 0.5 ∧ HectorGarcia − M olina ≥ 0.5 ∨ DanielGruhl < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ BengChinOoi ≥ 0.5 ICM L ≥ 4.5 ∨ ICM L < 4.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ DavidP age ≥ 1.5 ∨ DavidP age < 1.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 EDBT < 0.5 ∧ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ∨ EDBT ≥ 0.5 ∧ ICDT ≥ 0.5 ∧ F OCS ≥ 0.5 ←→ BernhardSeeger < 2.5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ BernhardSeeger < 2.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5 ∨ BernhardSeeger ≥ 2.5 ∧ DanSuciu ≥ 0.5 ∧ SophieCluet ≥ 0.5 ICM L ≥ 4.5 ∨ ICM L < 4.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ T omM.M itchell ≥ 0.5 ∨ T omM.M itchell < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 1.5 KDD ≥ 3.5 ∨ KDD < 3.5 ∧ ICDM < 1.5 ∧ ICDE ≥ 2.5 ∨ KDD < 3.5 ∧ ICDM ≥ 1.5 ∧ V LDB ≥ 2.5 ←→ SugatoBasu ≥ 2.5 ∨ SugatoBasu < 2.5 ∧ HaixunW ang < 0.5 ∧ DiveshSrivastava ≥ 0.5 ∨ SugatoBasu < 2.5 ∧ HaixunW ang ≥ 0.5 ∧ Jef f reyXuY u ≥ 1.5 SIGM ODConf erence ≥ 1.5 ∨ SIGM ODConf erence < 1.5 ∧ V LDB ≥ 0.5 ∧ ICDE ≥ 11.5 ←→ HamidP irahesh ≥ 0.5 ∨ HamidP irahesh < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ AnthonyK.H.T ung ≥ 1.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set J 0.157 102 Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support J 0.114 E1,1 39 p-val. 
0.000 0.110 42 0.000 0.108 70 0.000 0.108 93 0.000 0.102 33 0.000 0.099 9 0.000 0.095 20 0.000 0.091 32 0.000 0.089 49 0.000 0.088 77 0.000 103 Redescription W W W ≥ 1.5 ∨ W W W < 1.5 ∧ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∨ W W W < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ←→ DanielF.Lieuwen ≥ 5 ∨ DanielF.Lieuwen < 5 ∧ SurajitChaudhuri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ DanielF.Lieuwen < 5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5 SIGM ODConf erence < 4.5∧V LDB < 5.5∧EDBT ≥ 0.5∨SIGM ODConf erence < 4.5∧V LDB ≥ 5.5∧P ODS ≥ 1.5 ∨ SIGM ODConf erence ≥ 4.5 ∧ ICDE ≥ 2.5 ∧ SDM < 1.5 ←→ AristidesGionis < 0.5 ∧ H.V.Jagadish < 0.5 ∧ N ickKoudas ≥ 0.5 ∨ AristidesGionis < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∨ AristidesGionis ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∧ Kun − LungW u < 1 ICDT ≥ 1.5 ∨ ICDT < 1.5 ∧ P ODS < 3.5 ∧ V LDB ≥ 1.5 ∨ ICDT < 1.5 ∧ P ODS ≥ 3.5 ∧ SIGM ODConf erence ≥ 2.5 ←→ GeorgGottlob ≥ 0.5 ∨ GeorgGottlob < 0.5 ∧ CatrielBeeri < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∨ GeorgGottlob < 0.5 ∧ CatrielBeeri ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ICDM ≥ 0.5 ∧ ICM L < 2.5 ∨ ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ←→ V ipinKumar ≥ 0.5 ∧ DmitryP avlov < 2.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan < 0.5 ∧ GioW iederhold ≥ 0.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5 EDBT ≥ 0.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGM ODConf erence ≥ 8.5 ←→ RakeshAgrawal ≥ 1.5 ∨ RakeshAgrawal < 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RaghuRamakrishnan ≥ 0.5 ICM L ≥ 1.5 ∨ ICM L < 1.5 ∧ ECM L ≥ 0.5 ∧ P KDD ≥ 1.5 ←→ V ladimirV apnik ≥ 0.5 ∨ V ladimirV apnik < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ Stef anKramer ≥ 0.5 SDM ≥ 0.5 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 1.5 ←→ HuanLiu ≥ 1.5 ∨ HuanLiu < 1.5 ∧ W eiF an ≥ 0.5 ∧ HaixunW ang ≥ 0.5 ICDE ≥ 0.5 ∨ ICDE < 0.5 ∧ V LDB ≥ 0.5 ∧ SIGM ODConf erence ≥ 3.5 ←→ GerhardW eikum ≥ 0.5 ∨ GerhardW eikum < 0.5 ∧ M ichaelJ.Carey ≥ 0.5 ∧ H.V.Jagadish ≥ 1.5 V LDB ≥ 5.5 ∨ V LDB < 5.5 ∧ SIGM ODConf erence < 4.5 ∧ ICDE ≥ 1.5 ∨ V LDB < 5.5 ∧ SIGM ODConf erence ≥ 4.5∧P ODS ≥ 0.5 ←→ LuizT ucherman ≥ 4∨LuizT ucherman < 4∧H.V.Jagadish < 0.5∧AhmedK.Elmagarmid ≥ 0.5 ∨ LuizT ucherman < 4 ∧ H.V.Jagadish ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ICDM ≥ 1.5 ∨ ICDM < 1.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 1.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ←→ HaixunW ang ≥ 0.5 ∨ HaixunW ang < 0.5 ∧ JiaweiHan < 1.5 ∧ GioW iederhold ≥ 0.5 ∨ HaixunW ang < 0.5 ∧ JiaweiHan ≥ 1.5 ∧ SurajitChaudhuri ≥ 0.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support E1,1 8 p-val. 
0.000 0.085 17 0.000 0.084 55 0.000 0.084 40 0.000 0.082 36 0.000 0.081 66 0.000 0.080 69 0.000 0.074 32 0.000 0.074 52 0.000 Redescription ICM L ≥ 0.5 ∨ ICM L < 0.5 ∧ ECM L ≥ 0.5 ∧ U AI ≥ 8 ←→ P eterA.F lach ≥ 0.5 ∨ P eterA.F lach < 0.5 ∧ StephenM uggleton ≥ 0.5 ∧ DaphneKoller ≥ 2.5 P KDD ≥ 0.5 ∨ P KDD < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 2.5 ←→ XingquanZhu ≥ 3 ∨ XingquanZhu < 3 ∧ JiaweiHan ≥ 1.5 ∧ RaymondT.N g ≥ 0.5 EDBT ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE < 2.5 ∧ V LDB ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE ≥ 2.5 ∧ SIGM ODConf erence ≥ 1.5 ←→ ElenaBaralis ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish < 0.5 ∧ BruceG.Lindsay ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ HectorGarcia − M olina ≥ 0.5 V LDB ≥ 5.5∨V LDB < 5.5∧SIGM ODConf erence < 4.5∧P ODS ≥ 0.5∨V LDB < 5.5∧SIGM ODConf erence ≥ 4.5 ∧ EDBT ≥ 0.5 ←→ W eiminDu ≥ 0.5 ∨ W eiminDu < 0.5 ∧ AbrahamSilberschatz < 0.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∨ W eiminDu < 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∧ RakeshAgrawal ≥ 1.5 SIGM ODConf erence ≥ 5.5∨SIGM ODConf erence < 5.5∧V LDB < 5.5∧P ODS ≥ 0.5∨SIGM ODConf erence < 5.5 ∧ V LDB ≥ 5.5 ∧ ICDT ≥ 0.5 ←→ RaymondT.N g ≥ 0.5 ∨ RaymondT.N g < 0.5 ∧ RaghuRamakrishnan < 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∨ RaymondT.N g < 0.5 ∧ RaghuRamakrishnan ≥ 0.5 ∧ Jef f reyD.U llman ≥ 1.5 ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ∨ ICDM ≥ 0.5 ∧ SDM ≥ 0.5 ∧ EDBT ≥ 1.5 ←→ V ipinKumar < 0.5 ∧ JiaweiHan < 1.5 ∧ GioW iederhold ≥ 0.5 ∨ V ipinKumar < 0.5 ∧ JiaweiHan ≥ 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∨ V ipinKumar ≥ 0.5 ∧ P hilipS.Y u ≥ 0.5 ∧ XueminLin ≥ 0.5 P KDD ≥ 3.5 ∨ P KDD < 3.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ P KDD < 3.5 ∧ KDD ≥ 0.5 ∧ V LDB ≥ 0.5 ←→ M arcSebban ≥ 0.5 ∨ M arcSebban < 0.5 ∧ JiaweiHan < 0.5 ∧ GioW iederhold ≥ 0.5 ∨ M arcSebban < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5 SIGM ODConf erence < 3.5 ∧ V LDB < 5.5 ∧ ICDE ≥ 1.5 ∨ SIGM ODConf erence < 3.5 ∧ V LDB ≥ 5.5 ∧ KDD ≥ 6.5 ∨ SIGM ODConf erence ≥ 3.5 ∧ P ODS ≥ 0.5 ∧ ICDT ≥ 2 ←→ SurajitChaudhuri < 1.5 ∧ H.V.Jagadish < 0.5 ∧ N ickKoudas ≥ 0.5 ∨ SurajitChaudhuri < 1.5 ∧ H.V.Jagadish ≥ 0.5 ∧ KeW ang ≥ 5 ∨ SurajitChaudhuri ≥ 1.5 ∧ RaghuRamakrishnan ≥ 2 ∧ AlbertoO.M endelzon ≥ 0.5 EDBT < 3.5 ∧ ICDE < 4.5 ∧ SIGM ODConf erence ≥ 0.5 ∨ EDBT < 3.5 ∧ ICDE ≥ 4.5 ∧ P ODS ≥ 11 ∨ EDBT ≥ 3.5 ∧ V LDB ≥ 6.5 ∧ F OCS < 4.5 ←→ GustavoAlonso < 5 ∧ SurajitChaudhuri < 0.5 ∧ HamidP irahesh ≥ 0.5 ∨ GustavoAlonso < 5 ∧ SurajitChaudhuri ≥ 0.5 ∧ S.Sudarshan ≥ 7.5 ∨ GustavoAlonso ≥ 5 ∧ RadekV ingralek ≥ 0.5 ∧ M onikaRauchHenzinger < 0.5 Continued Appendix B Redescription Sets from experiments with DBLP data Set J 0.086 104 Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support J 0.071 E1,1 53 p-val. 
0.000 0.068 20 0.000 0.065 20 0.000 0.063 32 0.000 0.063 50 0.000 Redescription P ODS ≥ 1.5 ∨ P ODS < 1.5 ∧ ICDT < 0.5 ∧ SIGM ODConf erence ≥ 0.5 ∨ P ODS < 1.5 ∧ ICDT ≥ 0.5 ∧ V LDB ≥ 20.5 ←→ M ichaelKif er ≥ 2.5 ∨ M ichaelKif er < 2.5 ∧ Y ehoshuaSagiv < 0.5 ∧ HamidP irahesh ≥ 0.5 ∨ M ichaelKif er < 2.5 ∧ Y ehoshuaSagiv ≥ 0.5 ∧ SunitaSarawagi ≥ 0.5 KDD ≥ 0.5 ∨ KDD < 0.5 ∧ ICDM ≥ 0.5 ∧ V LDB ≥ 0.5 ←→ Y asuhikoM orimoto ≥ 0.5 ∨ Y asuhikoM orimoto < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ JianP ei ≥ 0.5 ICDM ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 0.5 ←→ Geof f reyI.W ebb ≥ 0.5 ∨ Geof f reyI.W ebb < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ P hilipS.Y u ≥ 3.5 SIGM ODConf erence ≥ 4.5∨SIGM ODConf erence < 4.5∧V LDB < 4.5∧ICDE ≥ 1.5∨SIGM ODConf erence < 4.5 ∧ V LDB ≥ 4.5 ∧ P ODS ≥ 1.5 ←→ SetragKhoshaf ian ≥ 2.5 ∨ SetragKhoshaf ian < 2.5 ∧ DiveshSrivastava < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∨ SetragKhoshaf ian < 2.5 ∧ DiveshSrivastava ≥ 0.5 ∧ Jef f reyD.U llman ≥ 1.5 V LDB ≥ 11.5 ∧ W W W ≥ 0.5 ∨ V LDB < 11.5 ∧ SIGM ODConf erence < 8.5 ∧ ICDE ≥ 0.5 ∨ V LDB < 11.5 ∧ SIGM ODConf erence ≥ 8.5 ∧ P ODS ≥ 3.5 ←→ Alf onsKemper ≥ 7.5 ∧ JunY ang ≥ 0.5 ∨ Alf onsKemper < 7.5 ∧ RakeshAgrawal < 0.5 ∧ GioW iederhold ≥ 0.5 ∨ Alf onsKemper < 7.5 ∧ RakeshAgrawal ≥ 0.5 ∧ F lipKorn ≥ 2 Appendix B Redescription Sets from experiments with DBLP data Set Table B.4: Redescriptions mined by Algorithm 2 from DBLP data set (with DBSCAN binarization routine; IG-impurity measure; min bucket= P Li 100 ); LHS is a left-hand side part of the redescription; RHS is a right hand-side part of a redescription; J - Jaccard similarity E1,1 - support 105 E1,1 7 5 p-val. 0.000 0.000 1.000 1.000 2 3 0.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 3 2 1 1 1 4 0.000 0.000 0.000 0.000 0.000 0.000 1.000 1.000 1.000 0.969 1 1 1 31 0.000 0.000 0.000 0.000 0.951 0.949 2210 2217 0.067 0.284 0.947 2194 0.021 Redescription F OCS ≥ 19.5 ∧ ST OC ≥ 15.5 ∧ ICDT < 1.5 ←→ JohanHastad ≥ 2.5 ∧ SilvioM icali ≥ 0.5 ∧ KunalT alwar < 1 V LDB < 20.5 ∧ SIGM ODConf erence ≥ 21 ∧ EDBT ≥ 7 ∨ V LDB ≥ 20.5 ∧ P ODS ≥ 8.5 ∧ F OCS < 0.5 ←→ S.Sudarshan < 7.5 ∧ N arainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 1.5 ∨ S.Sudarshan ≥ 7.5 ∧ JohannesGehrke ≥ 0.5 ∧ M inosN.Garof alakis < 3 V LDB ≥ 26.5 ∧ KDD ≥ 0.5 ∧ W W W < 3 ←→ N arainH.Gehani ≥ 4.5 ∧ JigneshM.P atel ≥ 2.5 ∧ SaktiP.Ghosh < 1 ICDT < 7.5 ∧ P ODS ≥ 21.5 ∧ V LDB ≥ 4 ∨ ICDT ≥ 7.5 ∧ SODA ≥ 0.5 ∧ SDM < 0.5 ←→ IoanaM anolescu < 3.5∧Y aronKanza ≥ 10.5∧W ernerN utt ≥ 4.5∨IoanaM anolescu ≥ 3.5∧SophieCluet ≥ 6.5∧N eoklisP olyzotis ≥ 0.5 ST ACS ≥ 12.5 ∧ SODA ≥ 1.5 ←→ LeenT orenvliet ≥ 1.5 ∧ DietervanM elkebeek ≥ 0.5 W W W ≥ 6.5 ∧ F OCS ≥ 1 ←→ R.Guha ≥ 2.5 ∧ AnirbanDasgupta ≥ 0.5 P KDD ≥ 11 ←→ ShojiHirano ≥ 15 ICDM ≥ 15.5 ←→ Kun − LungW u ≥ 24 EDBT ≥ 9.5 ←→ M ichaelGillmann ≥ 8 ICDM < 7.5∧SDM ≥ 4.5∧V LDB ≥ 2.5∨ICDM ≥ 7.5∧EDBT ≥ 1.5∧ICDT < 1.5 ←→ Chang −ShingP erng < 5.5∧Kun−LungW u ≥ 6∧AleksandarLazarevic < 1.5∨Chang−ShingP erng ≥ 5.5∧W eiW ang ≥ 3∧KeW ang ≥ 0.5 U AI ≥ 36 ←→ BrendanJ.F rey ≥ 1.5 ECM L ≥ 7.5 ∧ P KDD ≥ 3 ←→ JamesCussens ≥ 1.5 ∧ N adaLavrac ≥ 3.5 ICM L ≥ 16.5 ←→ Jef f G.Schneider ≥ 7.5 F OCS ≥ 20.5 ∧ SODA < 1.5 ∨ F OCS < 20.5 ∧ ST OC < 13.5 ∧ ST ACS ≥ 14.5 ∨ F OCS < 20.5 ∧ ST OC ≥ 13.5 ∧ COLT < 3 ←→ SilvioM icali ≥ 1.5 ∧ Shaf iGoldwasser ≥ 1.5 ∨ SilvioM icali < 1.5 ∧ AviW igderson < 0.5 ∧ LeenT orenvliet ≥ 7.5 ∨ SilvioM icali < 1.5 ∧ AviW igderson ≥ 0.5 ∧ DanRoth < 0.5 ST OC < 8.5 ∨ ST OC ≥ 8.5 ∧ F OCS < 7.5 ←→ AviW igderson < 0.5 ∨ AviW igderson ≥ 0.5 ∧ RonaldL.Rivest < 0.5 ST ACS ≥ 2.5 ∨ ST ACS < 2.5 ∧ SIGM ODConf erence ≥ 0.5 ∧ P ODS < 13.5 ∨ ST ACS < 2.5 ∧ SIGM 
Table B.5: Redescriptions mined by Algorithm 2 from the DBLP data set, using the k-means (5 clusters) binarization routine, the Gini impurity measure, and min bucket = Li/100. LHS denotes the left-hand-side and RHS the right-hand-side query of a redescription; J is the Jaccard similarity, E1,1 the support, and p-val the p-value; CONF[a−b] indicates that an author submitted between a and b papers to conference CONF.

J = 1.000, E1,1 = 7, p-val = 0.000: FOCS ≥ 19.5 ∧ STOC ≥ 15.5 ∧ ICDT < 1.5 ←→ JohanHastad ≥ 2.5 ∧ SilvioMicali ≥ 0.5 ∧ KunalTalwar < 1
J = 1.000, E1,1 = 5, p-val = 0.000: VLDB < 20.5 ∧ SIGMODConference ≥ 21 ∧ EDBT ≥ 7 ∨ VLDB ≥ 20.5 ∧ PODS ≥ 8.5 ∧ FOCS < 0.5 ←→ S.Sudarshan < 7.5 ∧ NarainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 1.5 ∨ S.Sudarshan ≥ 7.5 ∧ JohannesGehrke ≥ 0.5 ∧ MinosN.Garofalakis < 3
J = 1.000, E1,1 = 2, p-val = 0.000: VLDB ≥ 26.5 ∧ KDD ≥ 0.5 ∧ WWW < 3 ←→ NarainH.Gehani ≥ 4.5 ∧ JigneshM.Patel ≥ 2.5 ∧ SaktiP.Ghosh < 1
J = 1.000, E1,1 = 3, p-val = 0.000: ICDT < 7.5 ∧ PODS ≥ 21.5 ∧ VLDB ≥ 4 ∨ ICDT ≥ 7.5 ∧ SODA ≥ 0.5 ∧ SDM < 0.5 ←→ IoanaManolescu < 3.5 ∧ YaronKanza ≥ 10.5 ∧ WernerNutt ≥ 4.5 ∨ IoanaManolescu ≥ 3.5 ∧ SophieCluet ≥ 6.5 ∧ NeoklisPolyzotis ≥ 0.5
J = 1.000, E1,1 = 3, p-val = 0.000: STACS ≥ 12.5 ∧ SODA ≥ 1.5 ←→ LeenTorenvliet ≥ 1.5 ∧ DietervanMelkebeek ≥ 0.5
J = 1.000, E1,1 = 2, p-val = 0.000: WWW ≥ 6.5 ∧ FOCS ≥ 1 ←→ R.Guha ≥ 2.5 ∧ AnirbanDasgupta ≥ 0.5
J = 1.000, E1,1 = 1, p-val = 0.000: PKDD ≥ 11 ←→ ShojiHirano ≥ 15
J = 1.000, E1,1 = 1, p-val = 0.000: ICDM ≥ 15.5 ←→ Kun-LungWu ≥ 24
J = 1.000, E1,1 = 1, p-val = 0.000: EDBT ≥ 9.5 ←→ MichaelGillmann ≥ 8
J = 1.000, E1,1 = 4, p-val = 0.000: ICDM < 7.5 ∧ SDM ≥ 4.5 ∧ VLDB ≥ 2.5 ∨ ICDM ≥ 7.5 ∧ EDBT ≥ 1.5 ∧ ICDT < 1.5 ←→ Chang-ShingPerng < 5.5 ∧ Kun-LungWu ≥ 6 ∧ AleksandarLazarevic < 1.5 ∨ Chang-ShingPerng ≥ 5.5 ∧ WeiWang ≥ 3 ∧ KeWang ≥ 0.5
J = 1.000, E1,1 = 1, p-val = 0.000: UAI ≥ 36 ←→ BrendanJ.Frey ≥ 1.5
J = 1.000, E1,1 = 1, p-val = 0.000: ECML ≥ 7.5 ∧ PKDD ≥ 3 ←→ JamesCussens ≥ 1.5 ∧ NadaLavrac ≥ 3.5
J = 1.000, E1,1 = 1, p-val = 0.000: ICML ≥ 16.5 ←→ JeffG.Schneider ≥ 7.5
J = 0.969, E1,1 = 31, p-val = 0.000: FOCS ≥ 20.5 ∧ SODA < 1.5 ∨ FOCS < 20.5 ∧ STOC < 13.5 ∧ STACS ≥ 14.5 ∨ FOCS < 20.5 ∧ STOC ≥ 13.5 ∧ COLT < 3 ←→ SilvioMicali ≥ 1.5 ∧ ShafiGoldwasser ≥ 1.5 ∨ SilvioMicali < 1.5 ∧ AviWigderson < 0.5 ∧ LeenTorenvliet ≥ 7.5 ∨ SilvioMicali < 1.5 ∧ AviWigderson ≥ 0.5 ∧ DanRoth < 0.5
J = 0.951, E1,1 = 2210, p-val = 0.067: STOC < 8.5 ∨ STOC ≥ 8.5 ∧ FOCS < 7.5 ←→ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ RonaldL.Rivest < 0.5
J = 0.949, E1,1 = 2217, p-val = 0.284: STACS ≥ 2.5 ∨ STACS < 2.5 ∧ SIGMODConference ≥ 0.5 ∧ PODS < 13.5 ∨ STACS < 2.5 ∧ SIGMODConference < 0.5 ∧ ICDT < 0.5 ←→ LeszekGasieniec ≥ 2.5 ∨ LeszekGasieniec < 2.5 ∧ HectorGarcia-Molina ≥ 0.5 ∧ CatrielBeeri < 2.5 ∨ LeszekGasieniec < 2.5 ∧ HectorGarcia-Molina < 0.5 ∧ DilysThomas < 1.5
J = 0.947, E1,1 = 2194, p-val = 0.021: STOC < 8.5 ∨ STOC ≥ 8.5 ∧ FOCS < 7.5 ←→ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ MoniNaor < 0.5
J = 0.945, E1,1 = 121, p-val = 0.000: SODA < 7.5 ∧ STOC < 6.5 ∧ STACS ≥ 14.5 ∨ SODA < 7.5 ∧ STOC ≥ 6.5 ∧ ICDT < 0.5 ∨ SODA ≥ 7.5 ∧ FOCS ≥ 8.5 ∧ VLDB ≥ 0.5 ←→ MosesCharikar < 0.5 ∧ AviWigderson < 0.5 ∧ LeenTorenvliet ≥ 7 ∨ MosesCharikar < 0.5 ∧ AviWigderson ≥ 0.5 ∧ HectorGarcia-Molina < 0.5 ∨ MosesCharikar ≥ 0.5 ∧ BonnieBerger ≥ 0.5 ∧ JohnE.Hopcroft ≥ 0.5
J = 0.922, E1,1 = 2144, p-val = 0.170: UAI ≥ 2.5 ∨ UAI < 2.5 ∧ ICDE ≥ 0.5 ∧ VLDB < 6.5 ∨ UAI < 2.5 ∧ ICDE < 0.5 ∧ SIGMODConference < 7.5 ←→ TomiSilander ≥ 0.5 ∨ TomiSilander < 0.5 ∧ GioWiederhold ≥ 0.5 ∧ MichaelJ.Carey < 1.5 ∨ TomiSilander < 0.5 ∧ GioWiederhold < 0.5 ∧ SophieCluet < 7
J = 0.909, E1,1 = 110, p-val = 0.000: PKDD ≥ 2.5 ∧ SIGMODConference < 0.5 ∨ PKDD < 2.5 ∧ ICDM ≥ 3.5 ∧ KDD ≥ 1.5 ∨ SIGMODConference ≥ 0.5 ∧ VLDB < 11 ←→ StephaneLallich ≥ 0.5 ∧ RaymondT.Ng < 1.5 ∨ StephaneLallich < 0.5 ∧ HaixunWang ≥ 3.5 ∧ PhilipS.Yu ≥ 0.5 ∨ RaymondT.Ng ≥ 1.5 ∧ WeiWang < 3.5
J = 0.898, E1,1 = 184, p-val = 0.000: SDM ≥ 0.5 ∧ STOC < 0.5 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ EDBT ≥ 1.5 ∨ STOC ≥ 0.5 ∧ FOCS < 7 ←→ VipinKumar ≥ 0.5 ∧ SridharRajagopalan < 2 ∨ VipinKumar < 0.5 ∧ MatthiasSchubert ≥ 0.5 ∧ PeterKunath ≥ 0.5 ∨ SridharRajagopalan ≥ 2 ∧ VenkatesanGuruswami < 0.5
J = 0.880, E1,1 = 44, p-val = 0.000: SIGMODConference < 11.5 ∧ VLDB ≥ 15.5 ∧ ICDE < 19.5 ∨ SIGMODConference ≥ 11.5 ∧ SODA < 1.5 ∧ PODS < 6.5 ←→ BanuOzden < 1.5 ∧ JigneshM.Patel ≥ 1.5 ∧ UmeshwarDayal < 0.5 ∨ BanuOzden ≥ 1.5 ∧ MarioSzegedy < 0.5 ∧ SophieCluet < 0.5
J = 0.847, E1,1 = 116, p-val = 0.000: SDM ≥ 0.5 ∧ PKDD < 1.5 ∨ SDM < 0.5 ∧ ICDM ≥ 2.5 ∧ KDD ≥ 3.5 ∨ PKDD ≥ 1.5 ∧ ICDE < 0.5 ←→ ArindamBanerjee ≥ 2 ∧ GiuseppeManco < 0.5 ∨ ArindamBanerjee < 2 ∧ HaixunWang ≥ 0.5 ∧ JiaweiHan ≥ 6 ∨ GiuseppeManco ≥ 0.5 ∧ DinoPedreschi < 7
J = 0.833, E1,1 = 15, p-val = 0.000: ICDE ≥ 12.5 ∧ EDBT ≥ 2 ∨ ICDE < 12.5 ∧ SIGMODConference ≥ 19.5 ∧ WWW ≥ 0.5 ←→ FlipKorn ≥ 0.5 ∧ KrithiRamamritham < 4.5 ∨ FlipKorn < 0.5 ∧ SudarshanS.Chawathe ≥ 3 ∧ MayankBawa ≥ 0.5
J = 0.826, E1,1 = 271, p-val = 0.000: SODA ≥ 10.5 ∨ SODA < 10.5 ∧ FOCS < 3.5 ∧ STOC ≥ 6.5 ∨ SODA < 10.5 ∧ FOCS ≥ 3.5 ∧ PODS < 1.5 ←→ MartinPal ≥ 1.5 ∨ MartinPal < 1.5 ∧ AviWigderson < 0.5 ∧ RichardJ.Anderson ≥ 0.5 ∨ MartinPal < 1.5 ∧ AviWigderson ≥ 0.5 ∧ RonaldFagin < 1.5
J = 0.800, E1,1 = 4, p-val = 0.000: PODS < 13.5 ∧ ICDT ≥ 7.5 ∨ PODS ≥ 13.5 ∧ VLDB ≥ 4.5 ←→ MarianoP.Consens < 1.5 ∧ SophieCluet ≥ 7.5 ∨ MarianoP.Consens ≥ 1.5 ∧ DiveshSrivastava ≥ 0.5
J = 0.800, E1,1 = 4, p-val = 0.000: VLDB < 14.5 ∧ SIGMODConference ≥ 19.5 ∧ EDBT ≥ 7 ∨ VLDB ≥ 14.5 ∧ ICDE ≥ 13.5 ∧ PODS < 0.5 ←→ SungRanCho < 1 ∧ NarainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 2 ∨ SungRanCho ≥ 1 ∧ BengChinOoi ≥ 0.5 ∧ ShojiroNishio ≥ 0.5
J = 0.800, E1,1 = 4, p-val = 0.000: VLDB ≥ 26.5 ∧ ICDT < 1 ∨ ICDT ≥ 1 ∧ WWW < 1 ←→ NarainH.Gehani ≥ 4.5 ∧ YuriBreitbart < 0.5 ∨ YuriBreitbart ≥ 0.5 ∧ RaymondT.Ng ≥ 1
J = 0.800, E1,1 = 4, p-val = 0.000: COLT ≥ 22.5 ∧ ICML ≥ 4.5 ←→ DavidP.Helmbold ≥ 3.5 ∧ SallyA.Goldman ≥ 0.5
J = 0.800, E1,1 = 4, p-val = 0.000: KDD ≥ 16.5 ∧ ICDT < 2 ←→ KeWang ≥ 5.5 ∧ BelaBollobas < 0.5
J = 0.671, E1,1 = 47, p-val = 0.000: FOCS ≥ 11.5 ∧ COLT < 19.5 ∨ FOCS < 11.5 ∧ STOC ≥ 13.5 ∧ SODA < 1.5 ←→ SantoshVempala ≥ 2.5 ∧ WayneEberly < 0.5 ∨ SantoshVempala < 2.5 ∧ SilvioMicali ≥ 0.5 ∧ ShafiGoldwasser ≥ 1.5
J = 0.667, E1,1 = 2, p-val = 0.000: SODA ≥ 25 ∧ ICDT ≥ 0.5 ∧ STOC ≥ 4.5 ←→ MarkH.Overmars ≥ 3.5 ∧ RajmohanRajaraman ≥ 1 ∧ HerbertEdelsbrunner < 2.5
J = 0.667, E1,1 = 2, p-val = 0.000: SDM ≥ 5.5 ∨ SDM < 5.5 ∧ ICDM ≥ 15.5 ←→ HuiXiong ≥ 6.5 ∨ HuiXiong < 6.5 ∧ Kun-LungWu ≥ 22
J = 0.615, E1,1 = 40, p-val = 0.000: SIGMODConference < 7.5 ∧ VLDB ≥ 10.5 ∧ PODS ≥ 0.5 ∨ SIGMODConference ≥ 7.5 ∧ ICDE ≥ 2.5 ∧ SDM ≥ 1.5 ←→ MichaelJ.Carey < 0.5 ∧ H.V.Jagadish ≥ 2.5 ∧ ErichJ.Neuhold < 1.5 ∨ MichaelJ.Carey ≥ 0.5 ∧ PhilipS.Yu ≥ 1.5 ∧ JiaweiHan ≥ 0.5
J = 0.605, E1,1 = 52, p-val = 0.000: STOC < 8.5 ∧ FOCS ≥ 8 ∧ COLT < 1.5 ∨ STOC ≥ 8.5 ∧ SODA ≥ 5.5 ∧ VLDB ≥ 2.5 ←→ MoniNaor < 0.5 ∧ AviWigderson ≥ 0.5 ∧ JeffreyC.Jackson < 0.5 ∨ MoniNaor ≥ 0.5 ∧ MosesCharikar ≥ 0.5 ∧ NinaMishra ≥ 1
J = 0.600, E1,1 = 3, p-val = 0.000: ICDT ≥ 5.5 ∨ ICDT < 5.5 ∧ PODS ≥ 19.5 ∧ VLDB ≥ 10 ←→ FrankNeven ≥ 3.5 ∨ FrankNeven < 3.5 ∧ SophieCluet ≥ 8.5 ∧ TirthankarLahiri ≥ 1
J = 0.571, E1,1 = 4, p-val = 0.000: VLDB ≥ 18.5 ∧ EDBT ≥ 4.5 ∨ VLDB < 18.5 ∧ SIGMODConference ≥ 22 ∧ PODS ≥ 9.5 ←→ AbrahamSilberschatz ≥ 0.5 ∧ ShaulDar ≥ 1.5 ∨ AbrahamSilberschatz < 0.5 ∧ PraveenSeshadri ≥ 5.5 ∧ JohannesGehrke ≥ 1.5
J = 0.556, E1,1 = 5, p-val = 0.000: COLT ≥ 16.5 ∧ FOCS ≥ 1.5 ∨ COLT < 16.5 ∧ ICML ≥ 9.5 ∧ UAI ≥ 1.5 ←→ JohnLangford ≥ 1.5 ∧ SallyA.Goldman ≥ 0.5 ∨ JohnLangford < 1.5 ∧ DavidCohn ≥ 0.5 ∧ JustinA.Boyan ≥ 1
J = 0.524, E1,1 = 43, p-val = 0.000: SODA ≥ 17.5 ∨ SODA < 17.5 ∧ FOCS ≥ 10.5 ∧ STOC ≥ 9.5 ←→ RichardCole ≥ 0.5 ∨ RichardCole < 0.5 ∧ LaszloLovasz ≥ 0.5 ∧ JurisHartmanis < 0.5
J = 0.500, E1,1 = 6, p-val = 0.000: ICDM ≥ 6.5 ∨ ICDM < 6.5 ∧ SDM ≥ 10.5 ←→ PhilipS.Yu ≥ 11.5 ∨ PhilipS.Yu < 11.5 ∧ Kun-LungWu ≥ 22
J = 0.500, E1,1 = 39, p-val = 0.000: UAI ≥ 9.5 ∧ VLDB < 0.5 ∨ UAI < 9.5 ∧ ICML < 0.5 ∧ COLT ≥ 7.5 ∨ UAI < 9.5 ∧ ICML ≥ 0.5 ∧ STOC ≥ 0.5 ←→ MaxHenrion ≥ 0.5 ∧ DmitryPavlov < 1 ∨ MaxHenrion < 0.5 ∧ RobertE.Schapire < 0.5 ∧ PeterL.Bartlett ≥ 1.5 ∨ MaxHenrion < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ ManfredK.Warmuth ≥ 0.5
J = 0.500, E1,1 = 1, p-val = 0.001: KDD ≥ 20.5 ∧ PKDD ≥ 3 ←→ AnthonyK.H.Tung ≥ 7 ∧ ShojiroNishio ≥ 1
J = 0.418, E1,1 = 133, p-val = 0.000: UAI < 23 ∧ SDM < 0.5 ∧ KDD ≥ 1.5 ∨ UAI < 23 ∧ SDM ≥ 0.5 ∧ ICDM ≥ 0.5 ∨ UAI ≥ 23 ∧ ICML < 1.5 ∧ COLT < 0.5 ←→ DavidHeckerman < 2.5 ∧ VipinKumar < 0.5 ∧ King-IpLin ≥ 0.5 ∨ DavidHeckerman < 2.5 ∧ VipinKumar ≥ 0.5 ∧ PascalPoncelet < 1 ∨ DavidHeckerman ≥ 2.5 ∧ MoisesGoldszmidt < 0.5 ∧ BlaiBonet < 0.5
J = 0.409, E1,1 = 199, p-val = 0.000: STACS ≥ 5.5 ∧ VLDB < 1 ∨ STACS < 5.5 ∧ FOCS < 5.5 ∧ STOC ≥ 2.5 ∨ STACS < 5.5 ∧ FOCS ≥ 5.5 ∧ SODA < 5.5 ←→ RiccardoSilvestri ≥ 1.5 ∧ KennethW.Regan < 0.5 ∨ RiccardoSilvestri < 1.5 ∧ AviWigderson < 0.5 ∧ MoniNaor ≥ 0.5 ∨ RiccardoSilvestri < 1.5 ∧ AviWigderson ≥ 0.5 ∧ MosesCharikar < 0.5
J = 0.375, E1,1 = 18, p-val = 0.000: COLT ≥ 7.5 ∨ COLT < 7.5 ∧ ICML < 3.5 ∧ UAI ≥ 14.5 ∨ COLT < 7.5 ∧ ICML ≥ 3.5 ∧ FOCS ≥ 0.5 ←→ ManfredK.Warmuth ≥ 1.5 ∨ ManfredK.Warmuth < 1.5 ∧ DavidA.McAllester < 0.5 ∧ DavidMaxwellChickering ≥ 0.5 ∨ ManfredK.Warmuth < 1.5 ∧ DavidA.McAllester ≥ 0.5 ∧ YoavFreund ≥ 1.5
J = 0.353, E1,1 = 6, p-val = 0.000: PODS ≥ 11.5 ∨ PODS < 11.5 ∧ ICDT ≥ 7.5 ←→ FrankNeven ≥ 0.5 ∨ FrankNeven < 0.5 ∧ SophieCluet ≥ 7.5
J = 0.337, E1,1 = 33, p-val = 0.000: STOC ≥ 8.5 ∨ STOC < 8.5 ∧ FOCS ≥ 8.5 ∧ SODA < 1.5 ←→ AviWigderson ≥ 0.5 ∨ AviWigderson < 0.5 ∧ SalilP.Vadhan ≥ 4.5 ∧ ShafiGoldwasser ≥ 0.5
J = 0.333, E1,1 = 4, p-val = 0.000: WWW ≥ 4.5 ∨ WWW < 4.5 ∧ VLDB ≥ 25.5 ∧ EDBT ≥ 4 ←→ RudiStuder ≥ 6.5 ∨ RudiStuder < 6.5 ∧ NarainH.Gehani ≥ 4.5 ∧ ShivakumarVenkataraman < 0.5
J = 0.333, E1,1 = 1, p-val = 0.001: PKDD ≥ 8 ←→ EllaBingham ≥ 2.5
J = 0.333, E1,1 = 2, p-val = 0.000: VLDB ≥ 28 ←→ JigneshM.Patel ≥ 1.5
J = 0.328, E1,1 = 19, p-val = 0.000: ECML < 2.5 ∧ ICML < 0.5 ∧ COLT ≥ 7.5 ∨ ECML < 2.5 ∧ ICML ≥ 0.5 ∧ STOC ≥ 0.5 ∨ ECML ≥ 2.5 ∧ KDD ≥ 2.5 ∧ UAI ≥ 2.5 ←→ SasoDzeroski < 1.5 ∧ RobertE.Schapire < 0.5 ∧ PeterL.Bartlett ≥ 1.5 ∨ SasoDzeroski < 1.5 ∧ RobertE.Schapire ≥ 0.5 ∧ ManfredK.Warmuth ≥ 0.5 ∨ SasoDzeroski ≥ 1.5 ∧ TomiSilander ≥ 3.5 ∧ PeterGrunwald ≥ 1
J = 0.314, E1,1 = 11, p-val = 0.000: PODS ≥ 5.5 ∨ PODS < 5.5 ∧ ICDT ≥ 2.5 ∧ SIGMODConference ≥ 11 ←→ LeonidLibkin ≥ 0.5 ∨ LeonidLibkin < 0.5 ∧ MarcGyssens ≥ 0.5 ∧ ChrisClifton ≥ 0.5
J = 0.294, E1,1 = 5, p-val = 0.000: ICML ≥ 8.5 ∨ ICML < 8.5 ∧ KDD ≥ 8.5 ∧ ICDM ≥ 7.5 ←→ ThomasHofmann ≥ 0.5 ∨ ThomasHofmann < 0.5 ∧ KeWang ≥ 5.5 ∧ WeiWang ≥ 6
J = 0.258, E1,1 = 16, p-val = 0.000: COLT ≥ 3.5 ∨ COLT < 3.5 ∧ ICML < 3.5 ∧ UAI ≥ 14.5 ∨ COLT < 3.5 ∧ ICML ≥ 3.5 ∧ FOCS ≥ 2.5 ←→ NaderH.Bshouty ≥ 1.5 ∨ NaderH.Bshouty < 1.5 ∧ DavidA.McAllester < 0.5 ∧ DavidMaxwellChickering ≥ 0.5 ∨ NaderH.Bshouty < 1.5 ∧ DavidA.McAllester ≥ 0.5 ∧ SallyA.Goldman ≥ 0.5
J = 0.228, E1,1 = 18, p-val = 0.000: SIGMODConference ≥ 2.5 ∧ SDM ≥ 1.5 ∨ SIGMODConference < 2.5 ∧ VLDB ≥ 4.5 ∧ PODS ≥ 1.5 ←→ HectorGarcia-Molina ≥ 0.5 ∧ PhilipS.Yu ≥ 2.5 ∨ HectorGarcia-Molina < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ CatrielBeeri ≥ 0.5
J = 0.200, E1,1 = 1, p-val = 0.004: STACS ≥ 12.5 ←→ StephenA.Fenner ≥ 3
J = 0.192, E1,1 = 14, p-val = 0.000: PKDD ≥ 4.5 ∨ PKDD < 4.5 ∧ KDD < 14.5 ∧ SDM ≥ 1.5 ∨ PKDD < 4.5 ∧ KDD ≥ 14.5 ∧ ICDM ≥ 4.5 ←→ ArnoJ.Knobbe ≥ 0.5 ∨ ArnoJ.Knobbe < 0.5 ∧ KeWang < 5.5 ∧ PhilipS.Yu ≥ 4.5 ∨ ArnoJ.Knobbe < 0.5 ∧ KeWang ≥ 5.5 ∧ DmitryPavlov < 1
J = 0.175, E1,1 = 11, p-val = 0.000: ECML ≥ 0.5 ∧ PKDD ≥ 0.5 ∨ ECML < 0.5 ∧ ICML ≥ 0.5 ∧ UAI ≥ 3.5 ←→ SasoDzeroski ≥ 0.5 ∧ StefanKramer ≥ 0.5 ∨ SasoDzeroski < 0.5 ∧ SatinderP.Singh ≥ 0.5 ∧ CraigBoutilier ≥ 0.5
J = 0.167, E1,1 = 1, p-val = 0.003: ICDM ≥ 10 ∨ ICDM < 10 ∧ SDM ≥ 10.5 ←→ ShengMa ≥ 1.5 ∨ ShengMa < 1.5 ∧ Kun-LungWu ≥ 24
J = 0.167, E1,1 = 4, p-val = 0.000: KDD ≥ 7.5 ∨ KDD < 7.5 ∧ ICDE ≥ 24.5 ←→ TomFawcett ≥ 0.5 ∨ TomFawcett < 0.5 ∧ KeWang ≥ 6.5
J = 0.166, E1,1 = 28, p-val = 0.000: SIGMODConference ≥ 4.5 ∨ SIGMODConference < 4.5 ∧ VLDB ≥ 8.5 ∧ PODS ≥ 2.5 ←→ H.V.Jagadish ≥ 0.5 ∨ H.V.Jagadish < 0.5 ∧ S.Seshadri ≥ 0.5 ∧ YuriBreitbart ≥ 0.5
J = 0.155, E1,1 = 43, p-val = 0.000: PODS ≥ 5.5 ∧ FOCS ≥ 13 ∨ PODS < 5.5 ∧ ICDT < 2.5 ∧ SIGMODConference ≥ 3.5 ∨ PODS < 5.5 ∧ ICDT ≥ 2.5 ∧ VLDB ≥ 9.5 ←→ MosheY.Vardi ≥ 11.5 ∧ M.R.Garey ≥ 0.5 ∨ MosheY.Vardi < 11.5 ∧ EmmanuelWaller < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∨ MosheY.Vardi < 11.5 ∧ EmmanuelWaller ≥ 0.5 ∧ SophieCluet ≥ 4.5
J = 0.139, E1,1 = 5, p-val = 0.000: SIGMODConference < 4.5 ∧ VLDB ≥ 5.5 ∧ ICDT ≥ 3.5 ∨ SIGMODConference ≥ 4.5 ∧ PODS ≥ 0.5 ∧ ICDE ≥ 5.5 ←→ H.V.Jagadish < 0.5 ∧ DiveshSrivastava ≥ 0.5 ∧ AlonY.Levy ≥ 2.5 ∨ H.V.Jagadish ≥ 0.5 ∧ RaghuRamakrishnan ≥ 2.5 ∧ SurajitChaudhuri ≥ 2.5
J = 0.132, E1,1 = 48, p-val = 0.000: SODA ≥ 2.5 ∨ SODA < 2.5 ∧ STOC ≥ 1.5 ∧ FOCS ≥ 6.5 ←→ KurtMehlhorn ≥ 0.5 ∨ KurtMehlhorn < 0.5 ∧ AviWigderson ≥ 0.5 ∧ MoniNaor ≥ 0.5
J = 0.130, E1,1 = 6, p-val = 0.000: SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM < 4.5 ∧ ICDT ≥ 6.5 ∨ SDM < 1.5 ∧ ICDM ≥ 4.5 ∧ ICDE ≥ 20 ←→ SrujanaMerugu ≥ 1.5 ∨ SrujanaMerugu < 1.5 ∧ JianyongWang < 0.5 ∧ SophieCluet ≥ 8.5 ∨ SrujanaMerugu < 1.5 ∧ JianyongWang ≥ 0.5 ∧ WeiWang ≥ 6
J = 0.122, E1,1 = 5, p-val = 0.000: WWW ≥ 2.5 ∨ WWW < 2.5 ∧ EDBT ≥ 5.5 ∧ VLDB ≥ 26.5 ←→ LyleH.Ungar ≥ 1.5 ∨ LyleH.Ungar < 1.5 ∧ NarainH.Gehani ≥ 4.5 ∧ ShaulDar ≥ 1.5
J = 0.102, E1,1 = 5, p-val = 0.000: WWW ≥ 1.5 ∨ WWW < 1.5 ∧ ICDE ≥ 21.5 ∧ PKDD ≥ 1.5 ←→ DanielF.Lieuwen ≥ 5 ∨ DanielF.Lieuwen < 5 ∧ CharuC.Aggarwal ≥ 3.5 ∧ WeiWang ≥ 4
J = 0.065, E1,1 = 7, p-val = 0.000: EDBT ≥ 1.5 ∨ EDBT < 1.5 ∧ VLDB ≥ 11.5 ∧ FOCS ≥ 0.5 ←→ ElenaBaralis ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish ≥ 2.5 ∧ YossiMatias ≥ 1.5
J = 0.060, E1,1 = 4, p-val = 0.000: ICDE ≥ 6.5 ∨ ICDE < 6.5 ∧ VLDB ≥ 12.5 ∧ PKDD ≥ 3 ←→ HenryF.Korth ≥ 7.5 ∨ HenryF.Korth < 7.5 ∧ LaksV.S.Lakshmanan ≥ 9 ∧ ShojiroNishio ≥ 1.5
J = 0.056, E1,1 = 1, p-val = 0.020: FOCS ≥ 27 ←→ FriedhelmMeyeraufderHeide ≥ 1.5
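Tables B.4, B.5, and B.6 differ in the binarization routine that turns the non-binary paper-count attributes into Boolean ones before decision-tree induction: DBSCAN, k-means with 5 clusters, and hierarchical clustering with 5 clusters, respectively. The following is a minimal sketch of the k-means variant only, assuming each attribute is clustered separately and each cluster becomes one Boolean column; the function and parameter names are illustrative, not taken from the thesis implementation, and the hierarchical variant would differ only in the clustering step.

    # Sketch: k-means (5 clusters) binarization of one numeric attribute.
    # Each cluster of values becomes one Boolean indicator column, which can
    # then be read as a CONF[a-b] bucket in the sense of Table B.5's caption.
    import numpy as np
    from sklearn.cluster import KMeans

    def binarize_kmeans(column: np.ndarray, n_clusters: int = 5) -> np.ndarray:
        """Return a Boolean matrix with one indicator column per cluster."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(column.reshape(-1, 1))
        # Reorder cluster ids by cluster center so that column j corresponds
        # to the j-th value range, keeping the Boolean columns interpretable.
        order = np.argsort(km.cluster_centers_.ravel())
        rank = np.empty_like(order)
        rank[order] = np.arange(n_clusters)
        return np.eye(n_clusters, dtype=bool)[rank[labels]]

    # Example: paper counts of eight authors at one conference.
    counts = np.array([0, 0, 1, 2, 2, 5, 9, 24])
    print(binarize_kmeans(counts).astype(int))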
Table B.6: Redescriptions mined by Algorithm 2 from the DBLP data set, using the hierarchical (5 clusters) binarization routine and the IG impurity measure. LHS denotes the left-hand-side and RHS the right-hand-side query of a redescription; J is the Jaccard similarity, E1,1 the support, and p-val the p-value.

J = 1, E1,1 = 1, p-val = 0: ICDM ≥ 15.5 ←→ Kun-LungWu ≥ 24
J = 0.965, E1,1 = 2239, p-val = 0.022: UAI < 2.5 ∧ ICDE < 0.5 ∨ UAI ≥ 2.5 ∧ FOCS < 1.5 ∨ ICDE ≥ 0.5 ∧ SIGMODConference < 8.5 ∨ FOCS ≥ 1.5 ∧ STOC < 8.5 ←→ TomiSilander < 0.5 ∧ SurajitChaudhuri < 0.5 ∨ TomiSilander ≥ 0.5 ∧ AviWigderson < 0.5 ∨ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ AviWigderson ≥ 0.5 ∧ SridharRajagopalan < 1.5
J = 0.959, E1,1 = 745, p-val = 0: SIGMODConference < 1.5 ∧ VLDB ≥ 0.5 ∧ SODA < 1.5 ∨ SIGMODConference ≥ 1.5 ∧ ICDE ≥ 3.5 ∧ KDD ≥ 3 ←→ H.V.Jagadish < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∧ MosesCharikar < 1.5 ∨ H.V.Jagadish ≥ 0.5 ∧ JiaweiHan ≥ 2.5 ∧ BengChinOoi ≥ 0.5
J = 0.959, E1,1 = 1557, p-val = 0: STOC ≥ 1.5 ∧ FOCS < 0.5 ∨ STOC < 1.5 ∧ SODA < 0.5 ∨ FOCS ≥ 0.5 ∧ SIGMODConference ≥ 0.5 ∨ SODA ≥ 0.5 ∧ PODS ≥ 1.5 ←→ FrankThomsonLeighton ≥ 0.5 ∧ AviWigderson < 0.5 ∨ FrankThomsonLeighton < 0.5 ∧ SergeA.Plotkin < 0.5 ∨ AviWigderson ≥ 0.5 ∧ CatrielBeeri ≥ 0.5 ∨ SergeA.Plotkin ≥ 0.5 ∧ MayurDatar ≥ 0.5
J = 0.957, E1,1 = 707, p-val = 0: STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ VLDB < 1.5 ∨ STOC ≥ 0.5 ∧ SODA ≥ 4.5 ∧ SIGMODConference ≥ 1 ←→ FriedhelmMeyeraufderHeide < 1.5 ∧ AviWigderson ≥ 0.5 ∧ AlbertoO.Mendelzon < 0.5 ∨ FriedhelmMeyeraufderHeide ≥ 1.5 ∧ SantoshVempala ≥ 0.5 ∧ MayurDatar ≥ 2
J = 0.949, E1,1 = 2197, p-val = 0.015: WWW ≥ 1.5 ∨ WWW < 1.5 ∧ FOCS < 0.5 ∨ FOCS ≥ 0.5 ∧ STOC < 8.5 ←→ SridharRajagopalan ≥ 0.5 ∨ SridharRajagopalan < 0.5 ∧ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ MadhuSudan < 0.5
J = 0.949, E1,1 = 636, p-val = 0: STOC < 1.5 ∧ FOCS ≥ 0.5 ∧ VLDB < 1.5 ∨ STOC ≥ 1.5 ∧ SODA ≥ 4.5 ∧ COLT ≥ 2.5 ←→ MichaelE.Saks < 0.5 ∧ AviWigderson ≥ 0.5 ∧ AlbertoO.Mendelzon < 0.5 ∨ MichaelE.Saks ≥ 0.5 ∧ AmosFiat ≥ 0.5 ∧ NaderH.Bshouty ≥ 0.5
J = 0.919, E1,1 = 711, p-val = 0: STOC < 0.5 ∧ FOCS ≥ 0.5 ∧ SIGMODConference < 0.5 ∨ STOC ≥ 0.5 ∧ SODA ≥ 0.5 ∧ ICDE < 0.5 ←→ TomasFeder < 0.5 ∧ AviWigderson ≥ 0.5 ∧ CatrielBeeri < 0.5 ∨ TomasFeder ≥ 0.5 ∧ AmosFiat ≥ 0.5 ∧ SergeA.Plotkin < 1.5
J = 0.904, E1,1 = 2094, p-val = 0.091: WWW ≥ 0.5 ∨ WWW < 0.5 ∧ FOCS < 0.5 ∨ FOCS ≥ 0.5 ∧ STOC < 8.5 ←→ AmelieMarian ≥ 1.5 ∨ AmelieMarian < 1.5 ∧ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ MoniNaor < 0.5
J = 0.879, E1,1 = 2034, p-val = 0.091: COLT ≥ 3.5 ∨ COLT < 3.5 ∧ VLDB ≥ 0.5 ∧ SIGMODConference < 7.5 ∨ COLT < 3.5 ∧ VLDB < 0.5 ∧ ICDE < 0.5 ←→ NicoloCesa-Bianchi ≥ 0.5 ∨ NicoloCesa-Bianchi < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava < 0.5 ∨ NicoloCesa-Bianchi < 0.5 ∧ MichaelJ.Carey < 0.5 ∧ Hans-PeterKriegel < 7.5
J = 0.878, E1,1 = 361, p-val = 0: STOC ≥ 2.5 ∧ ICDE < 1.5 ∨ STOC < 2.5 ∧ FOCS ≥ 1.5 ∧ SODA ≥ 9.5 ∨ ICDE ≥ 1.5 ∧ EDBT < 0.5 ←→ MoniNaor ≥ 0.5 ∧ NickKoudas < 0.5 ∨ MoniNaor < 0.5 ∧ MadhuSudan ≥ 0.5 ∧ SudiptoGuha ≥ 0.5
J = 0.872, E1,1 = 2014, p-val = 0.052: STOC ≥ 1.5 ∧ FOCS < 5.5 ∨ STOC < 1.5 ∧ SODA < 0.5 ←→ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ OdedGoldreich < 0.5
J = 0.864, E1,1 = 1996, p-val = 0.101: COLT ≥ 1.5 ∨ COLT < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGMODConference < 8.5 ∨ COLT < 1.5 ∧ ICDE < 0.5 ∧ VLDB < 0.5 ←→ LeonardPitt ≥ 0.5 ∨ LeonardPitt < 0.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal < 0.5 ∨ LeonardPitt < 0.5 ∧ SurajitChaudhuri < 0.5 ∧ MichaelJ.Carey < 0.5
J = 0.861, E1,1 = 1992, p-val = 0.157: COLT ≥ 1.5 ∨ COLT < 1.5 ∧ VLDB ≥ 0.5 ∧ SIGMODConference < 7.5 ∨ COLT < 1.5 ∧ VLDB < 0.5 ∧ ICDE < 0.5 ←→ StephenA.Fenner ≥ 3 ∨ StephenA.Fenner < 3 ∧ MichaelJ.Carey ≥ 0.5 ∧ DiveshSrivastava < 0.5 ∨ StephenA.Fenner < 3 ∧ MichaelJ.Carey < 0.5 ∧ Hans-PeterKriegel < 7.5
J = 0.842, E1,1 = 717, p-val = 0: VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDM < 0.5 ←→ RakeshAgrawal ≥ 0.5 ∨ RakeshAgrawal < 0.5 ∧ HamidPirahesh ≥ 0.5 ∧ JiaweiHan < 0.5
J = 0.839, E1,1 = 1929, p-val = 0.026: STOC < 1.5 ∨ STOC ≥ 1.5 ∧ FOCS < 2.5 ←→ AviWigderson < 0.5 ∨ AviWigderson ≥ 0.5 ∧ OdedGoldreich < 0.5
J = 0.822, E1,1 = 1900, p-val = 0.095: KDD < 0.5 ←→ JiaweiHan < 0.5
J = 0.815, E1,1 = 1876, p-val = 0.054: PODS < 0.5 ←→ AbrahamSilberschatz < 0.5
J = 0.789, E1,1 = 830, p-val = 0: SIGMODConference ≥ 3.5 ∨ SIGMODConference < 3.5 ∧ VLDB < 0.5 ∧ ICDE ≥ 0.5 ∨ SIGMODConference < 3.5 ∧ VLDB ≥ 0.5 ∧ PKDD < 0.5 ←→ FlipKorn ≥ 0.5 ∨ FlipKorn < 0.5 ∧ MichaelJ.Carey < 0.5 ∧ Hans-PeterKriegel ≥ 7.5 ∨ FlipKorn < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∧ JinyanLi < 1
J = 0.734, E1,1 = 80, p-val = 0: ICML ≥ 3.5 ∧ ICDE < 0.5 ∨ ICML < 3.5 ∧ ECML ≥ 1 ∧ PKDD ≥ 1.5 ∨ ICDE ≥ 0.5 ∧ ICDM < 0.5 ←→ DoinaPrecup ≥ 1.5 ∧ JiaweiHan < 0.5 ∨ DoinaPrecup < 1.5 ∧ SasoDzeroski ≥ 0.5 ∧ StefanKramer ≥ 0.5 ∨ JiaweiHan ≥ 0.5 ∧ Wei-YingMa < 0.5
J = 0.733, E1,1 = 641, p-val = 0: STOC ≥ 1.5 ∨ STOC < 1.5 ∧ FOCS < 0.5 ∧ SODA ≥ 1.5 ∨ STOC < 1.5 ∧ FOCS ≥ 0.5 ∧ PODS < 0.5 ←→ AviWigderson ≥ 0.5 ∨ AviWigderson < 0.5 ∧ FrankThomsonLeighton < 0.5 ∧ PiotrIndyk ≥ 0.5 ∨ AviWigderson < 0.5 ∧ FrankThomsonLeighton ≥ 0.5 ∧ PhokionG.Kolaitis < 0.5
J = 0.698, E1,1 = 1586, p-val = 0.02: STOC ≥ 1.5 ∧ FOCS < 0.5 ∨ STOC < 1.5 ∧ SODA < 1.5 ∨ SODA ≥ 1.5 ∧ VLDB < 1.5 ←→ MichaelE.Saks ≥ 0.5 ∧ AviWigderson < 0.5 ∨ MichaelE.Saks < 0.5 ∧ SudiptoGuha < 0.5
J = 0.676, E1,1 = 606, p-val = 0: VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ SDM < 0.5 ←→ WolfgangKafer ≥ 0.5 ∨ WolfgangKafer < 0.5 ∧ HamidPirahesh ≥ 0.5 ∧ PhilipS.Yu < 3.5
J = 0.674, E1,1 = 58, p-val = 0: ECML ≥ 2.5 ∧ VLDB < 0.5 ∨ ECML < 2.5 ∧ ICML ≥ 0.5 ∧ UAI ≥ 3.5 ∨ VLDB ≥ 0.5 ∧ ICDT < 3 ←→ StefanKramer ≥ 1.5 ∧ JianPei < 0.5 ∨ StefanKramer < 1.5 ∧ SatinderP.Singh ≥ 0.5 ∧ DavidHeckerman ≥ 0.5 ∨ JianPei ≥ 0.5 ∧ MichaelBenedikt < 0.5
J = 0.649, E1,1 = 1488, p-val = 0.084: ICDE < 0.5 ←→ SurajitChaudhuri < 0.5
J = 0.637, E1,1 = 1399, p-val = 0: FOCS < 10.5 ∧ STOC < 6.5 ∧ SIGMODConference < 0.5 ∨ FOCS < 10.5 ∧ STOC ≥ 6.5 ∧ ICDT < 0.5 ∨ FOCS ≥ 10.5 ∧ SODA ≥ 2.5 ∧ VLDB ≥ 3 ←→ RichardM.Karp < 1.5 ∧ AviWigderson < 0.5 ∧ HamidPirahesh < 0.5 ∨ RichardM.Karp < 1.5 ∧ AviWigderson ≥ 0.5 ∧ HectorGarcia-Molina < 0.5 ∨ RichardM.Karp ≥ 1.5 ∧ S.Muthukrishnan ≥ 0.5 ∧ JosephNaor ≥ 3.5
J = 0.621, E1,1 = 195, p-val = 0: SDM ≥ 1.5 ∨ SDM < 1.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ←→ PhilipS.Yu ≥ 4.5 ∨ PhilipS.Yu < 4.5 ∧ VipinKumar ≥ 0.5 ∧ SunilPrabhakar < 1.5
J = 0.571, E1,1 = 4, p-val = 0: ICDM ≥ 8.5 ∧ SIGMODConference ≥ 3.5 ∧ ICDT < 1.5 ←→ JiongYang ≥ 2.5 ∧ JianPei ≥ 2.5 ∧ ChristianBohm < 11
J = 0.564, E1,1 = 211, p-val = 0: SODA ≥ 0.5 ∧ STACS < 2.5 ∨ SODA < 0.5 ∧ STOC ≥ 0.5 ∧ FOCS ≥ 6.5 ←→ NicoleImmorlica ≥ 0.5 ∧ HarryBuhrman < 0.5 ∨ NicoleImmorlica < 0.5 ∧ AviWigderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5
J = 0.558, E1,1 = 373, p-val = 0: SODA ≥ 5.5 ∨ SODA < 5.5 ∧ FOCS < 2.5 ∧ STOC ≥ 1.5 ∨ SODA < 5.5 ∧ FOCS ≥ 2.5 ∧ PODS < 1.5 ←→ MosesCharikar ≥ 0.5 ∨ MosesCharikar < 0.5 ∧ AviWigderson < 0.5 ∧ FrankThomsonLeighton ≥ 0.5 ∨ MosesCharikar < 0.5 ∧ AviWigderson ≥ 0.5 ∧ PhokionG.Kolaitis < 0.5
J = 0.417, E1,1 = 5, p-val = 0: UAI ≥ 17 ←→ DavidMaxwellChickering ≥ 0.5
J = 0.411, E1,1 = 92, p-val = 0: SDM ≥ 0.5 ∧ SIGMODConference < 4 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 0.5 ∨ SIGMODConference ≥ 4 ∧ VLDB < 2.5 ←→ ArindamBanerjee ≥ 2 ∧ MartinL.Kersten < 0.5 ∨ ArindamBanerjee < 2 ∧ VipinKumar ≥ 0.5 ∧ PhilipS.Yu ≥ 0.5 ∨ MartinL.Kersten ≥ 0.5 ∧ JiaweiHan ≥ 2.5
J = 0.342, E1,1 = 13, p-val = 0: PODS ≥ 2.5 ∨ PODS < 2.5 ∧ ICDT ≥ 0.5 ∧ STACS ≥ 0.5 ←→ CatrielBeeri ≥ 0.5 ∨ CatrielBeeri < 0.5 ∧ LeonidLibkin ≥ 0.5 ∧ ThomasSchwentick ≥ 0.5
J = 0.333, E1,1 = 31, p-val = 0: STOC ≥ 1.5 ∨ STOC < 1.5 ∧ FOCS < 0.5 ∧ ICDT ≥ 3.5 ∨ STOC < 1.5 ∧ FOCS ≥ 0.5 ∧ SODA ≥ 12.5 ←→ AviWigderson ≥ 0.5 ∨ AviWigderson < 0.5 ∧ NogaAlon < 0.5 ∧ EmmanuelWaller ≥ 0.5 ∨ AviWigderson < 0.5 ∧ NogaAlon ≥ 0.5 ∧ MartinFarach ≥ 2.5
J = 0.313, E1,1 = 5, p-val = 0: ICDT ≥ 4.5 ←→ TovaMilo ≥ 2.5
J = 0.253, E1,1 = 19, p-val = 0: SODA < 5.5 ∧ FOCS ≥ 0.5 ∧ STOC ≥ 10 ∨ SODA ≥ 5.5 ∧ VLDB ≥ 2 ∧ WWW < 4 ←→ MosesCharikar < 0.5 ∧ AviWigderson ≥ 0.5 ∧ OdedGoldreich ≥ 0.5 ∨ MosesCharikar ≥ 0.5 ∧ SureshVenkatasubramanian ≥ 1.5 ∧ T.S.Jayram < 4
J = 0.245, E1,1 = 39, p-val = 0: SODA ≥ 5.5 ∨ SODA < 5.5 ∧ FOCS ≥ 0.5 ∧ STOC ≥ 8.5 ←→ MosesCharikar ≥ 0.5 ∨ MosesCharikar < 0.5 ∧ AviWigderson ≥ 0.5 ∧ MoniNaor ≥ 0.5
J = 0.224, E1,1 = 24, p-val = 0: ECML ≥ 2.5 ∨ ECML < 2.5 ∧ ICML < 0.5 ∧ COLT ≥ 3.5 ∨ ECML < 2.5 ∧ ICML ≥ 0.5 ∧ FOCS ≥ 0.5 ←→ StefanKramer ≥ 0.5 ∨ StefanKramer < 0.5 ∧ RobertE.Schapire < 0.5 ∧ RolfWiehagen ≥ 0.5 ∨ StefanKramer < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ ManfredK.Warmuth ≥ 0.5
J = 0.22, E1,1 = 29, p-val = 0: SIGMODConference ≥ 2.5 ∨ SIGMODConference < 2.5 ∧ VLDB < 0.5 ∧ STACS ≥ 8.5 ∨ SIGMODConference < 2.5 ∧ VLDB ≥ 0.5 ∧ ICDE ≥ 8.5 ←→ DiveshSrivastava ≥ 0.5 ∨ DiveshSrivastava < 0.5 ∧ DavidJ.DeWitt < 0.5 ∧ StephenA.Fenner ≥ 0.5 ∨ DiveshSrivastava < 0.5 ∧ DavidJ.DeWitt ≥ 0.5 ∧ JosephM.Hellerstein ≥ 0.5
J = 0.212, E1,1 = 25, p-val = 0: UAI ≥ 0.5 ∨ UAI < 0.5 ∧ ICML < 0.5 ∧ COLT ≥ 3.5 ∨ UAI < 0.5 ∧ ICML ≥ 0.5 ∧ FOCS ≥ 0.5 ←→ JudeaPearl ≥ 0.5 ∨ JudeaPearl < 0.5 ∧ RobertE.Schapire < 0.5 ∧ RolfWiehagen ≥ 0.5 ∨ JudeaPearl < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ ManfredK.Warmuth ≥ 0.5
J = 0.203, E1,1 = 41, p-val = 0: STACS ≥ 0.5 ∨ STACS < 0.5 ∧ FOCS ≥ 0.5 ∧ STOC ≥ 8.5 ←→ HarryBuhrman ≥ 0.5 ∨ HarryBuhrman < 0.5 ∧ AviWigderson ≥ 0.5 ∧ MoniNaor ≥ 0.5
J = 0.2, E1,1 = 26, p-val = 0: PODS ≥ 3.5 ∨ PODS < 3.5 ∧ ICDT < 0.5 ∧ WWW ≥ 3.5 ∨ PODS < 3.5 ∧ ICDT ≥ 0.5 ∧ STOC ≥ 13 ←→ LeonidLibkin ≥ 0.5 ∨ LeonidLibkin < 0.5 ∧ YehoshuaSagiv < 0.5 ∧ SridharRajagopalan ≥ 0.5 ∨ LeonidLibkin < 0.5 ∧ YehoshuaSagiv ≥ 0.5 ∧ MihalisYannakakis ≥ 1.5
J = 0.2, E1,1 = 2, p-val = 0: KDD ≥ 8.5 ∧ WWW ≥ 0.5 ∧ VLDB ≥ 2.5 ←→ KeWang ≥ 1.5 ∧ Kun-LungWu ≥ 1 ∧ JaySethuraman ≥ 0.5
J = 0.194, E1,1 = 7, p-val = 0: ICDM ≥ 3.5 ∧ ICDE ≥ 4.5 ∨ ICDM < 3.5 ∧ SDM ≥ 0.5 ∧ SIGMODConference ≥ 2.5 ←→ ShengMa ≥ 0.5 ∧ RaymondT.Ng ≥ 0.5 ∨ ShengMa < 0.5 ∧ PhilipS.Yu ≥ 2.5 ∧ JeffreyXuYu ≥ 1.5
J = 0.192, E1,1 = 37, p-val = 0: WWW ≥ 2.5 ∨ WWW < 2.5 ∧ EDBT < 5.5 ∧ PODS ≥ 2.5 ∨ WWW < 2.5 ∧ EDBT ≥ 5.5 ∧ SIGMODConference < 14.5 ←→ AndrewTomkins ≥ 0.5 ∨ AndrewTomkins < 0.5 ∧ RadekVingralek < 0.5 ∧ YehoshuaSagiv ≥ 0.5 ∨ AndrewTomkins < 0.5 ∧ RadekVingralek ≥ 0.5 ∧ AshishGupta < 0.5
J = 0.185, E1,1 = 24, p-val = 0: ECML ≥ 2.5 ∨ ECML < 2.5 ∧ ICML < 0.5 ∧ COLT ≥ 3.5 ∨ ECML < 2.5 ∧ ICML ≥ 0.5 ∧ FOCS ≥ 0.5 ←→ PeterGrunwald ≥ 1.5 ∨ PeterGrunwald < 1.5 ∧ RobertE.Schapire < 0.5 ∧ RolfWiehagen ≥ 0.5 ∨ PeterGrunwald < 1.5 ∧ RobertE.Schapire ≥ 0.5 ∧ ManfredK.Warmuth ≥ 0.5
J = 0.165, E1,1 = 44, p-val = 0: SODA ≥ 0.5 ∨ SODA < 0.5 ∧ STOC ≥ 0.5 ∧ FOCS ≥ 6.5 ←→ ErikD.Demaine ≥ 0.5 ∨ ErikD.Demaine < 0.5 ∧ AviWigderson ≥ 0.5 ∧ MadhuSudan ≥ 0.5
J = 0.162, E1,1 = 23, p-val = 0: ECML ≥ 1.5 ∨ ECML < 1.5 ∧ ICML < 0.5 ∧ COLT ≥ 3.5 ∨ ECML < 1.5 ∧ ICML ≥ 0.5 ∧ FOCS ≥ 0.5 ←→ JuhoRousu ≥ 0.5 ∨ JuhoRousu < 0.5 ∧ RobertE.Schapire < 0.5 ∧ RolfWiehagen ≥ 0.5 ∨ JuhoRousu < 0.5 ∧ RobertE.Schapire ≥ 0.5 ∧ ManfredK.Warmuth ≥ 0.5
J = 0.154, E1,1 = 49, p-val = 0: EDBT ≥ 2.5 ∨ EDBT < 2.5 ∧ ICDE < 12.5 ∧ SIGMODConference ≥ 3.5 ∨ EDBT < 2.5 ∧ ICDE ≥ 12.5 ∧ PODS ≥ 3.5 ←→ YuriBreitbart ≥ 0.5 ∨ YuriBreitbart < 0.5 ∧ FlipKorn < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∨ YuriBreitbart < 0.5 ∧ FlipKorn ≥ 0.5 ∧ RakeshAgrawal ≥ 1.5
J = 0.143, E1,1 = 2, p-val = 0: PKDD ≥ 4.5 ←→ ArnoJ.Knobbe ≥ 0.5
J = 0.143, E1,1 = 11, p-val = 0: ICML ≥ 4.5 ∨ ICML < 4.5 ∧ ECML ≥ 0.5 ∧ PKDD ≥ 1.5 ←→ TomM.Mitchell ≥ 0.5 ∨ TomM.Mitchell < 0.5 ∧ SasoDzeroski ≥ 0.5 ∧ StefanKramer ≥ 1.5
J = 0.138, E1,1 = 30, p-val = 0: ICDE ≥ 0.5 ∨ ICDE < 0.5 ∧ VLDB ≥ 0.5 ∧ SIGMODConference ≥ 7.5 ←→ SurajitChaudhuri ≥ 0.5 ∨ SurajitChaudhuri < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∧ RaymondT.Ng ≥ 0.5
J = 0.132, E1,1 = 23, p-val = 0: PODS ≥ 0.5 ∨ PODS < 0.5 ∧ ICDT ≥ 0.5 ∧ VLDB ≥ 15.5 ←→ AbrahamSilberschatz ≥ 0.5 ∨ AbrahamSilberschatz < 0.5 ∧ YehoshuaSagiv ≥ 0.5 ∧ JosephM.Hellerstein ≥ 0.5
J = 0.125, E1,1 = 57, p-val = 0: ICDT ≥ 0.5 ∨ ICDT < 0.5 ∧ PODS < 0.5 ∧ SIGMODConference ≥ 1.5 ∨ ICDT < 0.5 ∧ PODS ≥ 0.5 ∧ VLDB ≥ 9.5 ←→ DanSuciu ≥ 0.5 ∨ DanSuciu < 0.5 ∧ AbrahamSilberschatz < 0.5 ∧ HamidPirahesh ≥ 0.5 ∨ DanSuciu < 0.5 ∧ AbrahamSilberschatz ≥ 0.5 ∧ H.V.Jagadish ≥ 2.5
J = 0.115, E1,1 = 7, p-val = 0: STACS ≥ 4.5 ←→ RiccardoSilvestri ≥ 0.5
J = 0.114, E1,1 = 39, p-val = 0: ICDM ≥ 2.5 ∨ ICDM < 2.5 ∧ SDM < 3.5 ∧ ICDE ≥ 3.5 ∨ ICDM < 2.5 ∧ SDM ≥ 3.5 ∧ VLDB ≥ 11 ←→ QiangYang ≥ 1.5 ∨ QiangYang < 1.5 ∧ JianyongWang < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∨ QiangYang < 1.5 ∧ JianyongWang ≥ 0.5 ∧ WeiWang ≥ 6
J = 0.114, E1,1 = 39, p-val = 0: WWW ≥ 1.5 ∨ WWW < 1.5 ∧ ICDE < 0.5 ∧ VLDB ≥ 0.5 ∨ WWW < 1.5 ∧ ICDE ≥ 0.5 ∧ SIGMODConference ≥ 8.5 ←→ DanielF.Lieuwen ≥ 5 ∨ DanielF.Lieuwen < 5 ∧ SurajitChaudhuri < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∨ DanielF.Lieuwen < 5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5
J = 0.108, E1,1 = 70, p-val = 0: ICDT ≥ 1.5 ∨ ICDT < 1.5 ∧ PODS < 3.5 ∧ VLDB ≥ 1.5 ∨ ICDT < 1.5 ∧ PODS ≥ 3.5 ∧ SIGMODConference ≥ 2.5 ←→ GeorgGottlob ≥ 0.5 ∨ GeorgGottlob < 0.5 ∧ CatrielBeeri < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∨ GeorgGottlob < 0.5 ∧ CatrielBeeri ≥ 0.5 ∧ AbrahamSilberschatz ≥ 0.5
J = 0.108, E1,1 = 93, p-val = 0: ICDM ≥ 0.5 ∧ ICML < 2.5 ∨ ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ VLDB ≥ 0.5 ←→ VipinKumar ≥ 0.5 ∧ DmitryPavlov < 2.5 ∨ VipinKumar < 0.5 ∧ JiaweiHan < 0.5 ∧ GioWiederhold ≥ 0.5 ∨ VipinKumar < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5
J = 0.102, E1,1 = 33, p-val = 0: EDBT ≥ 0.5 ∨ EDBT < 0.5 ∧ ICDE ≥ 0.5 ∧ SIGMODConference ≥ 8.5 ←→ RakeshAgrawal ≥ 1.5 ∨ RakeshAgrawal < 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∧ RaghuRamakrishnan ≥ 0.5
J = 0.101, E1,1 = 42, p-val = 0: SIGMODConference ≥ 1.5 ∨ SIGMODConference < 1.5 ∧ VLDB ≥ 0.5 ∧ ICDE ≥ 4.5 ←→ HamidPirahesh ≥ 0.5 ∨ HamidPirahesh < 0.5 ∧ MichaelJ.Carey ≥ 0.5 ∧ RaymondT.Ng ≥ 0.5
J = 0.095, E1,1 = 20, p-val = 0: SDM ≥ 0.5 ∨ SDM < 0.5 ∧ ICDM ≥ 0.5 ∧ KDD ≥ 1.5 ←→ HuanLiu ≥ 1.5 ∨ HuanLiu < 1.5 ∧ WeiFan ≥ 0.5 ∧ HaixunWang ≥ 0.5
J = 0.089, E1,1 = 40, p-val = 0: ICDE ≥ 1.5 ∨ ICDE < 1.5 ∧ VLDB ≥ 2.5 ∧ SIGMODConference ≥ 11.5 ←→ NickKoudas ≥ 0.5 ∨ NickKoudas < 0.5 ∧ RakeshAgrawal ≥ 0.5 ∧ JeffreyF.Naughton ≥ 0.5
J = 0.085, E1,1 = 17, p-val = 0: PKDD ≥ 0.5 ∨ PKDD < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 2.5 ←→ XingquanZhu ≥ 3 ∨ XingquanZhu < 3 ∧ JiaweiHan ≥ 1.5 ∧ RaymondT.Ng ≥ 0.5
J = 0.084, E1,1 = 55, p-val = 0: EDBT ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE < 2.5 ∧ VLDB ≥ 1.5 ∨ EDBT < 1.5 ∧ ICDE ≥ 2.5 ∧ SIGMODConference ≥ 1.5 ←→ ElenaBaralis ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish < 0.5 ∧ BruceG.Lindsay ≥ 0.5 ∨ ElenaBaralis < 0.5 ∧ H.V.Jagadish ≥ 0.5 ∧ HectorGarcia-Molina ≥ 0.5
J = 0.084, E1,1 = 12, p-val = 0: KDD ≥ 2.5 ∨ KDD < 2.5 ∧ SDM ≥ 0.5 ∧ ICDE ≥ 4.5 ←→ XiongWang ≥ 2 ∨ XiongWang < 2 ∧ HaixunWang ≥ 0.5 ∧ WeiWang ≥ 3
J = 0.081, E1,1 = 37, p-val = 0: SIGMODConference ≥ 5.5 ∨ SIGMODConference < 5.5 ∧ VLDB < 4.5 ∧ ICDE ≥ 1.5 ∨ SIGMODConference < 5.5 ∧ VLDB ≥ 4.5 ∧ PODS ≥ 7.5 ←→ RaymondT.Ng ≥ 0.5 ∨ RaymondT.Ng < 0.5 ∧ DiveshSrivastava < 0.5 ∧ NickKoudas ≥ 0.5 ∨ RaymondT.Ng < 0.5 ∧ DiveshSrivastava ≥ 0.5 ∧ InderpalSinghMumick ≥ 0.5
J = 0.081, E1,1 = 9, p-val = 0: ICML ≥ 1.5 ∨ ICML < 1.5 ∧ ECML ≥ 0.5 ∧ PKDD ≥ 1.5 ←→ AndrewW.Moore ≥ 0.5 ∨ AndrewW.Moore < 0.5 ∧ NadaLavrac ≥ 0.5 ∧ RakeshAgrawal ≥ 0.5
J = 0.081, E1,1 = 66, p-val = 0: ICDM < 0.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ VLDB ≥ 0.5 ∨ ICDM ≥ 0.5 ∧ SDM ≥ 0.5 ∧ EDBT ≥ 1.5 ←→ VipinKumar < 0.5 ∧ JiaweiHan < 1.5 ∧ GioWiederhold ≥ 0.5 ∨ VipinKumar < 0.5 ∧ JiaweiHan ≥ 1.5 ∧ SurajitChaudhuri ≥ 0.5 ∨ VipinKumar ≥ 0.5 ∧ PhilipS.Yu ≥ 0.5 ∧ XueminLin ≥ 0.5
J = 0.079, E1,1 = 70, p-val = 0: PKDD ≥ 2.5 ∨ PKDD < 2.5 ∧ KDD < 0.5 ∧ ICDE ≥ 0.5 ∨ PKDD < 2.5 ∧ KDD ≥ 0.5 ∧ VLDB ≥ 0.5 ←→ StephaneLallich ≥ 0.5 ∨ StephaneLallich < 0.5 ∧ JiaweiHan < 0.5 ∧ GioWiederhold ≥ 0.5 ∨ StephaneLallich < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ SurajitChaudhuri ≥ 0.5
J = 0.068, E1,1 = 10, p-val = 0: ICML ≥ 0.5 ∨ ICML < 0.5 ∧ ECML ≥ 0.5 ∧ UAI ≥ 3.5 ←→ PeterA.Flach ≥ 0.5 ∨ PeterA.Flach < 0.5 ∧ StephenMuggleton ≥ 0.5 ∧ NirFriedman ≥ 0.5
J = 0.068, E1,1 = 48, p-val = 0: VLDB ≥ 1.5 ∨ VLDB < 1.5 ∧ SIGMODConference < 0.5 ∧ ICDT ≥ 0.5 ∨ VLDB < 1.5 ∧ SIGMODConference ≥ 0.5 ∧ ICDE ≥ 0.5 ←→ RakeshAgrawal ≥ 0.5 ∨ RakeshAgrawal < 0.5 ∧ HectorGarcia-Molina < 0.5 ∧ DilysThomas ≥ 1.5 ∨ RakeshAgrawal < 0.5 ∧ HectorGarcia-Molina ≥ 0.5 ∧ JiaweiHan ≥ 0.5
J = 0.068, E1,1 = 20, p-val = 0: KDD ≥ 0.5 ∨ KDD < 0.5 ∧ ICDM ≥ 0.5 ∧ VLDB ≥ 0.5 ←→ YasuhikoMorimoto ≥ 0.5 ∨ YasuhikoMorimoto < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ JianPei ≥ 0.5
J = 0.065, E1,1 = 20, p-val = 0: ICDM ≥ 0.5 ∨ ICDM < 0.5 ∧ KDD ≥ 0.5 ∧ ICDE ≥ 0.5 ←→ GeoffreyI.Webb ≥ 0.5 ∨ GeoffreyI.Webb < 0.5 ∧ JiaweiHan ≥ 0.5 ∧ PhilipS.Yu ≥ 3.5
J = 0.062, E1,1 = 52, p-val = 0: VLDB ≥ 5.5 ∨ VLDB < 5.5 ∧ SIGMODConference < 7.5 ∧ ICDE ≥ 0.5 ∨ VLDB < 5.5 ∧ SIGMODConference ≥ 7.5 ∧ PODS ≥ 3.5 ←→ PhilipA.Bernstein ≥ 0.5 ∨ PhilipA.Bernstein < 0.5 ∧ RakeshAgrawal < 0.5 ∧ GioWiederhold ≥ 0.5 ∨ PhilipA.Bernstein < 0.5 ∧ RakeshAgrawal ≥ 0.5 ∧ FlipKorn ≥ 2
J = 0.058, E1,1 = 33, p-val = 0: STOC < 1.5 ∧ FOCS ≥ 0.5 ∨ STOC ≥ 1.5 ∧ SODA ≥ 0.5 ∧ WWW ≥ 2.5 ←→ AviWigderson < 0.5 ∧ YossiAzar ≥ 0.5 ∨ AviWigderson ≥ 0.5 ∧ MosesCharikar ≥ 0.5
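The row statistics themselves can be recomputed from the Boolean support vectors of the two queries. The helper below is a minimal sketch: J and E1,1 follow the definitions restated after Table B.4, while the p-value is modeled as a binomial tail under the assumption of independent supports, which mirrors a common significance test in redescription mining but is not necessarily the exact test behind these tables.

    # Recompute J, E1,1, and an (assumed) binomial-tail p-value for one row.
    import numpy as np
    from scipy.stats import binom

    def row_statistics(lhs: np.ndarray, rhs: np.ndarray):
        n = lhs.size
        e11 = int(np.sum(lhs & rhs))           # E1,1: entities in both supports
        union = int(np.sum(lhs | rhs))
        j = e11 / union if union else 0.0      # Jaccard similarity
        # Probability of at least e11 joint successes if the two supports
        # were drawn independently with their observed marginal frequencies.
        p_joint = (lhs.sum() / n) * (rhs.sum() / n)
        p_val = float(binom.sf(e11 - 1, n, p_joint))
        return j, e11, p_val

    # Toy support vectors over a hypothetical set of 2345 authors.
    rng = np.random.default_rng(0)
    lhs = rng.random(2345) < 0.1
    rhs = rng.random(2345) < 0.1
    print(row_statistics(lhs, rhs))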