Download Discovering Relational Emerging Patterns

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Discovering Relational Emerging Patterns
Annalisa Appice, Michelangelo Ceci, Carlo Malgieri, and Donato Malerba
Dipartimento di Informatica, Università degli Studi di Bari
via Orabona, 4 - 70126 Bari - Italy
{appice,ceci,malerba}@di.uniba.it
Abstract. The discovery of emerging patterns (EPs) is a descriptive
data mining task defined for pre-classified data. It aims at detecting patterns which contrast two classes and has been extensively investigated
for attribute-value representations. In this work we propose a method,
named Mr-EP, which discovers EPs from data scattered in multiple tables of a relational database. Generated EPs can capture the differences
between objects of two classes which involve properties possibly spanned
in separate data tables. We implemented Mr-EP in a pre-existing multirelational data mining system which is tightly integrated with a relational
DBMS, and then we tested it on two sets of geo-referenced data.
1
Introduction
The discovery of emerging patterns (EPs) is a descriptive data mining task
aiming at the detection of significant differences between objects belonging to
separate classes. EPs are introduced in [4] as a particular kind of patterns (or
multi-variate features) whose support significantly changes from one data class
to another: the larger the difference of pattern support, the more interesting the
patterns. Due to the sharp change in support, EPs can be used to characterize
object classes. For example, EPs have been used to predict the likelihood of diseases such as acute lymphoblastic leukemia [13] and to explore high-dimensional
data such as gene expression data [12].
Several algorithms [18,4,10] have been proposed to discover EPs from data
belonging to separate classes (data populations) and stored in a single relational
table. Independent units of each data population Di are described by a fixed
vector S of explanatory attributes X1 , X2 , . . . , Xm and are tagged with a class
label Y = Ci . The EPs which distinguish a target data population Di from the
background data Dj are in the form P (GRDj →Di (P )), where P is a set of items
(P ⊆ S) and GRDj →Di (P ) is the support ratio (or growth rate) of P over Dj
sDi (P )
to Di . Formally GRDj →Di (P ) = sD
(P ) , where sDi (P ) (sDj (P )) is the support
j
of P on Di (Dj ). Since an item refers to an attribute-value pair, the itemset P
can be interpreted as a conjunction of attribute values. Formally, given a growth
rate threshold minGR ≥ 1, an EP from Dj to Di is an itemset P whose growth
rate from Dj to Di is greater than minGR.
Although research on EPs has reached relative maturity over the last years,
there is still a number of interesting issues which remain open. One issue concerns
R. Basili and M.T. Pazienza (Eds.): AI*IA 2007, LNAI 4733, pp. 206–217, 2007.
c Springer-Verlag Berlin Heidelberg 2007
Discovering Relational Emerging Patterns
207
the need to face the challenges of real-world data mining tasks involving complex
and heterogeneous data with different properties which are modeled by as many
relations as the number of object types. Mining data scattered over the multiple
tables of a relational database (relational data) poses the problem of taking into
account attributes of related (i.e. task-relevant ) objects when investigating properties of some reference objects which are the main subject of analysis. Classical
EPs discovery methods do not distinguish task-relevant from reference objects,
nor do they allow the representation of any kind of interaction. Therefore, we
propose to resort to a Multi-Relational Data Mining (MRDM) approach [6] in
order to deal with both relational data and relational patterns.
In this paper, we propose a novel method, called Mr-EP (M ulti-Relational
E merging P atterns), which discovers EPs from relational data and is capable to
capture the change in properties of separate classes of data spanned in multiple
data tables. The class variable is associated with the reference objects, while
explanatory attributes refer to either the reference objects or the task-relevant
objects which are someway related to the reference objects. The structural information required to mine such relational EPs can be automatically obtained
from the database schema by navigating foreign key constraints. For each class,
relational EPs are expressed as SQL queries stored in XML format.
The paper is organized as follows. In the next section the background of this
research and related works are discussed, while in Section 3 the problem of EPs
discovery is formalized in the multi-relational framework. Relational emerging
patterns discovery is described in Section 4. Lastly, experimental results are
reported in Section 5 and then some conclusions are drawn.
2
Related Works
The combination of relational representation with pattern discovery has been
deeply investigated for several data mining tasks. Data mining research has
provided several solutions for the task of frequent pattern and association rule
discovery both in a propositional and a relational setting, but, at the best of our
knowledge, this work represents the first attempt at extracting relational EPs.
In [1], a frequent pattern is defined as an itemset whose support is greater than
a predefined minimum threshold value (minimum support), while an association
rule is an implication in the form A ⇒ C(s, c), where A and C are itemsets and
A∩C = . The support s provides an estimate of the probability p(A∪C), while
the confidence c provides an estimate of the probability p(C|A). An association
rule A → C (s%, c%) is strong if the pattern A ∪ C (s%) is frequent and the
confidence of the rule is greater than a predefined minimum threshold value
(minimum confidence).
The two best known association rule discovery methods defined in the relational framework are WARMR [3] and SPADA [14]. They are both based on an
Inductive Logic Programming (ILP) approach, where both data and background
(or domain) knowledge is represented in a first-order logic formalism, such as
Horn clausal logic. Relational frequent patterns are discovered according to the
208
A. Appice et al.
levelwise method described in [16], which consists in a level-by-level exploration
of the lattice of patterns ordered by θ-subsumption [17]. Strong association rules
are then generated from frequent patterns.
Unlike frequent patterns and/or association rules which capture regularities
in data describing unclassified objects, EPs capture changes in data describing
objects of different classes. This adds one main source of complexity to the
learning task, since the monotonicity property does not hold for EPs. Suppose
a pattern P is not an EP from Dj to Di , that is, the growth rate of P from Dj
to Di is not greater than the user-defined threshold. For any super-pattern Q of
P (P ⊆ Q), its support is less than or equal to that of P for both classes Ci and
Cj , while its growth rate (the support ratio) is free to be any real value between
0 and ∞. Therefore, a superset of a non-EP may or may not be an EP.
In the seminal work by Dong and Li [4], EPs are discovered by assuming
that each data set is stored in a single data table. A border-based approach is
adopted to discover the EPs discriminating between separate classes. Borders
are used to represent both candidates and subsets of EPs; the border differential
operation is then used to discover the EPs. Zhang et al. [18] have described an
efficient method, called ConsEPMiner, which adopts a level-wise generate-andtest approach to discover EPs which satisfy several constraints (e.g., growth-rate
improvement). Finally, Fan and Ramamohanarao [7] have proposed a method
which improves the efficiency of EPs discovery by adopting a CP-tree data structure to register the counts of both the positive and negative class.
A further direction of research concerns the usage of EPs in learning accurate
data classifiers [5,8,11,9]. EP-based classification is related to the associative
classification framework [15] where classifiers are built by carefully selecting high
quality association rules. The advantage of EPs over association rules is that EPs
provide features which better discriminate objects of distinct classes.
3
Problem Definition
In this work we assume that both reference objects and task-relevant objects
are tuples stored in tables of a relational database D according to a schema S.
The set R of reference objects is the collection of tuples stored in a table T of
D called target table. Similarly, each set Ri of task-relevant objects corresponds
to a distinct table of D. The inherent “structure” of data, that is, the relations
between reference and task-relevant objects, is expressed in the schema S by
foreign key constraints (F K). Foreign keys make it possible to navigate the data
schema and retrieve all the task-relevant objects in D which are related to a
reference object and, thus, are capable of discriminating between the values of
the target attribute Y .
Before providing a formal definition of the problem to be solved, some other
definitions need to be introduced.
Definition 1 (Key predicate). Let S be a database schema and T be a table of S representing the target table for the task at hand. The “key predicate”
Discovering Relational Emerging Patterns
209
associated with T in S is a first order unary predicate p(t) such that p denotes
the table T and the term t is a variable that represents the primary key of T .
Definition 2 (Structural predicate). Let S be a database schema and {Ti , Tj }
be a pair of tables in S such that there exists a foreign key F K in S between Ti and
Tj . A “structural predicate” associated with the pair of tables {Ti , Tj } in S is a
first order binary predicate p(t, s) such that p denotes F K and the term t (s) is a
variable that represents the primary key of Ti (Tj ).
Definition 3 (Property predicate). Let S be a database schema, Ti a table
of S and AT T be an attribute of Ti which is neither primary key nor foreign
key for Ti in S. A “property predicate” associated with the attribute AT T of the
table Ti is a binary predicate p(t, s) such that p denotes the attribute AT T , the
term t is a variable representing the primary key of Ti and s is a constant which
represents a value belonging to the range of AT T in Ti .
A relational pattern over S is a conjunction of predicates consisting of the key
predicate and one or more (structural or property) predicates over S. More
formally, a relational pattern is defined as follows:
Definition 4 (Relational pattern). Let S be a database schema. A “relational
pattern” P over S is a conjunction of predicates:
p0 (t01 ), p1 (t11 , t12 ), p2 (t21 , t22 ), . . . , pm (tm1 , tm2 )
where p0 (t01 ) is the key predicate associated with the target table of the task at
hand and ∀i = 1, . . . , m pi (ti1 , ti2 ) is either a structural predicate or a property
predicate over S.
Henceforth, we will also use the set notation for relational patterns, that is, a
relational pattern is considered a set of atoms.
Definition 5 (Key linked predicate).
Let P = p0 (t01 ), p1 (t11 , t12 ), p2 (t21 , t22 ), . . . , pm (tm1 , tm2 ) be a relational pattern over the database schema S. For each i = 1, . . . , m, the (structural or property) predicate pi (ti1 , ti2 ) is “key linked” in P if
– pi (ti1 , ti2 ) is a predicate with t01 = ti1 or t01 = ti2 , or
– there exists a structural predicate pj (tj1 , tj2 ) in P such that pj (tj1 , tj2 ) is
key linked in P and ti1 = tj1 ∨ ti2 = tj1 ∨ ti1 = tj2 ∨ ti2 = tj2 .
Definition 6 (Completely linked relational pattern). Let S be a database
schema. A “completely linked” relational pattern is a relational pattern P =
p0 (t01 ), p1 (t11 , t12 ), . . . , pm (tm1 , tm2 ) such that ∀i = 1 . . . m, pi (ti1 , ti2 ) is a
predicate which is key linked in P .
Definition 7 (Relational emerging patterns). Let D be an instance of a
database schema S that contains a set of reference objects labeled with Y ∈
{C1 , . . . , CL } and stored in the target table T of S. Given a minimum growth
210
A. Appice et al.
rate value (minGR) and a minimum support value (minsup), P is a “relational
emerging pattern” in D if P is a completely linked relational pattern over S
and some class label Ci exists such that GRDi →Di (P ) > minGR and sDi (P ) >
minsup, where:
– Di is an instance of database schema S such that Di .T = {t ∈ D.T |D.T.Y =
Ci } and ∀T ∈ S, T = T : Di .T = {t ∈ D.T | all foreign key constraints FK
are satisfied in Di }.
– Di is an instance of database schema S such that Di .T = {t ∈ D.T |D.T.Y =
Ci } and ∀T ∈ S, T = T : Di .T = {t ∈ D.T | all foreign key constraints FK
are satisfied in Di }.
The support sDi (P ) of P on database Di is computed as follows:
sDi (P ) =
|OP |
,
|O|
(1)
where O denotes the set of reference objects stored as tuples of Di .T , while OP
denotes the subset of reference objects in O which are covered by the pattern P .
The growth rate of P for distinguishing Di from Di is the following:
GRDi →Di (P ) =
sDi (P )
sDi (P )
(2)
As in [4], we assume that GR(P ) = 00 = 0 and GR(P ) = >0
0 = ∞.
The problem of discovering relational EPs can now be formalized as follows.
Given:
–
–
–
–
a relational database D with a data schema S,
a set R of reference objects tagged with a class label Y ∈ {C1 , C2 , . . . , CL },
some sets Ri , 1 ≤ i ≤ h of task-relevant objects,
a pair of thresholds, that is, the minimum growth rate (minGR ≥ 1) and
the minimum support (minsup > 0).
Find:
the set of relational emerging patterns which discriminate between reference
objects belonging to distinct classes in D.
In this work, we resort to the relational algebra formalism to express a relational emerging pattern P by means of an SQL query. The SELECT statement
selects primary key values for distinct reference objects of the task at hand. The
FROM statement describes the joins between all the tables of S which are involved in P (i.e., the target table associated with the key predicate and the tables
which are included in separate structural predicates of the pattern P ). A structural predicate is translated into a join condition. The property of linkedness in
relational patterns guarantees the soundness of joins. The WHERE statement
describes the conditions expressed in the property predicates. Reference objects
contributing to the support of the EP on each Di are obtained as result set by
running this SQL query on the database instance Di .
Discovering Relational Emerging Patterns
211
Example 1. Let us consider a set of molecules (reference objects) described in
terms of the “logP” property and the “mutagenicity” class. Each molecule is
composed by one or more atoms (task-relevant objects) and each atom is described by the “charge”. An example of relational pattern P is the following:
molecule(MolID), logPInMolecule(MolId,[5..10[),atom(MolId,AtomId1),
chargeInAtom(AtomId1,[3.2,5.8[),atom(MolID,AtomId2),
chargeInAtom(AtomId2,[5.0,7.1[)
P can be expressed by means of the SQL query:
SELECT distinct M.MolID
FROM (Molecule M INNER JOIN Atom A1 on M.MolId=A1.MolId)
INNER JOIN Atom A2 on M.MolId=A2.MolId
WHERE M.logP>=5 AND M.logP<10 AND
A1.Charge >=3.2 AND A1.Charge<5.8 AND
A2.Charge >=5.0 AND A2.Charge<7.1
4
Relational EPs Discovery
We address EP discovery by adapting the algorithms proposed for frequent pattern discovery to the special case of EPs. The blueprint for the frequent patterns
discovery algorithms is the levelwise method [16] that explores level-by-level the
lattice of patterns ordered according to a generality relation () between patterns. Formally, given two patterns P 1 and P 2, P 1 P 2 denotes that P 1 (P 2)
is more general (specific) than P 2 (P 1). The search proceeds from the the most
general pattern and iteratively alternates the candidate generation and candidate evaluation phases.
In this paper, we propose an enhanced version of the aforementioned levelwise method which works on EPs rather than frequent patterns. The space of
candidate EPs is structured according to the θ-subsumption generality order
[17].
Definition 8 (θ-subsumption). Let P 1 and P 2 be two relational patterns on a
data schema S such that both P 1 and P 2 are key completely linked patterns with
respect to a target table T in S. P 1 θ-subsumes P 2 if and only if a substitution
θ exists such that P 2 θ ⊆ P 1.
Having introduced θ-subsumption, we now go to define generality order between
completely linked relational patterns.
Definition 9 (Generality order under θ-subsumption). Let P 1 and P 2
be two completely linked relational patterns. P 1 is more general than P 2 under
θ-subsumption, denoted as P 1 θ P 2, if and only if P 2 θ-subsumes P 1.
θ-subsumption defines a quasi-ordering, since it satisfies the reflexivity and transitivity property but not the anti-symmetric property. The quasi-ordered set spanned
by θ can then be searched according to a downward refinement operator which
computes the set of refinements for a completely linked relational pattern.
212
A. Appice et al.
Definition 10 (Downward refinement operator under θ-subsumption).
Let G, θ be the space of completely linked relational patterns ordered according
to θ . A downward refinement operator under θ-subsumption is a function ρ such
that ρ(P ) ⊆ {Q ∈ G|P θ Q}.
We now define the downward refinement operator ρ to explore the space of
candidate EPs for distinguishing Di from Di .
Definition 11 (Downward refinement operator for EPs). Let P be a relational EP for distinguishing Di from Di . Then ρ (P ) = {P ∪ {p(t1, t2)}|p(t1, t2)
is a structural or property predicate key linked in P ∪{p(t1, t2)} and P ∪{p(t1, t2)}
is an EP for distinguishing Di from Di }.
The downward refinement operator for EPs is a refinement operator under θsubsumption. In fact, it can be easily proved that P θ Q for all Q ∈ ρ (P ). This
makes Mr-EP able to perform a levelwise exploration of the lattice of EPs ordered
by θ-subsumption. More precisely, for each class Ci , the EPs for distinguishing
Di from Di are discovered by searching the pattern space one level at a time,
starting from the most general EP (the EP that contains only the key predicate)
and iterating between candidate generation and evaluation phases. In Mr-EP,
the number of levels in the lattice to be explored is limited by the user-defined
parameter M AXM ≥ 1. In other terms, M AXM limits the maximum number
of structural predicates (joins) within a candidate EP. Since joins affects the
computational complexity of the method, a low value of M AXM guarantees
the applicability of the algorithm to reasonably large data The monotonicity
property of the generality order θ with respect to the support value (i.e., a
superset of an infrequent pattern cannot be frequent) is exploited to avoid the
generation of infrequent relational patterns. In fact, an infrequent pattern on Di
cannot be an EP for distinguishing Di from Di .
Proposition 1 (Property of θ-subsumption monotonicity). Let G, θ be the space of relational completely linked patterns ordered according to θ . P 1
and P 2 are two patterns of G, θ with P 1 θ P 2 then OP 1 ⊇ OP 2 .
Therefore, when P 1 θ P 2, we have sDi (P 1) ≥ sDi (P 2) and sDi (P 1) ≥ sDi (P 2)
∀i = 1, . . . , L. This is the counterpart of one of the properties exploited in the family
of the Apriori-like algorithms [1] to prune the space of candidate patterns. To efficiently discover relational EPs, Mr-EP prunes the search space by exploiting the
θ-subsumption monotonicity of support (prune1 criterion). Let P be a refinement
of a pattern P . If P is an infrequent pattern on Di (sDi (P ) < minsup), then P has
a support on Di that is lower than the user-defined threshold (minsup). According
to the definition of EP, P cannot be an EP for distinguishing Di from Di , hence
Mr-EP does not refine patterns which are infrequent on Di .
Unluckily, the monotonicity property does not hold for the growth rate: a
refinement of an EP whose growth rate is lower than the threshold minGR
may or may not be an EP. Anyway, as in the propositional case [18], some
mathematical considerations on the growth rate formulation can be usefully
exploited to define two further pruning criteria.
Discovering Relational Emerging Patterns
213
First (prune2 criterion), Mr-EP avoids generating the refinements of a pattern P in the case that GRDi →Di (P ) = ∞ (i.e., sDi (P ) > 0 and sDi (P ) =
0). Indeed, due to the θ-subsumption monotonicity of support ∀P ∈ ρ (P ):
sDi (P ) ≥ sDi (P ) then sDi (P ) = 0. Thereby, GRDi →Di (P ) = 0 in the case
that sDi (P ) = 0, while GRDi →Di (P ) = ∞ in the case that sDi (P ) > 0. In
the former case, P is not worth to be considered (prune1 ). In the latter case,
P θ P and sDi (P ) ≥ sDi (P ). Therefore, P is useless since P has the same
discriminating ability than P (GRDi →Di (P ) = GRDi →Di (P ) = ∞). We prefer
P to P based on the Occams razor principle, according to which all things being
equal, the simplest solution tends to be the best one.
Second (prune3 criterion), Mr-EP avoids generating the refinements of a pattern P which add a property predicate in the case that the refined patterns have
the same support of P on Di . We denote by:
SameSupport(P )Di = {P ∈ ρ (P )|sDi (P ) = sDi (P ), P = P ∧ p(t1, t2),
p(t1, t2) is a property predicate}.
For the monotonicity property, ∀P ∈ SameSupportDi (P ): sDi (P ) ≥ sDi (P ).
This means that GRDi →Di (P ) ≥ GRDi →Di (P ). P is more specific than P
but, at the same time, P has a lower discriminating power than P . This pruning criterion prunes EPs that could be generated as refinements of patterns in
SameSupportDi (P ). However, it is possible that some of them may be of interest
for our discovery process. Their identification is guaranteed by the following:
Proposition 2. Let P ∈ SameSupportDi (P ) such that P = P ∪ {p(t1, t2)}
with p(t1, t2) being a property predicate. Let P ∈ ρ (P ) such that P = P ∪
{q(t3, t4)} with q(t3, t4) being a property predicate. If P is an EP discriminating Di from Di and sDi (P ) = sDi (P ) then P = P ∪ {q(t3, t4)} ∈
/
SameSupportDi (P ).
Proof: Let OPDi denote the set of reference objects in Di covered by a pattern
i
. Since
P . By construction, P ∈ ρ (P ) ∩ ρ (P ) and OPDi = OPDi ∩ OPD
Di
P ∈ SameSupportDi (P ), we have sDi (P ) = sDi (P ), that is, OP = OPDi .
i
⊆ OPDi .
Moreover, for the θ-subsumption monotonicity property, we have OPD
i
i
i
Therefore, we have: OPDi = OPDi ∩ OPD
=OPDi ∩ OPD
=OPD
. Since sDi (P ) =
sDi (P ) by hypothesis, then it is also true that sDi (P ) = sDi (P ). Therefore,
/ SameSupportDi (P ).
P ∈
According to proposition 2, we can prune P (but not P ) without preventing
the generation of EPs more specific than P . It is noteworty to observe that
this pruning criterion operates only when p(t1, t2) is a property predicate. Differently, pruning of structural predicates would avoid the introduction of a new
variable thus avoiding the discovery of further EPs obtained by adding property
or structural predicates involving such variable.
Finally, additional candidates not worth being evaluated are those equivalent
under θ-subsumption to some other candidate (prune4 ).
214
5
A. Appice et al.
Experimental Results
Mr-EP has been implemented as a module of the MRDM system MURENA
(MUlti RElational aNAlyzer) which interfaces the Oracle 10g DBMS. We tested
the method on two real world geo-referenced data sets: the North-West England
Census Data and the Munich Census Data. Both data sets include numeric
attributes, which are handled through an equal-width discretization to partition
the range of values into a fixed number of bins. EPs have been discovered with
minGR = 1.1, minsup = 0.1. M AXM is set to 3 for North-West England
Census Dataset and to 5 for Munich Census Dataset. In this work, we present
only a qualitative interpretation of EPs. Each EP is analyzed in terms of a
human interpretable pattern that is descriptive of characteristics discriminating
between separate classes of relational data.
The North-West England Census Data. Data were obtained from both census and digital maps provided by the European project SPIN! (http://www.ais.
fraunhofer.de/KD/SPIN/project.html). They concern Greater Manchester, one
of the five counties of North West England (NWE). Greater Manchester is divided into into 214 census sections (wards). Census data are available at ward
level and provide socio-economic statistics (e.g. mortality rate) as well as some
measures of the deprivation of each ward according to information provided by
Census combined into single index scores. We employed the Jarman score that
estimates the need for primary care, the indices developed by Townsend and
Carstairs to perform health-related analyses, and the DoE index which is used
in targeting urban regeneration funds. The higher the index value the more
deprived the ward. In this application, the mortality percentage rate (target attribute) takes values in the finite set {low = [0.001, 0.01], high =]0.01, 0.18]}.
The analysis we performed was based on deprivation factors and geographical
factors represented in topographic maps of the area. Vectorized boundaries of
the 1998 census wards as well as of other Ordnance Survey digital maps of NWE
are available for several layers such as urban area (115 lines), green area (9 lines),
road net (1687 lines), rail net (805 lines) and water net (716 lines). Objects of
each layer are stored as tuples of relational tables including information on the
object type (TYPE). For instance, an urban area may be either a “large urban
area” or a “small urban area”. Topological relationships between wards and objects in these layers are materialized as relational tables expressing non-disjoint
relations. The number of materialized “non disjoint” relationships is 5313.
Mr-EP discovered 60 EPs to discriminate high mortality rate wards from the
class of wards with low mortality rate and 55 EPs to discriminate low mortality
rate wards from high mortality rate wards. An example of EP extracted for the
class mortality rate=high is:
wards(A) ∧ wards rails(A, B) ∧ wards doeindex(A, [6.598..9.232])
where wards(A) is the key predicate, wards rails(A, B) is the structural predicate representing an interaction between the ward A and a ward B (this means
that A is crossed by at least one railway) and wards doeindex(A, [6.598..9.232])
(i.e. A is a deprived zone to be considered as target zone for regeneration
Discovering Relational Emerging Patterns
215
fundings) is a property predicate. This pattern presents a support of 0.22 and
growth rate 3.77. This means that wards crossed by railways and with a relatively high doeindex value present a high percentage of mortality. This could be
due to urban decay condition of the area. The pattern corresponds to the SQL
query:
SELECT distinct W.ID
FROM (WARDS W INNER JOIN WARDS RAILS WR on W.ID=WR.WardID)
WHERE W.DOEINDEX <= 9.232 AND W.DOEINDEX >= 6.598
A different conclusion can be drawn from the following relational EP extracted
for the class mortality rate=low :
wards(A) ∧ wards townsendidx(A, [−3.86431.. − 2.01452])
∧ wards greenareas(A, B)
This pattern has a support of 0.113 and a growth rate of 2.864. It captures
the event that a ward with a relative low Townsend deprivation level (i.e., the
ward A cannot be considered as deprived with respect to health-related analyses)
and overlaps at least one green area (i.e., a park) discriminates wards with low
mortality rate from the others.
The Munich Census Data. These data concern the level of monthly rent per
square meter for flats in Munich expressed in German Marks. They have been
collected on 1998 to develop the 1999 Munich rental guide and describe 2180
flats located in the 446 subquarters of Munich obtained by dividing the Munich
metropolitan area up into three areal zones and decomposing each of these zones
into 64 districts. The vectorized boundaries of subquarters, districts and zones
as well as the map of public transport stops (56 subway (U-Bahn) stops, 15 rapid
train (S-Bahn) stops and 1 railway station) within Munich are available for this
study (http://www.di.uniba.it/∼ceci/mic Files/munich db.tar.gz). The objects
included in these layers are stored in different relational tables (SUBQUARTERS, TRANSPORT STOPS and APARTMENTS). Information on the “area”
of subquarters is stored in the corresponding table. Transport stops are described
by means of their type (U-Bahn, S-Bahn or Railway station), while flats are
described by means of their “monthly rent per square meter”, “floor space in
square meters” and “year of construction”. The monthly rent per square meter
(target attribute) has been discretized into the two intervals low = [2.0, 14.0] or
high =]14.0, 35.0]. The “close to” relation between subquarters areas and the “inside” relation between public train stops and metropolitan subquarters are materialized into relational tables (ward close to ward and apartment inside district).
Similarly, the “cross” relation between districts and public train stops is materialized into the relational table district crossedby tranStop.
Mr-EP discovered 31 (31) EPs to discriminate apartment with high (low) rent
rate per square meters from the class of apartments with low (high) rent rate per
square meters. An example of EP extracted for the class rate per squaremeters
=high is:
apartment(A) ∧ apartment inside district(A, B) ∧
district close to district(B, C) ∧ district ext 19 69(B, [0.875..1.0])
216
A. Appice et al.
This pattern has a support of 0.125 and a growth rate of 1.723. It represents the
event that an apartment A is inside a district B which contains a high percentage
(between 87.5% and 100%) of apartments with a relatively low extension (between 19 m2 and 69 m2 ). This pattern distriminates apartments with high rate
per square meters form the others. It can be motivated by considering that the
rent rate is not directly proportional to the apartment extension but it includes
fixed expenses that do not vary with the apartment size.
For the class rate per squaremeters=low the following EP was discovered:
apartment(A) ∧ apartment inside district(A, B) ∧
district crossedby tranStop(B, C) ∧ apartment year(A, [1893..1899])
This pattern has a support of 0.265 and a growth rate of 2.343. It represents the
event that an apartment A built between 1893 and 1899 is inside a district B
that contains a railway public stop. This pattern discriminates apartments with
low rate per square meters from the others. It can be motivated by considering
that old buildings do not offer the same facilities of a recently built apartment.
6
Conclusions
In this paper, we presented a novel MRDM method, called Mr-EP, which discovers a characterization of classes in terms of relational EPs thus providing a
human-interpretable description of the differences between separate classes. The
method was implemented in a MRDM system which is tightly integrated with
a relational DBMS. The tight-coupling with the database makes the knowledge
on data structure available free of charge to guide the search in the relational
pattern space. Experimental results have been obtained by running Mr-EP to
capture data changes among several populations of geo-referenced data. As future work, we plan to exploit relational EPs for associative classification tasks
and to compare results with those already reported in a previous study [2].
Acknowledgment
This work is partial fulfillment of the research objective of ATENEO-2007 project
“Metodi di scoperta della conoscenza nelle basi di dati: evoluzioni rispetto allo
schema unimodale”. The authors thank Nicola Barile for his help in reading a
first draft of this paper.
References
1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of
items in large databases. In: Buneman, P., Jajodia, S. (eds.) International Conference on Management of Data, pp. 207–216 (1993)
2. Ceci, M., Appice, A.: Spatial associative classification: propositional vs structural
approach. Journal of Intelligent Information Systems 27(3), 191–213 (2006)
3. Dehaspe, L., Toivonen, H.: Discovery of frequent datalog patterns. Journal of Data
Mining and Knowledge Discovery 3(1), 7–36 (1999)
Discovering Relational Emerging Patterns
217
4. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and
differences. In: International Conference on Knowledge Discovery and Data Mining,
pp. 43–52. ACM Press, New York (1999)
5. Dong, G., Zhang, X., Wong, L., Li, J.: CAEP: Classification by aggregating
emerging patterns. In: Arikawa, S., Furukawa, K. (eds.) DS 1999. LNCS (LNAI),
vol. 1721, pp. 30–42. Springer, Heidelberg (1999)
6. Džeroski, S., Lavrač, N.: Relational Data Mining. Springer, Heidelberg (2001)
7. Fan, H., Ramamohanarao, K.: An efficient singlescan algorithm for mining essential
jumping emerging patterns for classification. In: Chen, M.-S., Yu, P.S., Liu, B.
(eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 456–462. Springer, Heidelberg
(2002)
8. Fan, H., Ramamohanarao, K.: A bayesian approach to use emerging patterns for
classification. In: Australasian Database Conference, vol. 143, pp. 39–48. Australian
Computer Society, Inc. (2003)
9. Fan, H., Ramamohanarao, K.: A weighting scheme based on emerging patterns for
weighted support vector machines. In: Hu, X., Liu, Q., Skowron, A., Lin, T.Y.,
Yager, R.R., Zhang, B. (eds.) IEEE International Conference on Granular Computing, pp. 435–440. IEEE Computer Society Press, Los Alamitos (2005)
10. Li, J.: Mining Emerging Patterns to Construct Accurate and Efficient Classifiers.
PhD thesis, University of Melbourne (2001)
11. Li, J., Dong, G., Ramamohanarao, K., Wong, L.: DeEPs: A new instance-based
lazy discovery and classification system. Machine Learning 54(2), 99–124 (2004)
12. Li, J., Liu, H., Downing, J.: Simple rules underlying gene expression profiles of
more than six subtypes of acute lymphoblastic leukemia. Bioinformatics 19(1),
71–78 (2003)
13. Li, J., Liu, H., Ng, S.-K., Wong, L.: Discovery of significant rules for classifying
cancer diagnosis data. In: European Conference on Computational Biology, Supplement of Bioinformatics, pp. 93–102 (2003)
14. Lisi, F.A., Malerba, D.: Inducing multi-level association rules from multiple relations. Machine Learning 55, 175–210 (2004)
15. Liu, B., Hsu, W., Ma, Y.: Integrative classification and association rule mining. In:
Proceedings of AAAI Conference of Knowledge Discovery in Databases (1998)
16. Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge
discovery. Data Mining and Knowledge Discovery 1(3), 241–258 (1997)
17. Plotkin, G.D.: A note on inductive generalization 5, 153–163 (1970)
18. Zhang, X., Dong, G., Ramamohanarao, K.: Exploring constraints to efficiently
mine emerging patterns from large high-dimensional datasets. In: Terano, T., Chen,
A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 310–314. Springer, Heidelberg
(2000)