Proceedings of
The Workshop on Mining
Complex Patterns
Editors
Annalisa Appice (University of Bari, Italy)
Michelangelo Ceci (University of Bari, Italy)
Corrado Loglisci (University of Bari, Italy)
Giuseppe Manco (ICAR-CNR, Italy)
Preface
The International Workshop on Mining Complex Patterns (MCP 2011) was held in
Mondello (Palermo), Italy, on September 17th 2011 in conjunction with AI*IA 2011: the
12th International Conference of the Italian Association for Artificial Intelligence (AI*IA
2011).
During the last two decades, studies in Machine Learning have paved the way to the
definition of efficient and stable data mining and knowledge discovery algorithms. Data
mining and knowledge discovery can today be considered stable fields, with numerous
efficient algorithms and studies proposed to extract knowledge in different forms from
data. Although most existing data mining approaches look for patterns in tabular data
(typically obtained from relational databases), algorithmic extensions have recently been
investigated for massive datasets representing complex interactions between several
entities from a variety of sources. These interactions may span multiple levels of
granularity as well as the spatial and temporal dimensions.
Our purpose in this workshop was to bring together researchers and practitioners of data
mining interested in methods and applications where complex patterns in expressive
languages are extracted from text/hypertext data, networks and graphs, event or log data,
biological sequences, spatio-temporal data, sensor data and streams, and so on.
Twelve contributions were originally submitted, of which seven were accepted for oral
presentation. Each submission was evaluated by three independent referees. Besides
paper presentations, the scientific programme also featured an invited talk by Sašo
Džeroski (Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana,
Slovenia).
We would like to thank the invited speaker, all the authors who submitted papers and all
the workshop participants. We are also grateful to the members of the program committee
for their thorough work in reviewing the submitted contributions with expertise and patience,
and to the members of AI*IA.
Mondello, September 2011
Annalisa Appice
Michelangelo Ceci
Corrado Loglisci
Giuseppe Manco
Organization
Workshop Chairs
Annalisa Appice
Università degli Studi di Bari Aldo Moro
webpage: http://www.di.uniba.it/~appice/
mailto: [email protected]
Michelangelo Ceci
Università degli Studi di Bari Aldo Moro
webpage: http://www.di.uniba.it/~ceci/
mailto: [email protected]
Corrado Loglisci
Università degli Studi di Bari Aldo Moro
webpage: http://www.di.uniba.it/~loglisci/
mailto: [email protected]
Giuseppe Manco
Institute for High Performance Computing and Networks, Italian National Research
Council, Rende (CS)
webpage: http://www.icar.cnr.it/manco
mailto: [email protected]
Program Committee
Fabrizio Angiulli (Università della Calabria)
Tania Cerquitelli (Politecnico di Torino)
Sašo Džeroski (Jožef Stefan Institute)
Nicola Fanizzi (Università degli Studi di Bari "Aldo Moro")
Stefano Ferilli (Università degli Studi di Bari "Aldo Moro")
Joao Gama (University of Porto)
Elio Masciari (Institute for High Performance Computing and Networks, Italian National
Research Council, Rende (CS))
Rosa Meo (Università degli Studi di Torino)
Andrea Passerini (Università di Trento)
Zbigniew W. Ras (University of North Carolina and Warsaw University of Technology)
Chiara Renso (KDD Lab, Pisa)
Fabrizio Riguzzi (Università di Ferrara)
Alessandro Sperduti (Università degli studi di Padova)
Franco Turini (Università di Pisa)
Alfonso Urso (Institute for High Performance Computing and Networks, Italian National
Research Council, Palermo)
Table of Contents
An Ontology of Data Mining ..................................................................... 1
  Sašo Džeroski
Cooperating Techniques for Extracting Conceptual Taxonomies from Text............. 2
  Stefano Ferilli, Fabio Leuzzi and Fulvio Rotella
PatTexSum: A Pattern-based Text Summarizer......................................... 14
  Elena Baralis, Luca Cagliero, Alessandro Fiori and Saima Jabeen
An Expectation Maximization Algorithm for Probabilistic Logic Programs............. 26
  Elena Bellodi and Fabrizio Riguzzi
Clustering XML Documents by Structure: a Hierarchical Approach......................... 38
  Gianni Costa, Giuseppe Manco, Riccardo Ortale, Ettore Ritacco
Outlier Detection For XML Documents..................................................................... 46
  Giuseppe Manco and Elio Masciari
P2P support for OWL-S discovery............................................................. 54
  Domenico Redavid, Stefano Ferilli and Floriana Esposito
Marine Traffic Engineering through Relational Data Mining.................................... 66
  Antonio Bruno and Annalisa Appice
An Ontology of Data Mining
Sašo Džeroski
Jožef Stefan Institute, Department of Knowledge Technologies,
Ljubljana, Slovenia
[email protected]
Abstract. We have developed OntoDM [1, 2, 3], an ontology of the scientific domain of data mining, aimed at describing data mining investigations. It represents entities such as data, data mining tasks and algorithms, and generalizations (output by the latter). In contrast to other
ontologies of data mining, OntoDM is a deep ontology, general purpose
rather than tailor made, and compliant with best practices in ontology
engineering.
OntoDM allows us to cover a large part of the diversity in data mining research, including recently developed approaches to mining structured data and constraint-based data mining. The talk will describe the
OntoDM ontology and how standard and more recent data mining approaches are represented within it. Two use cases will be described, one
of which concerns QSAR modeling in drug design investigations.
References
1. P. Panov, S. Džeroski, and L. Soldatova. OntoDM: An ontology of data mining. In
Proceedings of the 2008 IEEE International Conference on Data Mining Workshops,
pages 752–760, Washington, DC, USA, 2008. IEEE Computer Society.
2. P. Panov, L. Soldatova, and S. Džeroski. Representing entities in the OntoDM data
mining ontology. In S. Džeroski, B. Goethals, and P. Panov, editors, Inductive
Databases and Constraint-Based Data Mining, pages 27–55. Springer, 2010.
3. P. Panov, L. N. Soldatova, and S. Džeroski. Towards an ontology of data mining
investigations. In J. Gama, V. S. Costa, A. M. Jorge, and P. Brazdil, editors,
Discovery Science, volume 5808 of Lecture Notes in Computer Science, pages 257–
271. Springer, 2009.
Cooperating Techniques for Extracting
Conceptual Taxonomies from Text
S. Ferilli (1,2), F. Leuzzi (1), and F. Rotella (1)
(1) Dipartimento di Informatica – Università di Bari
[email protected]
{fabio.leuzzi, rotella.fulvio}@gmail.com
(2) Centro Interdipartimentale per la Logica e sue Applicazioni – Università di Bari
Abstract. The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, it is important to
have conceptual taxonomies that express common sense and implicit
relationships among concepts. This work proposes a mix of several techniques that are brought to cooperation for learning such taxonomies automatically.
Although the work is at a preliminary stage, interesting initial results
suggest continuing to extend and improve the approach.
1 Introduction
The spread of electronic documents and document repositories has generated the
need for automatic techniques to understand and handle document content,
in order to help users satisfy their information needs without being overwhelmed by the huge amount of available data. Since most of these data are in
textual form, and since text explicitly refers to concepts, most work has focussed
on Natural Language Processing (NLP). Automatically obtaining full text understanding is not trivial, due to the intrinsic ambiguity of natural language
and to the huge amount of common sense and linguistic/conceptual
background knowledge required. Nevertheless, even small portions of such knowledge
may significantly improve understanding performance, at least in limited domains. Although standard tools, techniques and representation formalisms are
still missing, lexical and/or conceptual taxonomies can provide useful support
to many NLP tasks. Unfortunately, manually building this kind of resource is
very costly and error-prone, which is a strong motivation towards the automatic
construction of conceptual networks by mining large amounts of documents in
natural language. This work aims at partially simulating some human abilities
in this field, such as extracting the concepts expressed in given texts and assessing their relevance; obtaining a practical description of the concepts underlying
the terms, which in turn would allow generalizing concepts having similar descriptions; and applying some kind of reasoning ‘by association’, which looks for
possible indirect connections between two identified concepts. The system takes
as input texts in natural language, and processes them to build a conceptual
network that supports the above objectives. The resulting network can be visualized by means of a suitable interface and translated into a First-Order Logic
(FOL) formalism, to allow the subsequent exploitation of logic inference engines
in applications that use that knowledge.
Our proposal consists of a mix of existing tools and techniques that are
brought to cooperation in order to reach the above objectives, extended and
supported by novel techniques where needed. The next section briefly recalls
related work. Then, Section 3 describes the mixed approach and discusses the
novel parts in more detail. A preliminary evaluation of the proposal is reported
in Section 4, while Section 5 concludes the paper and outlines future work.
2 Related Work
Many works exist aimed at building taxonomies and ontologies from text. A
few examples: [10, 9] build ontologies from natural language text by labelling
only the taxonomical relations, while we also label non-taxonomical ones with
actions (verbs); [14] builds a taxonomy considering only concepts that are present
in a domain but do not appear in others, while we are interested in all recognized
concepts independently of their being generic or domain-specific; [13] defines a
language to build formal ontologies, while we are interested in the lexical level.
As regards our proposal, a first functionality that we needed is syntactic
analysis of the input text. We exploited the Stanford Parser and Stanford Dependencies [7, 1], two very effective tools that can identify the most likely syntactic structure of sentences (including active and passive forms) and label their
components as ‘subject’ or ‘(direct/indirect) object’. Moreover, they normalize
the words in the input text using lemmatization instead of stemming, which
makes it possible to distinguish their grammatical role and is easier for humans
to read. We also exploited the Weka project [5], which provides a set of tools to
carry out several learning and Data Mining (DM) tasks, including clustering,
classification, regression, discovery of association rules, and visualization.
Another technique that inspired our work is the one described in [8] to semi-automatically extract a domain-specific ontology from free text, without using
external resources but focussing the analysis on Hub Words (i.e., words having
high frequency). After building the ontology, each Hub Word t is
ranked according to its ‘Hub Weight’:

$W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i)$

where $w_0$ is a given initial weight, n is the number of relationships in which t is
involved, $w(t_i)$ is the tf·idf weight of the i-th word related to t, and $\alpha + \beta + \gamma = 1$.
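As a concrete illustration, the Hub Weight above could be computed along the following lines; the function and the example values are hypothetical, introduced only to show how the three components are combined:

```python
def hub_weight(w0, related_tfidf, alpha=0.4, beta=0.3, gamma=0.3):
    """Hub Weight W(t) = alpha*w0 + beta*n + gamma*sum of the tf-idf weights
    of the n words related to t (alpha + beta + gamma = 1)."""
    n = len(related_tfidf)                  # number of relationships of the term
    return alpha * w0 + beta * n + gamma * sum(related_tfidf)

# Example: initial weight 0.5 and three related words with given tf-idf weights
print(hub_weight(0.5, [0.12, 0.30, 0.08]))
```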
A task aimed at identifying the most important words in a text, to be used
as main concepts for inclusion in the taxonomy, is Keyword Extraction (KE).
Among the several proposals available in the literature, we selected two techniques that can work on single documents (rather than requiring a whole corpus)
and are based on different and complementary approaches, so that they can together provide an added value. The quantitative approach in [12] is based on
the assumption that the relevance of a term in a document is proportional to
how frequently it co-occurs with a subset of most frequent terms in that document. The χ2 statistic is exploited to check whether the co-occurrences establish a significant deviation from chance. To improve orthogonality, the reference
frequent terms are preliminarily grouped exploiting similarity-based clustering
(using similar distribution of co-occurrence with other terms) and pairwise clustering (based on frequent co-occurrences). The qualitative approach in [3], based
on WordNet [2] and its extension WordNet Domains [11], focusses on the meaning of terms instead of their frequency and determines keywords as terms associated to the concepts referring to the main subject domain discussed in the
text. It exploits a density measure that determines how much a term is related
to different concepts (in case of polysemy), how much a concept is associated to
a given domain, and how relevant a domain is for a text.
Lastly, in some steps of our technique we need to assess the similarity among
concepts in a given conceptual taxonomy. A classical, general measure is the
Hamming distance [6], which works on pairs of equal-length vectorial descriptions
and counts the minimum number of changes required to turn one into the other.
Other measures, specific to conceptual taxonomies, are sf_Fa [4] (which adopts
a global approach based on the whole set of hypernyms) and sf_WP [16] (which
focuses on a particular path between the nodes to be compared).
3 Proposed Approach
In the following, we will assume that each term in the text corresponds to an
underlying concept (phrases can be preliminarily extracted using suitable techniques, and handled as single terms). A concept is described by a set of characterizing attributes and/or by the concepts that interact with it in the world
described by the corpus. The outcome is a graph, where nodes are the concepts recognized in the text, and edges represent the relationships among these
nodes, expressed by verbs in the text (whose direction denotes their role in the
relationship). This can be interpreted as a semantic network.
3.1 Identification of Relevant Concepts
The input text is preliminarily processed by the Stanford Parser in order to
extract the syntactic structure of the sentences that make it up. In particular,
we are interested only in (active or passive) sentences of the form subject-verb-(direct/indirect) complement, from which we extract the corresponding triples
⟨subject, verb, complement⟩ that will provide the concepts (the subjects and
complements) and attributes (verbs) for the taxonomy. Indirect complements are
treated as direct ones, by embedding the corresponding preposition into the verb:
e.g., to put, to put on and to put across are considered as three different verbs,
and the sentence “John puts on a hat” returns the triple ⟨John, put on, hat⟩, in which
John and hat are concepts associated to the attribute put on, indicating that John
can put on something, while a hat can be put on. Triples/sentences involving
the verb ‘to be’ or nouns with adjectives provide immediate hints to build the subclass structure in the taxonomy: for instance, “The dog is a domestic animal...”
yields the relationships is_a(dog, animal) and is_a(domestic animal, animal).
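A minimal sketch of how such triples could be turned into the concept graph and the subclass hints just described; the triples and data structures are illustrative assumptions, not the system's actual implementation:

```python
from collections import defaultdict

# hypothetical triples already extracted from the parser: (subject, verb, complement)
triples = [("John", "put on", "hat"), ("dog", "be", "domestic animal")]

graph = defaultdict(list)   # concept -> list of (verb, concept) edges
is_a = set()                # taxonomic (subclass) relationships

for subj, verb, compl in triples:
    if verb == "be":                        # 'to be' provides subclass hints
        head = compl.split()[-1]            # noun without its adjectives: "animal"
        is_a.add((subj, head))              # is_a(dog, animal)
        if head != compl:
            is_a.add((compl, head))         # is_a(domestic animal, animal)
    else:
        graph[subj].append((verb, compl))   # verbal relationship edge
```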
The whole set of triples is represented in a Concepts×Attributes matrix V
that recalls the classical Terms×Documents Vector Space Model (VSM) [15].
The matrix is filled according to the following scheme (resembling tf·idf):

$V_{i,j} = \frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{j : c_i \in a_j\}|}$

where:
– $f_{i,j}$ is the frequency of the i-th concept co-occurring with the j-th attribute;
– $\sum_k f_{k,j}$ is the sum of the co-occurrences of all concepts with the j-th attribute;
– A is the entire set of attributes;
– $|\{j : c_i \in a_j\}|$ is the number of attributes with which the concept $c_i$ co-occurs (i.e., for which $f_{i,j} \neq 0$).
Its values represent the term frequency tf, as an indicator of the relevance of the
term in the text at hand (no idf is considered, to allow the incremental addition
of new texts without the need of recomputing this statistic).
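Under the assumption that the co-occurrence counts between concepts and verbal attributes have already been collected from the triples, the matrix V could be filled as in the following sketch (names and toy counts are illustrative):

```python
import math
from collections import defaultdict

# hypothetical counts f[(concept, attribute)] collected from the extracted triples
f = defaultdict(int, {("John", "put on"): 2, ("hat", "put on"): 1, ("John", "wear"): 1})

concepts = {c for c, _ in f}
attributes = {a for _, a in f}

col_sum = {a: sum(f[(c, a)] for c in concepts) for a in attributes}           # sum_k f_kj
n_attrs = {c: sum(1 for a in attributes if f[(c, a)] > 0) for c in concepts}  # |{j : c_i in a_j}|

V = {(c, a): (f[(c, a)] / col_sum[a]) * math.log(len(attributes) / n_attrs[c])
     for c in concepts for a in attributes if f[(c, a)] > 0}
```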
A clustering step (typical in Text Mining) can be performed on V to identify
groups of elements having similar features (i.e., involved in the same verbal
relationships). The underlying idea is that concepts belonging to the same cluster
should share some semantics. For instance, if concepts dog, John, bear, meal,
cow all share attributes eat, sleep, drink, run, they might be sufficiently close to
each other to fall in the same cluster, indicating a possible underlying semantics
(indeed, they are all animals). Since the number of clusters to be found is not
known in advance, we exploit the EM clustering approach provided by Weka,
based on the Euclidean distance applied to the row vectors representing concepts in V.
Then, the application to the input texts of various Keyword Extraction techniques, based on different (and complementary) aspects, perspectives and theoretical principles, makes it possible to identify relevant concepts. We use the quantitative
approach based on co-occurrences k_c [12], the qualitative one based on WordNet
k_w [3] and a psychological one based on word positions k_p. The psychological
approach is novel, and is based on the consideration that humans tend to place
relevant terms/concepts toward the start and end of sentences and discourses,
where the attention of the reader/listener is higher. In our approach, the chance
of a term being a keyword is assigned simply according to its position in the
sentence/discourse, based on a mixture model obtained by mixing two
Gaussian curves whose peaks are placed around the extremes of the portion of
text to be examined.
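A possible reading of this positional score is sketched below; the equal mixture weights, peak positions and standard deviation are illustrative choices, since the paper does not fix them:

```python
import math

def positional_score(position, length, sigma=0.15):
    """Keyword score k_p for a term at `position` (0-based) in a span of
    `length` tokens: mixture of two Gaussians peaked at the extremes."""
    x = position / max(length - 1, 1)           # normalize the position to [0, 1]
    gauss = lambda mu: math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return 0.5 * gauss(0.0) + 0.5 * gauss(1.0)  # peaks near start and end

# terms near the start or end of a 20-token sentence score higher
print(positional_score(0, 20), positional_score(10, 20), positional_score(19, 20))
```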
The information about concepts and attributes is exploited to compute a
Relevance Weight W (·) for each node in the network. Then, nodes are ranked
by decreasing Relevance Weight, and a suitable cutpoint in the ranking is determined to distinguish relevant concepts from irrelevant ones. We cut the list at
the first item $c_k$ in the ranking such that:

$W(c_k) - W(c_{k+1}) \geq p \cdot \max_{i=0,\dots,n-1} \left( W(c_i) - W(c_{i+1}) \right)$

i.e., the difference in relevance weight from the next item is greater than or equal
to the maximum difference between all pairs of adjacent items, smoothed by
a user-defined parameter $p \in [0, 1]$.
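The cutpoint rule can be sketched as follows, assuming the Relevance Weights are already sorted in decreasing order (illustrative code, not the authors' implementation):

```python
def cut_ranking(weights, p=1.0):
    """Keep the prefix of a decreasing ranking up to the first item whose gap
    to the next one is >= p times the maximum gap between adjacent items."""
    gaps = [weights[i] - weights[i + 1] for i in range(len(weights) - 1)]
    if not gaps:
        return weights
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return weights[:k + 1]      # items up to and including c_k
    return weights

print(cut_ranking([0.9, 0.85, 0.4, 0.35, 0.1], p=1.0))   # -> [0.9, 0.85]
```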
Computation of Relevance Weight. Identifying key concepts in a text is
more complex than just identifying keywords. Inspired by the Hub Words approach, we compute for each extracted concept a Relevance Weight expressing
its importance in the extracted network, by combining different values associated
with different perspectives: given a node/concept c,

$W(c) = \alpha \frac{w(c)}{\max_{\bar{c}} w(\bar{c})} + \beta \frac{e(c)}{\max_{\bar{c}} e(\bar{c})} + \gamma \frac{\sum_{(c,\bar{c})} w(\bar{c})}{e(c)} + \delta \frac{d_M - d(c)}{d_M} + \epsilon \frac{k(c)}{\max_{\bar{c}} k(\bar{c})}$

where α, β, γ, δ, ε are weights summing up to 1, and:
– w(c) is an initial weight assigned to node c;
– e(c) is the number of edges of any kind involving node c;
– $(c, \bar{c})$ denotes an edge involving node c;
– $d_M$ is the largest distance between any two nodes in the whole vector space;
– d(c) is the distance of node c from the center of the corresponding cluster;
– k(c) is the keyword weight associated with node c.
The first term represents the initial weight provided by V, normalized by the
maximum initial weight among all nodes. The second term considers the number
of connections (edges) of any category (verbal or taxonomic relationships) in
which c is involved, normalized by the maximum number of connections of any
node in the network. The third term (Neighborhood Weight Summary) considers
the average initial weight of all neighbors of c (just summing up the weights the
final value would be proportional to the number of neighbors, that is already
considered in the previous term). The fourth term represents the Closeness to
Center of the cluster, i.e. the distance of c from the center of its cluster, normalized by the maximum distance between any two instances in the whole vector
space. The last term takes into account the outcome of the three KE techniques
on the given text, suitably weighted:
$k(c) = \zeta\, k_c(c) + \eta\, k_w(c) + \theta\, k_p(c)$
where ζ, η and θ are weights ranging in [0, 1] and summing up to 1. These terms
were designed to be independent of each other. A partial interaction is present
only between the second and the third ones, but is significantly smoothed due
to the applied normalizations.
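The combination of the five components could look as follows; the per-node statistics are assumed to be precomputed and all names are illustrative:

```python
def relevance_weight(node, stats, d_max,
                     alpha=0.1, beta=0.1, gamma=0.3, delta=0.25, epsilon=0.25):
    """Weighted combination of the five perspectives; the coefficients sum to 1.
    stats[c]: initial weight 'w', edge count 'e', neighbor weights 'neigh_w',
    distance from the cluster center 'd', combined keyword score 'k'."""
    s = stats[node]
    max_w = max(v["w"] for v in stats.values())
    max_e = max(v["e"] for v in stats.values())
    max_k = max(v["k"] for v in stats.values())
    return (alpha * s["w"] / max_w
            + beta * s["e"] / max_e
            + gamma * sum(s["neigh_w"]) / s["e"]        # neighborhood weight summary
            + delta * (d_max - s["d"]) / d_max          # closeness to cluster center
            + epsilon * s["k"] / max_k)                 # combined keyword score

stats = {"network": {"w": 0.9, "e": 4, "neigh_w": [0.2, 0.1, 0.4, 0.3], "d": 0.2, "k": 0.8},
         "access":  {"w": 0.1, "e": 2, "neigh_w": [0.9, 0.3],           "d": 0.5, "k": 0.2}}
print(relevance_weight("network", stats, d_max=1.0))
```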
3.2 Generalization of Similar Concepts
To generalize two or more concepts (G generalizes A if anything that can be
labeled as A can be labeled as G as well, but not vice-versa), we propose to
exploit WordNet and use the set of connections of each concept with its direct
neighbors as a description of the underlying concept. Three steps are involved
in this procedure:
1. Grouping similar concepts, in which all concepts are grossly partitioned to
obtain subsets of similar concepts;
2. Word Sense Disambiguation, that associates a single synset to each term by
solving possible ambiguities using the domain of discourse (Algorithm 1);
3. Computation of taxonomic similarity, in which WordNet is exploited to confirm the validity of the groups found in step 1 (Algorithm 2).
As to step 1, we build a Concepts×Concepts matrix C where $C_{i,j} = 1$ if
there is at least one relationship between concepts i and j, and $C_{i,j} = 0$ otherwise.
Each row in C can be interpreted as a description of the associated concept
in terms of its relationships to other concepts, and exploited for applying a
pairwise clustering procedure based on Hamming distance. In detail, for each
possible pair of different row and column items whose corresponding row and
column are not null and whose similarity passes a given threshold: if neither
is in a cluster yet, a new cluster containing those objects is created; otherwise,
if either item is already in a cluster, the other is added to the same cluster;
otherwise (both already belong to different clusters) their clusters are merged.
Items whose similarity with all other items does not pass the threshold result in
singleton clusters.
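A simplified rendering of this pairwise agglomeration over the binary concept descriptions is sketched below; the merge policy follows the description above, but the code is only illustrative:

```python
def hamming(u, v):
    """Normalized Hamming distance between two equal-length binary vectors."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def pairwise_clustering(rows, threshold):
    """rows: {concept: binary tuple over all concepts}. Concepts whose distance
    stays within the threshold are agglomerated; the others remain singletons."""
    cluster_of, clusters, next_id = {}, {}, 0
    names = [c for c, r in rows.items() if any(r)]       # skip null descriptions
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(rows[a], rows[b]) > threshold:
                continue
            ca, cb = cluster_of.get(a), cluster_of.get(b)
            if ca is None and cb is None:                # create a new cluster
                clusters[next_id] = {a, b}
                cluster_of[a] = cluster_of[b] = next_id
                next_id += 1
            elif cb is None:                             # add b to a's cluster
                clusters[ca].add(b); cluster_of[b] = ca
            elif ca is None:                             # add a to b's cluster
                clusters[cb].add(a); cluster_of[a] = cb
            elif ca != cb:                               # merge the two clusters
                clusters[ca] |= clusters[cb]
                for c in clusters.pop(cb):
                    cluster_of[c] = ca
    singletons = [{c} for c in rows if c not in cluster_of]
    return list(clusters.values()) + singletons
```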
This clustering procedure alone might not be reliable, because terms that
occur seldom in the corpus have few connections (which would affect their cluster
assignment due to underspecification) and because the expressive power of this
formalism is too low to represent complex contexts (which would affect even
more important concepts). For this reason, the support of an external resource
might be desirable. We consider WordNet as a sensible candidate for this, and
try to map each concept in the network to the corresponding synset (a non-trivial
problem due to the typical polysemy of many words), using the one domain per
discourse assumption as a simple criterion for Word Sense Disambiguation: the
meanings of close words in a text tend to refer to the same domain, and such a
domain is probably the dominant one among the words in that portion of text.
Thus, WordNet allows us to check and confirm/reject the similarity of concepts
belonging to the same cluster, by considering all possible pairs of words whose
similarity is above a given threshold. The pair (say {A, B}) with largest similarity
value is generalized with their most specific common subsumer (hypernym) G in
WordNet; then the other pairs in the same cluster that share at least one of the
currently generalized terms, and whose least common hypernym is again G, are
progressively added to the generalization. Similarity is determined using a mix
of the measures proposed in [4] and in [16], to consider both the global similarity
and the actual viability of the specific candidate generalization:

$sf(A, B) = sf_{Fa}(A, B) \cdot sf_{WP}(A, B)$

Algorithm 1 Find “best synset” for a word
Input: word t, list of domains with weights.
Output: best synset for word t.
  best_synset ← empty
  best_domain ← empty
  for all synset(s_t) do
    max_weight ← −∞
    optimal_domain ← empty
    for all domains(d_s) do
      if weight(d_s) > max_weight then
        max_weight ← weight(d_s)
        optimal_domain ← d_s
      end if
    end for
    if max_weight > weight(best_domain) then
      best_synset ← s_t
      best_domain ← optimal_domain
    end if
  end for
3.3 Reasoning ‘by association’
Reasoning ‘by association’ means finding a path of pairwise related concepts that
establishes an indirect interaction between two concepts c′ and c′′ in the semantic
network. We propose to look for such a path using a Breadth-First Search (BFS)
technique, applied to both concepts under consideration. The expansion steps of
the two processes are interleaved, checking at each step whether the new set of
concepts just introduced has a non-empty intersection with the set of concepts
of the other process. When this happens, all the concepts in such an intersection
identify one or more shortest paths connecting c′ and c′′ , that can be retrieved
by tracing back the parent nodes at each level in both directions up to the roots
c′ and c′′ . Since this path is made up of concepts only, to obtain a more sensible
‘reasoning’ it must be filled with the specific kind of interaction represented by
the labels of edges (verbs) that connect adjacent concepts in the chain.
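A sketch of the interleaved breadth-first expansion on a plain adjacency-list graph is given below; edge labels (verbs) are omitted and the graph structure is an assumption, so this is not the system's actual code:

```python
from collections import deque

def associate(graph, start, goal):
    """Interleave two BFS frontiers from `start` and `goal` until they meet and
    return one shortest connecting path (or None). graph: node -> neighbor list
    (both edge directions should be listed)."""
    if start == goal:
        return [start]
    parents = {start: {start: None}, goal: {goal: None}}
    frontiers = {start: deque([start]), goal: deque([goal])}
    while any(frontiers[s] for s in (start, goal)):
        for side in (start, goal):                    # alternate the two searches
            other = goal if side == start else start
            for _ in range(len(frontiers[side])):     # expand one level
                node = frontiers[side].popleft()
                for nxt in graph.get(node, ()):
                    if nxt in parents[side]:
                        continue
                    parents[side][nxt] = node
                    frontiers[side].append(nxt)
                    if nxt in parents[other]:         # the two frontiers intersect
                        return _trace(parents, start, goal, nxt)
    return None

def _trace(parents, start, goal, meet):
    left, n = [], meet
    while n is not None:
        left.append(n); n = parents[start][n]
    right, n = [], parents[goal][meet]
    while n is not None:
        right.append(n); n = parents[goal][n]
    return left[::-1] + right                         # path from start to goal
```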
4 Evaluation
The proposed approach was evaluated using ad-hoc tests that may indicate its
strengths and weaknesses. Due to lack of space, only a few selected outcomes
will be reported here. Although preliminary, these results seem enough to suggest
that the approach is promising. The following default weights for the Relevance
Weight components were empirically adopted:
– α = 0.1 to increase the impact of the most frequent concepts (according to tf);
– β = 0.1 to keep low the impact of co-occurrences between nodes;
– γ = 0.3 to increase the impact of less frequent nodes if they are linked to relevant nodes;
– δ = 0.25 to increase the impact of the clustering outcome;
– ε = 0.25, as for δ, to increase the impact of keywords;
while those for the KE techniques were taken as ζ = 0.45, η = 0.45 and θ = 0.1 (to reduce the impact of the psychological perspective, which is more naive compared to the others).

Algorithm 2 Effective generalization search.
Input: the set C of clusters returned by pairwise clustering; similarity threshold T.
Output: set of candidate generalizations.
  generalizations ← empty set
  for all c ∈ C do
    good_pairs ← empty set
    for all pair(O_i, O_j) | i, j ∈ c do
      if similarity_score(pair(O_i, O_j)) > T then
        good_pairs.add(pair(O_i, O_j), wordnet_hypernym(pair(O_i, O_j)))
      end if
      if good_pairs ≠ empty set then
        new_set ← {good_pairs.getBestPair, good_pairs.getSimilarPairs}
        generalizations.add(new_set)
      end if
    end for
  end for
where:
good_pairs → all pairs that passed T, with the most specific common hypernym discovered in WordNet;
good_pairs.getBestPair → the pair that has the best similarity score;
good_pairs.getSimilarPairs → the pairs that involve one of the two objects of the best pair, that satisfied the similarity threshold and have the same hypernym as the best pair;
wordnet_hypernym → the most specific common hypernym discovered in WordNet for the two passed objects.
4.1 Recognition of relevant concepts
We exploited a dataset made up of documents concerning social networks on
socio-political and economic subjects. Table 1 shows at the top the settings used
for three different runs, concerning the Relevance Weight components

W = A + B + C + D + E

and the cutpoint value for selecting relevant concepts. The corresponding outcomes (at the bottom) show that the default set of parameter values yields 3
relevant concepts with very close weights. Component D determines the inclusion of the very infrequent concepts (see column A) access and subset (0.001
and 6.32E-4, respectively) as relevant ones. They benefit from the large initial
weight of network, to which they are connected.

Table 1. Three parameter choices and corresponding outcome of relevant concepts.

Test #  α     β     γ     δ     ε     p
1       0.10  0.10  0.30  0.25  0.25  1.0
2       0.20  0.15  0.15  0.25  0.25  0.7
3       0.15  0.25  0.30  0.15  0.15  1.0

Test #  Concept     A        B      C       D      E      W
1       network     0.100    0.100  0.021   0.178  0.250  0.649
1       access      0.001    0.001  0.154   0.239  0.250  0.646
1       subset      6.32E-4  0.001  0.150   0.239  0.250  0.641
2       network     0.200    0.150  0.0105  0.178  0.250  0.789
3       network     0.150    0.25   0.021   0.146  0.150  0.717
3       user        0.127    0.195  0.022   0.146  0.150  0.641
3       number      0.113    0.187  0.022   0.146  0.150  0.619
3       individual  0.103    0.174  0.020   0.146  0.150  0.594

Table 2. Pairwise clustering statistics.

Dataset  MNC (0.001)  MNC (0.0001)  Vector size
B        3            2             1838
P        3            1             1599
B+P      5            1             3070

Using the second set of parameter values, the predominance of component A in the overall computation and the cutpoint threshold lowered to 70% cause the frequency-based approach
associated with the initial weight to give a neat predominance to the first concept in
the ranking. Using the third set of parameter values, the threshold is again 100%
and the other weights are such that the frequency-based approach expressed by
component A is balanced by the number of links affecting the node and by the
weight of its neighbors. Thus, both nodes with highest frequency and nodes that
are central in the network are considered relevant. Overall, concept network is
always present, while the other concepts significantly vary depending on the
parameter values.
4.2 Concept Generalization
Two toy experiments are reported for concept generalization. The maximum
threshold for the Hamming distance was set to 0.001 and 0.0001, respectively,
while the minimum threshold of taxonomic similarity was fixed at 0.4 in both.
Two datasets on Social networks were exploited: a book (B) and a collection
of scientific papers (P) concerning socio-political and economic discussions. Observing the outcome, three aspects can be emphasized: the level of detail of the
concept descriptions that satisfy the criterion in pairwise clustering, the intuitiveness of the generalizations supported by WordNet Domains, and the values of
the single conceptual similarity measures applied to synsets in WordNet.
In Table 2, MNC is the Max Number of Connections detected among all
concept descriptions that have been agglomerated at least once in the pairwise
clustering. Note that all descriptions which have never been agglomerated are
considered as single instances in separate clusters.

Table 3. Generalizations for different pairwise clustering thresholds (Thr.) and minimum similarity threshold 0.4 (top), and corresponding conceptual similarity scores (bottom).

Thr.    Dataset  Subsumer (Subs. Domain)                         Concepts (Conc. Domain)
0.001   B        parent [110399491] (person)                     adopter [109772448] (factotum), dad [109988063] (person)
0.001   P        human action [100030358] (factotum)             discussion [107138085] (factotum), judgement [100874067] (law)
0.001   B+P      dr. [110020890] (medicine)                      psychiatrist [110488016] (medicine), abortionist [109757175] (medicine), specialist [110632576] (medicine)
0.0001  B        physiological state [114034177] (physiology)    dependence [114062725] (physiology), affliction [114213199] (medicine)
0.0001  P        mental attitude [106193203] (psychology)        marxism [106215618] (politics), standpoint [106210363] (factotum)
0.0001  B+P      feeling [100026192] (psychological features)    dislike [107501545] (psychological features), satisfaction [107531255] (psychological features)

#  Pairs                      Fa score  WP score  Score
1  adopter, dad               0.733     0.857     0.628
2  discussion, judgement      0.731     0.769     0.562
3  psychiatrist, abortionist  0.739     0.889     0.657
3  psychiatrist, specialist   0.728     0.889     0.647
4  dependence, affliction     0.687     0.750     0.516
5  marxism, standpoint        0.661     0.625     0.413
6  dislike, satisfaction      0.678     0.714     0.485

Hence, the concepts recognized as similar have very few neighbors, suggesting that concepts become
ungeneralizable as their number of connections grows. Although in general this
is a limitation, such a cautious behavior is to be preferred until an effective generalization technique is provided that ensures the quality of its outcomes (wrong
generalizations might spoil subsequent results in cascade).
It is worth emphasizing that not only are sensible generalizations returned,
but their domain is also consistent with those of the generalized concepts. This
happens with both thresholds (0.001 and 0.0001), which return 23 and 30 candidate
generalizations, respectively (due to space limitations, Table 3 reports only
a representative sample, including a generalization for each dataset used). Analyzing
the two conceptual similarity measures used for generalization reveals
that, for almost all pairs, both yield very high values, leading to final scores that
neatly pass the 0.4 threshold, and sf_WP is always greater than sf_Fa. Since the
former is more related to a specific path, and hence to the goodness of the chosen
subsumer, this confirms the previous outcomes (suggesting that the chosen subsumer
is close to the generalized concepts). In the sample reported in Table 3,
only case 5 disagrees with these considerations.
4.3 Reasoning by association
Table 4 shows a sample of outcomes of reasoning by association.

Table 4. Examples of reasoning by association (start and target nodes in emphasis).

#  Subject      Verb                               Complement
1  flexibility  convert                            life
1  people       settle, desire, do at, extinguish  life
2  people       use, revolution                    myspace
2  myspace      develop                            headline
3  people       member                             threat
3  people       erode, combine                     technology
3  computer     acknowledge                        technology
4  internet     extend                             neighbor
4  majority     erode                              internet
4  majority     erode, do                          facebook
4  adult        use                                platform
4  facebook     acknowledge                        platform
5  adult        write                              freedom
5  adult        use                                platform
5  technology   acknowledge                        platform
5  internet     acknowledge                        technology

E.g., case 5 explains the relationship between freedom and internet as follows: the adult
writes about freedom and uses a platform, which is recognized as a technology, as
well as the internet.
5 Conclusions
This work proposed an approach to automatic conceptual taxonomy extraction
from natural language texts. It works by mixing different techniques in order
to identify relevant terms/concepts in the text, group them by similarity and
generalize them to identify portions of a hierarchy. Preliminary experiments show
that the approach can be viable, although extensions and refinements are needed
to improve its effectiveness. In particular, a study on how to set standard suitable
weights for concept relevance assessment is needed. A reliable outcome might
help users understand the text content and allow machines to automatically
perform some kind of reasoning on the resulting taxonomy.
References
[1] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning.
Generating typed dependency parses from phrase structure trees. In LREC, 2006.
[2] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT
Press, Cambridge, MA, 1998.
[3] S. Ferilli, M. Biba, T.M. Basile, and F. Esposito. Combining qualitative and
quantitative keyword extraction methods with document layout analysis. In Post-proceedings of the 5th Italian Research Conference on Digital Library Management
Systems (IRCDL-2009), pages 22–33, 2009.
12
[4] S. Ferilli, M. Biba, N. Di Mauro, T.M. Basile, and F. Esposito. Plugging taxonomic
similarity in first-order logic horn clauses comparison. In Emergent Perspectives
in Artificial Intelligence, Lecture Notes in Artificial Intelligence, pages 131–140.
Springer, 2009.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten.
The weka data mining software: an update. SIGKDD Explorations, 11(1):10–18,
2009.
[6] R.W. Hamming. Error detecting and error correcting codes. Bell System Technical
Journal, 29(2):147–160, 1950.
[7] Dan Klein and Christopher D. Manning. Fast exact inference with a factored
model for natural language parsing. In Advances in Neural Information Processing
Systems, volume 15. MIT Press, 2003.
[8] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based
on hub words. In ISMIS’03, pages 93–97, 2003.
[9] A. Maedche and S. Staab. Mining ontologies from text. In EKAW, pages 189–202,
2000.
[10] A. Maedche and S. Staab. The text-to-onto ontology learning environment. In
ICCS-2000 - Eight International Conference on Conceptual Structures, Software
Demonstration, 2000.
[11] Bernardo Magnini and Gabriela Cavaglià. Integrating subject field codes into WordNet. In Proceedings of LREC-2000, pages 1413–1418, 2000.
[12] Yutaka Matsuo and Mitsuru Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on
Artificial Intelligence Tools, 13(1), 2004.
[13] N. Ogata. A formal ontology discovery from web documents. In Web Intelligence:
Research and Development, First Asia-Pacific Conference (WI 2001), number
2198 in Lecture Notes on Artificial Intelligence, pages 514–519. Springer-Verlag,
2001.
[14] Paola Velardi, Roberto Navigli, Alessandro Cucchiarelli, and Francesca Neri. Evaluation of OntoLearn, a methodology for automatic population of domain ontologies. In Paul Buitelaar, Philipp Cimiano, and Bernardo Magnini, editors, Ontology
Learning from Text: Methods, Applications and Evaluation. IOS Press, 2006.
[15] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing.
Commun. ACM, 18:613–620, November 1975.
[16] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138, Morristown, NJ, USA, 1994. Association for Computational
Linguistics.
PatTexSum: A Pattern-based Text Summarizer
Elena Baralis, Luca Cagliero, Alessandro Fiori, and Saima Jabeen
elena.baralis,luca.cagliero,alessandro.fiori,[email protected]
Affiliation: Politecnico di Torino. Corso Duca degli Abruzzi, 24 10129 Torino, Italy.
Tel: +390110907194 Fax:+390110907099
Abstract. In the last decade the growth of the Internet has made a
huge amount of textual documents available in electronic form. Text
summarization is commonly based on clustering or graph-based methods
and usually relies on the bag-of-word sentence representation. Frequent
itemset mining is a widely used exploratory technique to discover relevant
correlations among data. The well-established application of frequent
itemsets to large transactional datasets prompts their usage in the context of document summarization as well.
This paper proposes a novel multi-document summarizer, namely PatTexSum (Pattern-based Text Summarizer), that is mainly based on a
pattern-based model, i.e., a model composed of frequent itemsets. Unlike
previously proposed approaches, PatTexSum selects the most representative
and non-redundant sentences to include in the summary by considering
both (i) the most informative and non-redundant itemsets extracted from
document collections tailored to the transactional data format, and (ii) a
sentence score based on the tf-idf statistic. Experiments conducted on
a collection of real news articles show the effectiveness of the proposed
approach.
1 Introduction
Since the birth of the Internet, analysts have been able to access and analyze
progressively larger data collections. Since the large majority of the information is available
in textual form, a challenging task is to convey the most relevant information
provided by textual documents into short and concise summaries.
Many document summarization approaches have been proposed in the literature. Most of them select the most representative sentences to include in the
summaries by means of the following approaches: (i) clustering (e.g., [13, 20]),
(ii) graph-based methods (e.g., [12]), and (iii) linear programming (e.g., [15]).
Clustering-based approaches exploit clustering algorithms to group sentences
and select representatives among each group. For instance, MEAD [13] evaluates the similarity between the document sentences and the centroids and selects, similarly to [6], the most relevant sentences among each document cluster
based on the tf-idf statistical measure [16]. Differently, in [20] an incremental
hierarchical clustering algorithm is exploited to update summaries over time.
The graph-based approaches try to represent correlations among sentences by
means of a graph-based model. According to this model, sentences are represented by graph nodes, while the edges weigh the strength of the correlation
between couples of sentences. The most representative sentences are selected according to graph-based indexing strategies. For instance, [12] proposes to rank
sentences based on the eigenvector centrality computed by means of the well-known PageRank algorithm [5]. Finally, the linear programming methods identify the most representative sentences by maximizing ad-hoc objective functions.
For instance, in [15] the authors formalized the extractive summarization task as
a maximum coverage problem with Knapsack constraints based on the
bag-of-word sentence representation and enforce additional constraints based on
sentence relevance within each document. Most of the aforementioned approaches
rely on the bag-of-word sentence representation and make use of well-founded
statistical measures (e.g., the tf-idf measure [16]).
Frequent itemset mining is a widely used exploratory technique, first introduced
in [1] in the context of market basket analysis, to discover correlations that frequently occur in the analyzed data. A number of approaches focus on discovering
frequent itemsets from transactional data and then selecting their most informative yet non-redundant subset by means of postpruning. To address this issue,
static approaches (e.g., [4, 8]) compare the observed frequency (i.e., the support)
of each itemset in the source transactional data against some null hypotheses (i.e., their expected frequency). Differently, dynamic approaches (e.g., [9,
18]) often make use of the maximum entropy model to take previously selected
patterns into account and, thus, reduce model redundancy. Although the discovery and selection of valuable frequent itemsets from transactional data is
well-established, to the best of our knowledge their usage in document summarization has not been investigated yet.
PatTexSum (Pattern-based Text Summarizer) is a novel multi-document
summarization approach that exploits a pattern-based model to select the most
representative and non-redundant sentences belonging to the document collection. It focuses on combining the effectiveness of pattern-based models, composed of highly informative and non-redundant itemsets, in representing correlations among data with the discriminating power of a sentence evaluation measure based on the tf-idf statistic. Pattern-based model generation focuses on extracting and selecting valuable frequent itemsets from a transactional representation of the document collection. To this aim, an efficient and effective approach,
recently proposed in [11] in the context of transactional data, is adopted. [11]
succinctly summarizes transactional data by adopting a heuristic to solve the
maximum entropy model, which allows evaluating itemsets on the fly during their
extraction. This feature makes this approach particularly appealing for its application to text summarization. To effectively discriminate among sentences, an
evaluation score, computed from their bag-of-word representation and based on
the well-founded tf-idf statistic [16], is also considered. PatTexSum combines
the information discovered from both transactional and bag-of-word data representations and adopts an effective greedy approach, first proposed in [2], to solve
the problem of selecting sentences that cover at best the pattern-based model.
To evaluate the PatTexSum performance a suite of experiments on a collection of news articles has been performed. Results, reported in Section 3, show
that PatTexSum significantly outperforms widely used summarizers
in terms of precision, recall, and F-measure.
This paper is organized as follows. Section 2 presents the proposed method
and thoroughly describes its main steps. Section 3 assesses the effectiveness of
the PatTexSum framework in summarizing textual documents, while Section 4
draws conclusions and presents future developments of this work.
2 The PatTexSum method
PatTexSum focuses on summarizing collections of textual documents by exploiting a two-way data representation. Pattern-based model generation relies
on a transactional representation of the document sentences, while the relevance
score evaluation, based on the tf-idf statistic, relies on the bag-of-word sentence representation. A greedy approach is used to effectively combine knowledge
discovered from both data representations and select the most representative sentences to include in the summary. Figure 1 shows the main steps behind the
proposed approach, which will be thoroughly described in the following.
Fig. 1. The PatTexSum method
2.1 Document representation
PatTexSum exploits two different document/sentence representations: (i) the
traditional bag-of-word (BOW) representation and (ii) the transactional data
format. The raw document content is first preprocessed to make it suitable for the
data mining and knowledge discovery process. Stopwords, numbers, and website
URLs are removed to avoid noisy information, while the Wordnet stemming
algorithm [3] is applied to reduce document words to their base or root form (i.e.,
the stem). Let D={d1 , . . . , dn } be a document collection, where each document
dk is composed of a set sentences Sk ={s1k , . . . , szk }. Documents are composed
of a sequence of sentences, each one composed of a set of words. The BOW
representation of the j-th sentence sjk belonging to the k-th document dk of the
collection D is the set of all word stems (i.e., terms) occurring in sjk .
Consider now the set trjk ={w1 , . . . , wl } where trjk ⊆ sjk and wq 6= wr ∀ q 6=
r. It includes the subset of distinct terms occurring in the sentence sjk . To tailor
document sentences to the transactional data format, we consider each document
sentence as a transaction whose items are distinct terms taken from its BOW
representation, i.e., trjk is the transaction that corresponds to the document
sentence sjk . A transactional representation T of the document collection D is
the union of all transactions trjk corresponding to each sentence sjk belonging
to any document dk ∈ D.
The document collection is associated with the statistical measure of the term
frequency-inverse document frequency (tf-idf) that evaluates the relevance of a
word in the whole collection. A more detailed description of the tf-idf statistic
follows. The whole document content can be represented in matrix form as TC,
in which each row represents a distinct term of the document collection while
each column corresponds to a document. Each element tcik of the matrix TC is
the tf-idf value associated with a term wi in the document dk belonging to the
whole collection D. It is computed as follows:

$tc_{ik} = \frac{n_{ik}}{\sum_{r \in \{q\,:\, w_q \in d_k\}} n_{rk}} \cdot \log \frac{|D|}{|\{d_k \in D : w_i \in d_k\}|} \qquad (1)$

where nik is the number of occurrences of the i-th term wi in the k-th document dk,
D is the collection of documents, the sum at the denominator is the sum of the numbers
of occurrences of all terms in the k-th document dk, and the logarithmic factor
represents the inverse document frequency of term wi.
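The tf-idf matrix of Formula 1 could be computed along these lines, where each document is given as a list of stemmed terms; this is a sketch under that assumption, not the actual PatTexSum implementation:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of documents, each a list of stemmed terms.
    Returns {(term, doc_index): tf-idf value} following Formula 1."""
    n_docs = len(docs)
    df = Counter()                                   # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    tc = {}
    for k, doc in enumerate(docs):
        counts = Counter(doc)
        total = sum(counts.values())                 # occurrences of all terms in d_k
        for term, n_ik in counts.items():
            tc[(term, k)] = (n_ik / total) * math.log(n_docs / df[term])
    return tc

tc = tfidf_matrix([["earthquake", "spain", "earthquake"], ["spain", "wedding"]])
```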
2.2 The pattern-based model generation
Frequent itemset mining is a well-established data mining approach that focuses
on discovering recurrences, i.e., itemsets, that frequently occur in the source
data. An itemset I of length k, i.e., a k-itemset, is a set of k distinct items. Let
T be the document collection in the transactional data format. We denote as
D(I) the set of transactions supported by I, i.e., D(I) = {trjk ∈ T | I ⊆ trjk }.
The support of an itemset I is the observed frequency of occurrence of I in D,
i.e., sup(I) = |D(I)| / |T|. Since the problem of discovering all itemsets in a transactional
dataset is computationally intractable [1], itemset mining is commonly driven
by a minimum support threshold min sup.
Given a minimum support threshold min sup and a model size p, PatTexSum generates a pattern-based model that includes the most informative yet
non-redundant set of p frequent itemsets discovered from the document collection T tailored to the transactional data format (Cf. Section 2.1).
Among the large set of previously proposed approaches focused on succinctly
representing transactional data by means of itemsets [8, 17, 18], we adopt a
method recently proposed in [11]. Unlike previous approaches, it exploits an
entropy-based heuristic to drive the mining process and select the most informative yet non-redundant itemsets without the need of postpruning. Its efficiency
and effectiveness in discovering succinct transactional data summaries make it
particularly suitable for the application to text summarization.
2.3 Sentence evaluation and selection
The PatTexSum method exploits the pattern-based model to evaluate and
select the most relevant sentences to include in the summary. Sentence evaluation
and selection steps consider (i) a sentence relevance score that combines the
tf-idf statistic [16] associated with each sentence term, and (ii) the sentence
coverage with respect to the generated pattern-based model (Cf. Section 2.2).
In the following we formalize both sentence coverage and relevance.
Sentence relevance score. The relevance score of a sentence is evaluated by
using the bag-of-word document representation. It is computed as the sum of
the tf-idf values (Cf. Formula 1) of each term belonging to the sentence in the
document collection. In Formula 2 the score expression for a generic sentence
sjk belonging to the document collection D is reported:

$SR(s_{jk}) = \frac{\sum_{i\,|\,w_i \in s_{jk}} tc_{ik}}{|t_{jk}|} \qquad (2)$

where |tjk| is the number of distinct terms occurring in sjk, and the numerator
is the sum of the tf-idf values associated with the terms (i.e., word stems) in sjk (Cf.
Formula 1).
Sentence model coverage. The sentence coverage measures the pertinence of
each sentence to the generated pattern-based model. To this aim, it considers
document sentences tailored to the transactional data format. Let D be the
collection of documents, i.e., a set of sentences. We first associate with each
sentence sjk ∈ D a binary vector, denoted in the following as the sentence coverage
vector (SC), SCjk = {sc1, . . . , scp}, where p is the number of itemsets belonging
to the model and sci = 1trjk(Ii) indicates whether itemset Ii is included or not
in trjk. More formally, 1trjk is an indicator function defined as follows:

$\mathbf{1}_{tr_{jk}}(I_i) = \begin{cases} 1 & \text{if } I_i \subseteq tr_{jk}, \\ 0 & \text{otherwise} \end{cases} \qquad (3)$
Algorithm 1 Sentence selection - Greedy approach
Input: set of sentence relevance scores SR, set of sentence coverage vectors SC, tf-idf matrix TC
Output: summary S
1: {Initializations}
2: S = ∅
3: ESC = ∅ {set of eligible sentence coverage vectors}
4: SC* = all_zeros() {summary coverage vector with only 0s}
5: {Cycle until either SC* contains only 1s or all the SC vectors contain only zeros}
6: while not (summary_coverage_vector_all_ones() or sentence_coverage_vectors_only_zeros()) do
7:   {Determine the sentences with the highest number of ones}
8:   ESC = max_ones_sentences()
9:   if ESC != ∅ then
10:    {Select the sentence with maximum relevance score}
11:    SC_best = ESC[1]
12:    for all t ∈ ESC[2:] do
13:      if SR_t > SR_best then
14:        SC_best = SC_t
15:      end if
16:    end for
17:    {Update sets and summary coverage vector}
18:    S = S ∪ SC_best
19:    SC* = SC* OR SC_best
20:    ESC = ESC \ SC_best
21:    {Update the sentence coverage vectors belonging to SC}
22:    for all SC_i in SC do
23:      SC_i = SC_i AND SC*
24:    end for
25:  else
26:    break
27:  end if
28: end while
29: return S
The coverage of a sentence sjk with respect to the pattern-based model is
defined as the number of 1’s that occur in the corresponding coverage vector
SCjk .
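For a single sentence, the relevance score of Formula 2 and the coverage vector of Formula 3 could be derived as in the following sketch, assuming the tf-idf matrix, the sentence transaction and the mined model itemsets are available (all names are illustrative):

```python
def sentence_relevance(sentence_terms, doc_index, tc):
    """Formula 2: average tf-idf value of the distinct terms of the sentence."""
    terms = set(sentence_terms)
    return sum(tc.get((t, doc_index), 0.0) for t in terms) / len(terms)

def coverage_vector(transaction, itemsets):
    """Formula 3: one bit per model itemset, set when the itemset is contained
    in the sentence transaction."""
    tr = set(transaction)
    return [1 if set(itemset) <= tr else 0 for itemset in itemsets]

model = [("microsoft", "skype"), ("purchase", "skype")]
print(coverage_vector(["microsoft", "purchase", "skype"], model))   # -> [1, 1]
```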
We formalize the problem of selecting the most informative and non-redundant sentences according to the pattern-based model as a set covering problem.
The set covering problem. A set covering algorithm focuses on selecting the
minimum set of sentences, of arbitrary size l, whose logical OR of coverage vectors,
i.e., SC* = SC1 ∨ . . . ∨ SCl, generates a binary vector composed of all 1's. This
implies that each itemset belonging to the model is covered by at least one selected sentence.
The SC ∗ vector will be denoted as the summary coverage vector throughout the
paper.
The set covering problem is known to be NP-hard. To solve the problem, we
adopt a greedy strategy that we already proved to be effective in summarization of biological microarray data [2]. In order to build an accurate yet concise
summary, the sentence coverage with respect to the pattern-based model is considered as the most discriminative feature, i.e., sentences that cover the maximum number of itemsets belonging to the model are selected first. In case of ties,
the sentence with maximal coverage that is characterized by the highest
relevance score SR is preferred.
The adopted algorithm identifies, at each step, the sentence sjk with the
best complementary vector SCjk with respect to the current summary coverage
vector SC ∗ . The pseudo-code of the greedy approach is reported in Algorithm 1.
It takes in input the set of sentence relevance scores SR, the set of sentence
coverage vectors SC, and the tf-idf matrix T C. It produces the summary S, i.e.,
the minimal subset of most representative sentences. The first step is the variable
initialization and the sentence coverage vector computation (lines 1-4). Next, the
sentence with maximum coverage, i.e., the one whose coverage vector contains
the highest number of ones, is iteratively selected (line 7). At equal terms, the
sentence with maximum relevance score (Cf. Formula 2) is preferred (lines 12-16).
Finally the selected sentence is included in the summary S while the summary
and sentence coverage vectors are updated (lines 18-24). The procedure iterates
until either the summary coverage vector contains only ones, i.e., the model is
fully covered by the summary, or the remaining sentences are not covered by any
itemset, i.e., the remaining sentences are not pertinent to the model (line 6).
Experimental results, reported in Section 3, show that the proposed summarization method performs better than exclusively considering either sentence
coverage or sentence relevance.
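A compact Python rendering of the greedy selection strategy described above (maximum residual coverage first, relevance score as tie-break) is sketched below; it follows the textual description rather than being a literal transcription of Algorithm 1:

```python
def greedy_select(coverages, relevances):
    """coverages: one binary coverage vector per sentence; relevances: the
    corresponding relevance scores. Returns the indices of the selected sentences."""
    covered = [0] * len(coverages[0])        # summary coverage vector SC*
    remaining = list(range(len(coverages)))
    summary = []
    while 0 in covered and remaining:
        # residual coverage: model itemsets covered by the sentence but not yet by the summary
        gain = lambda i: sum(b and not c for b, c in zip(coverages[i], covered))
        best = max(remaining, key=lambda i: (gain(i), relevances[i]))
        if gain(best) == 0:                  # remaining sentences are not pertinent to the model
            break
        summary.append(best)
        covered = [c or b for c, b in zip(covered, coverages[best])]
        remaining.remove(best)
    return summary

print(greedy_select([[1, 0, 1], [0, 1, 0], [1, 1, 0]], [0.3, 0.2, 0.5]))   # -> [2, 0]
```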
3 Experimental results
We conducted a set of experiments to address the following issues: (i) the effectiveness of the proposed summarization approach against two widely used
summarizers, i.e., the Open Text Summarizer (OTS) [14] and TexLexAn [19]
(Section 3.1), and (ii) the impact of the pattern-based model size and the support threshold on the performance of PatTexSum (Section 3.2).
We evaluated all the summarization approaches on a collection of real-life
news articles. To this aim, the 10 top-ranked news documents, provided by the
Google web search engine (http://www.google.com), that concern the following
recent news topics have been selected:
– Natural Disaster: Earthquake in Spain 2011
– Royal Wedding: Prince William and Kate Middleton wedding
– Technology: Microsoft purchased Skype
– Education: Wealthy parents could buy their children places at elite universities
– Sport: Australia defeat Pakistan in Azlan Shah Hockey
The datasets related to the above news categories are made available for
research purposes, upon request to the authors.
To compare the results of PatTexSum with those of OTS [14] and TexLexAn [19],
we used the ROUGE [10] toolkit (version 1.5.5), which is widely applied by the
Document Understanding Conference (DUC) for document summarization performance evaluation1 . It measures the quality of a summary by counting the
unit overlaps between the candidate summary and a set of reference summaries.
1
The provided command is: ROUGE-1.5.5.pl -e data -x -m -2 4 -u -c 95 -r 1000 -n 4
-f A -p 0.5 -t 0 -d -a
Intuitively, the summarizer that achieves the highest ROUGE scores could be
considered as the most effective one. Several automatic evaluation scores are
implemented in ROUGE. For the sake of brevity, we reported only ROUGE-2
and ROUGE-4 as representative scores. Analogous results have been obtained
for the other scores.
Since a "golden summary" (i.e., the optimal document collection summary)
is not available for web news documents, we performed a leave-one-out cross
validation. More specifically, for each category we summarized nine out of ten
news documents and we compared the resulting summary with the remaining
(not yet considered) document, which has been selected as golden summary
at this stage. Next, we tested all other possible combinations by varying the
golden summary and we computed the average performance results, in terms of
precision, recall, and F-measure, achieved by each summarizer for both ROUGE-2 and ROUGE-4.
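Schematically, the leave-one-out protocol can be sketched as follows (illustrative only; rouge_scores stands for a hypothetical wrapper around the ROUGE toolkit that returns precision, recall and F-measure):

```python
def leave_one_out(documents, summarize, rouge_scores):
    """For each document in a category, use it as the reference ('golden')
    summary and summarize the remaining ones; return the averaged scores."""
    results = []
    for i, golden in enumerate(documents):
        others = documents[:i] + documents[i + 1:]
        candidate = summarize(others)                    # summary of the other nine
        results.append(rouge_scores(candidate, golden))  # (precision, recall, F)
    n = len(results)
    return tuple(sum(r[k] for r in results) / n for k in range(3))
```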
3.1 Performance comparison and validation
We evaluated the performance, in terms of ROUGE-2 and ROUGE-4 precision
(Pr), recall (R), and F-measure (F), of PatTexSum against OTS and TexLexAn.
For both OTS and TexLexAn we adopted the configuration suggested by the
respective authors. For PatTexSum we enforced a minimum support threshold
min sup=1.5% and we tuned the value of the pattern-based model size p to
its best value for each considered dataset. A more detailed discussion on the
impact of both min sup and p on the performance of PatTexSum is reported
in Section 3.2.
PatTexSum performs better than the other considered summarizers on all
tested datasets. To validate the statistical significance of PatTexSum performance improvement against OTS and TexLexAn, we used the paired t-test [7]
at significance level p-value = 0.05 for all evaluated datasets and measures. For
ROUGE-2, PatTexSum provides significantly better results than OTS, whose
summarization approach is mainly based on tf-idf measure, and TexLexAn in
terms of precision and/or recall on 3 out of 5 datasets (i.e., Natural disaster, Technology and Sports). Moreover, PatTexSum significantly outperforms
TexLexAn and OTS in terms of F-measure (i.e., the harmonic mean of precision and recall [16]) on, respectively, 2 and 3 of them (i.e., Natural disaster and
Technology for both, and Sports for TexLexAn). Similar results were obtained
for ROUGE-4.
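As an illustration, such a paired two-tailed t-test can be computed with scipy (a sketch with made-up per-fold F-measure values, not the actual scores reported in Tables 1 and 2):

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold ROUGE-2 F-measures of two summarizers on the same 5 folds.
pattexsum = [0.21, 0.19, 0.22, 0.20, 0.23]
ots       = [0.07, 0.06, 0.08, 0.06, 0.07]

t_stat, p_value = ttest_rel(pattexsum, ots)   # paired two-tailed t-test
significant = p_value < 0.05                  # 5% significance level
print(t_stat, p_value, significant)
```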
3.2 PatTexSum parameter analysis
                     |      PatTexSum      |        OTS          |      TexLexAn
dataset           p  | R     Pr    F       | R     Pr    F       | R     Pr    F
Natural Disaster  16 | 0.116 0.288 0.141   | 0.040 0.120 0.053   | 0.038 0.114 0.045
Royal Wedding     12 | 0.036 0.215 0.058   | 0.034 0.174 0.054   | 0.030 0.150 0.047
Technology        5  | 0.141 0.465 0.210   | 0.042 0.208 0.067   | 0.042 0.172 0.065
Sports            10 | 0.145 0.297 0.189   | 0.055 0.133 0.075   | 0.071 0.149 0.093
Education         8  | 0.039 0.241 0.064   | 0.036 0.170 0.054   | 0.034 0.150 0.051
Table 1. Performance comparison in terms of ROUGE-2 score.

                     |      PatTexSum      |        OTS          |      TexLexAn
dataset           p  | R     Pr    F       | R     Pr    F       | R     Pr    F
Natural Disaster  16 | 0.060 0.125 0.068   | 0.005 0.012 0.006   | 0.005 0.011 0.006
Royal Wedding     12 | 0.009 0.082 0.015   | 0.003 0.018 0.005   | 0.003 0.018 0.005
Technology        5  | 0.113 0.356 0.167   | 0.009 0.065 0.016   | 0.003 0.011 0.005
Sports            10 | 0.059 0.112 0.077   | 0.004 0.010 0.006   | 0.022 0.036 0.027
Education         8  | 0.017 0.141 0.030   | 0.003 0.012 0.005   | 0.003 0.009 0.004
Table 2. Performance comparison in terms of ROUGE-4 score.

We analyzed the impact of the minimum support threshold and the pattern-based model size, i.e., the number of generated itemsets, on the performance
of the PatTexSum summarizer. To also test the impact of the tf-idf statistic on the performance of the pattern-based summarizer, we evaluated (i) neglecting the relevance score evaluation (i.e., simply selecting the top-ranked maximal-coverage sentence provided by the itemset miner [11]), and (ii) considering other statistical measures in place of the tf-idf score. Among all the evaluated scores, the tf-idf statistic turns out to be the most effective measure in discriminating among sentences.
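A minimal sketch of a tf-idf based sentence relevance score is given below (an illustrative simplification; the exact weighting of Formula 2 may differ):

```python
import math
from collections import Counter

def tfidf_relevance(sentences, documents):
    """Score each sentence (a list of terms) by the average tf-idf of its terms,
    with idf computed over the document collection (each document is a term list)."""
    n_docs = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in df}
    scores = []
    for sent in sentences:
        tf = Counter(sent)
        score = sum(tf[t] / len(sent) * idf.get(t, 0.0) for t in tf)
        scores.append(score / max(len(tf), 1))
    return scores
```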
In Figures 2(a) and 2(b) we reported the F-measure achieved by PatTexSum, by either considering or not the relevance score in the sentence evaluation,
and by varying, respectively, the support threshold on Technology and the model
size on the Natural Disaster document collection. For the sake of brevity, we reported only the results obtained with the ROUGE-4 score. Analogous results
have been obtained for the other ROUGE scores, for precision and recall measures, and for all other configurations.
The usage of the relevance score based on the tf-idf statistic always improves
the performance of PatTexSum in the range of those values of p and min sup
yielding the highest F-measure. This improvement is due to its ability to discriminate well the occurrence of sentence terms across documents. When higher support
thresholds (e.g., 5%) are enforced, many informative patterns are discarded,
thus the model becomes too general to yield high summarization performance.
Conversely, when very low support thresholds (e.g., 0.1%) are enforced, data overfitting occurs, i.e., the model is too specialized to effectively and concisely summarize the whole document collection content. At medium support thresholds (e.g., 1.5%) the best balance between model specialization and generalization is achieved; thus, PatTexSum produces very concise yet informative summaries.
Fig. 2. PatTexSum performance analysis with and without the relevance score (SR), in terms of ROUGE-4 F-measure: (a) Technology, p=5, impact of the support threshold; (b) Natural Disaster, min_sup=1.5%, impact of the pattern-based model size.
The model size may also significantly affect the summarization performance.
When a limited number of itemsets (e.g., p = 6) is selected, the relevant knowledge hidden in the news category Natural Disaster is not yet fully covered by
the extracted patterns (see Figure 2(b)), thus the generated summaries are not
highly informative. When p = 16 the pattern-based model provides the most
informative and non-redundant knowledge. Consequently, the multi-document
pattern-based summarization becomes very effective. When a higher number of
itemsets is included in the model, the quality of the generated summaries worsens, as the model is still informative but redundant. The best values of model size and support threshold for each news category depend on the term distribution of the analyzed documents.
4 Conclusions and future works
This paper presents a multi-document summarizer that combines the knowledge
provided by a pattern-based model, composed of frequent itemsets, with a statistical evaluation, based on the well-founded tf-idf measure, to select the most representative and non-redundant sentences. Although the application of frequent itemsets to represent the most valuable correlations in transactional data is well-established, their usage in text summarization had not been investigated so far. The proposed summarizer exploits a greedy approach to combine knowledge discovered from two different data representations, i.e., the transactional and bag-of-words representations, and to select the minimal set of most relevant
sentences. Experiments conducted on real-life news articles show both the effectiveness and the efficiency of the proposed text summarization method.
Future work will address: (i) the extension of the proposed approach to
address the problem of incremental summary updating, and (ii) the exploitation
of new techniques to address the set covering problem.
References
1. R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of
items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216, 1993.
2. E. Baralis, G. Bruno, and A. Fiori. Minimum number of genes for microarray
feature selection. 30th Annual International Conference of the IEEE Engineering
in Medicine and Biology Society (EMBC-08), pages 5692–5695, 2008.
3. S. Bird, E. Klein, and E. Loper. Natural language processing with Python. O’Reilly
Media, 2009.
4. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing
association rules to correlations. In SIGMOD Conference, pages 265–276, 1997.
5. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine.
In Proceedings of the seventh international conference on World Wide Web 7, pages
107–117, 1998.
6. J. M. Conroy, J. Goldstein, J. D. Schlesinger, and D. P. O'Leary. Left-brain/right-brain multi-document summarization. In Proceedings of the Document Understanding Conference, 2004.
7. T. G. Dietterich. Approximate statistical test for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1998.
8. S. Jaroszewicz and D. A. Simovici. Interestingness of frequent itemsets using
bayesian networks as background knowledge. In Proceedings of the tenth ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
178–186, 2004.
9. K.-N. Kontonasios and T. D. Bie. An information-theoretic approach to finding
informative noisy tiles in binary databases. In SIAM International Conference on
Data Mining, pages 153–164, 2010.
10. C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram cooccurrence statistics. In Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language
Technology - Volume 1, pages 71–78, 2003.
11. M. Mampaey, N. Tatti, and J. Vreeken. Tell me what I need to know: Succinctly
summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2011.
12. D. R. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:2004, 2004.
13. D. R. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization of
multiple documents. Information Processing and Management, 40(6):919 – 938,
2004.
14. N. Rotem. Open text summarizer (ots). Retrieved July, 3(2006):2006, 2003.
15. H. Takamura and M. Okumura. Text summarization model based on the budgeted
median problem. In Proceeding of the 18th ACM conference on Information and
knowledge management, pages 1589–1592, 2009.
16. P. Tan, M. Steinbach, V. Kumar, et al. Introduction to data mining. Pearson
Addison Wesley Boston, 2006.
17. N. Tatti. Probably the best itemsets. In Proceedings of the 16th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 293–302,
2010.
18. N. Tatti and H. Heikinheimo. Decomposable families of itemsets. In Proceedings of the European conference on Machine Learning and Knowledge Discovery in
Databases - Part II, pages 472–487, 2008.
19. TexLexAn. Texlexan: An open-source text summarizer, 2011.
20. D. Wang and T. Li. Document update summarization using incremental hierarchical clustering. In Proceedings of the 19th ACM international conference on
Information and knowledge management, pages 279–288, 2010.
An Expectation Maximization Algorithm for
Probabilistic Logic Programs
Elena Bellodi and Fabrizio Riguzzi
ENDIF – Università di Ferrara – Via Saragat, 1 – 44122 Ferrara, Italy.
{elena.bellodi,fabrizio.riguzzi}@unife.it
Abstract. Recently much work in Machine Learning has concentrated
on representation languages able to combine logic and probability, leading to the birth of a whole field called Statistical Relational Learning.
In this paper we present a technique for parameter learning targeted
to a family of formalisms where uncertainty is represented using Logic
Programming tools - the so-called Probabilistic Logic Programs such as
ICL, PRISM, ProbLog and LPAD. Since their equivalent Bayesian networks contain hidden variables, an EM algorithm is adopted. To speed up the computation, expectations are computed directly on the Binary Decision Diagrams that are built for inference. The resulting system, called
EMBLEM for “EM over BDDs for probabilistic Logic programs Efficient
Mining”, has been applied to various datasets and showed good performances both in terms of speed and memory.
1 Introduction
In the field of Statistical Relational Learning (SRL) logical-statistical languages
are used to effectively learn in complex domains involving relations and uncertainty. They have been successfully applied in social networks analysis, entity
recognition, information extraction, etc.
Similarly, a large number of works in Logic Programming has attempted
to combine logic and probability, among which the distribution semantics [11]
is a prominent approach. It underlies for example PRISM [11], the Independent Choice Logic, Logic Programs with Annotated Disjunctions (LPADs) [15],
ProbLog [3] and CP-logic. The approach is appealing because efficient inference
algorithms appeared [3,9], which adopt Binary Decision Diagrams (BDD).
In this paper we present the EMBLEM system for “EM over BDDs for probabilistic Logic programs Efficient Mining” that learns parameters of probabilistic
logic programs under the distribution semantics by using an Expectation Maximization (EM) algorithm, an iterative method to find maximum likelihood estimates of some unknown parameters Θ of a model, given a dataset where some of the data is missing. The translation of these programs into
graphical models requires the use of hidden variables and therefore of EM: the
main characteristic of our system is the computation of expectations using BDDs.
Since there are transformations with linear complexity that can convert a program in one of these languages into the others [2], we will use LPADs for their general syntax.
EMBLEM has been tested on the IMDB, Cora and UW-CSE datasets and compared with RIB [10], LeProbLog [3], Alchemy [8] and CEM, an implementation
of EM based on the cplint interpreter [9].
The paper is organized as follows. Section 2 presents LPADs and Section 3
describes EMBLEM. Section 4 presents experimental results. Section 5 discusses
related works and Section 6 concludes the paper.
2 Logic Programs with Annotated Disjunctions
Formally, a Logic Program with Annotated Disjunctions [15] consists of a finite set of annotated disjunctive clauses. An annotated disjunctive clause $C_i$ is of the form $h_{i1}:\Pi_{i1}; \ldots; h_{in_i}:\Pi_{in_i}$ :- $b_{i1},\ldots,b_{im_i}$. In such a clause $h_{i1},\ldots,h_{in_i}$ are logical atoms and $b_{i1},\ldots,b_{im_i}$ are logical literals, while $\{\Pi_{i1},\ldots,\Pi_{in_i}\}$ are real numbers in the interval $[0,1]$ such that $\sum_{k=1}^{n_i}\Pi_{ik}\le 1$. $b_{i1},\ldots,b_{im_i}$ is called the body and is indicated with $body(C_i)$. If $\sum_{k=1}^{n_i}\Pi_{ik}<1$, the head of the annotated disjunctive clause implicitly contains an extra atom $null$ that does not appear in the body of any clause and whose annotation is $1-\sum_{k=1}^{n_i}\Pi_{ik}$. We denote by $ground(T)$ the grounding of an LPAD $T$.
An atomic choice is a triple (Ci , θj , k) where Ci ∈ T , θj is a substitution
that grounds Ci and k ∈ {1, . . . , ni }. (Ci , θj , k) means that, for the ground
clause Ci θj , the head hik was chosen. In practice Ci θj corresponds to a random
variable Xij and an atomic choice (Ci , θj , k) to an assignment Xij = k. A set of
atomic choices κ is consistent if (C, θ, i) ∈ κ, (C, θ, j) ∈ κ ⇒ i = j, i.e., only one
head is selected for the same ground clause. A composite choice κ is a consistent
set of atomic choices. The probability $P(\kappa)$ of a composite choice $\kappa$ is the product of the probabilities of the individual atomic choices, i.e. $P(\kappa)=\prod_{(C_i,\theta_j,k)\in\kappa}\Pi_{ik}$.
A selection σ is a composite choice that, for each clause Ci θj in ground(T ),
contains an atomic choice (Ci , θj , k). We denote the set of all selections σ of a
program T by ST . A selection σ identifies a normal logic program wσ defined
as wσ = {(hik ← body(Ci ))θj |(Ci , θj , k) ∈ σ}. wσ is called a world of T . Since
selections are composite
choices we can assign a probability to possible worlds:
Q
P (wσ ) = P (σ) = (Ci ,θj ,k)∈σ Πik . We consider only sound LPADs in which
every possible world has a total well-founded model. Subsequently we will write
wσ |= Q to mean that the query Q is true in the well-founded model of the
program wσ .
The probability of a query Q according to an LPAD T is given by $P(Q)=\sum_{\sigma\in E(Q)}P(\sigma)$, where $E(Q)=\{\sigma\in S_T \mid w_\sigma\models Q\}$, i.e., the set of selections
corresponding to worlds where the query is true. To reduce the computational
cost of answering queries in our experiments, random variables can be directly
associated to clauses rather than to their ground instantiations: atomic choices
then take the form (Ci , k), meaning that head hik is selected from program clause
Ci , i.e., that Xi = k.
Example 1. The following LPAD T encodes a very simple model of the development of an epidemic or pandemic:
C1 = epidemic : 0.6 ; pandemic : 0.3 :- flu(X), cold.
C2 = cold : 0.7.
C3 = flu(david).
C4 = flu(robert).
Clause C1 has two groundings, C1 θ1 with θ1 = {X/david} and C1 θ2 with θ2 =
{X/robert}, so there are two random variables X11 and X12 .
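To make the distribution semantics concrete, the following sketch (not part of the EMBLEM system) enumerates the selections of Example 1 and sums the probabilities of the worlds in which epidemic is derivable:

```python
from itertools import product

# Head distributions of the ground probabilistic clauses of Example 1:
# two groundings of C1 (X/david, X/robert) and one grounding of C2.
C1 = {"epidemic": 0.6, "pandemic": 0.3, "null": 0.1}
C2 = {"cold": 0.7, "null": 0.3}
ground_clauses = [("C1_david", C1), ("C1_robert", C1), ("C2", C2)]

p_query = 0.0
# A selection picks one head atom per ground clause; its probability is
# the product of the chosen annotations.
for choice in product(*(c.items() for _, c in ground_clauses)):
    heads = [h for h, _ in choice]
    prob = 1.0
    for _, p in choice:
        prob *= p
    # epidemic is derivable iff cold holds and at least one C1 grounding
    # selected the head 'epidemic' (both bodies flu(X), cold are then satisfied).
    if "cold" in heads[2:] and "epidemic" in heads[:2]:
        p_query += prob

print(p_query)   # ~0.588 = 0.7 * (1 - 0.4 * 0.4)
```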
The possible worlds in which a query is true can be represented using a Multivalued Decision Diagram (MDD). An MDD represents a function f (X) taking
Boolean values on a set of multivalued variables X by means of a rooted graph
that has one level for each variable. Each node is associated to the variable of
its level and has one child for each possible value of the variable. The leaves
store either 0 or 1. Given values for all the variables X, we can compute the
value of f (X) by traversing the graph starting from the root and returning the
value associated to the leaf that is reached. A MDD can be used to represent
the set E(Q) by considering the multivalued variable Xij associated to Ci θj of
ground(T). $X_{ij}$ has values $\{1,\ldots,n_i\}$ and the atomic choice $(C_i,\theta_j,k)$ corresponds to the propositional equation $X_{ij}=k$. If we represent with an MDD the function $f(\mathbf{X})=\bigvee_{\sigma\in E(Q)}\bigwedge_{(C_i,\theta_j,k)\in\sigma}(X_{ij}=k)$, then the MDD will have a
path to a 1-leaf for each possible world where Q is true. While building MDDs
simplification operations can be applied that delete or merge nodes. In this way
a reduced MDD is obtained with respect to a Multivalued Decision Tree (MDT),
i.e., a MDD in which every node has a single parent, all the children belong to
the level immediately below and all the variables have at least one node. For
example, the reduced MDD corresponding to the query epidemic from Example 1 is shown in Figure 1(a). The labels on the edges represent the values of
the variable associated to the node: nodes at first and second level have three
outgoing edges, corresponding to the values of X11 and X12 , since C1 has three
head atoms (epidemic, pandemic, null); similarly X21 has two values since C2
has two head atoms (cold, null), hence the associated node has two outgoing
edges.
Fig. 1. Decision diagrams for Example 1: (a) MDD; (b) BDD.
It is often unfeasible to find all the worlds where the query is true so inference
algorithms find instead explanations for it, i.e. composite choices such that the
query is true in all the worlds whose selections are a superset of them. Explanations however, differently from possible worlds, are not necessarily mutually
exclusive with respect to each other, but exploiting the fact that MDDs split
paths on the basis of the values of a variable and the branches are mutually
disjoint, the probability of the query can be computed.
Most packages for the manipulation of a decision diagram are however restricted to work on Binary Decision Diagrams (BDD), i.e., decision diagrams
where all the variables are Boolean. A node n in a BDD has two children: the
1-child, indicated with child1 (n), and the 0-child, indicated with child0 (n). The
0-branch, the one going to the 0-child, is drawn with a dashed line.
To work on MDDs with a BDD package we must represent multivalued variables by means of binary variables. For a multivalued variable $X_{ij}$, corresponding to ground clause $C_i\theta_j$ and having $n_i$ values, we use $n_i-1$ Boolean variables $X_{ij1},\ldots,X_{ijn_i-1}$ and we represent the equation $X_{ij}=k$ for $k=1,\ldots,n_i-1$ by means of the conjunction $\overline{X_{ij1}}\wedge\ldots\wedge\overline{X_{ijk-1}}\wedge X_{ijk}$, and the equation $X_{ij}=n_i$ by means of the conjunction $\overline{X_{ij1}}\wedge\ldots\wedge\overline{X_{ijn_i-1}}$. BDDs obtained in this way can be used as well for computing the probability of queries by associating to each Boolean variable $X_{ijk}$ a parameter $\pi_{ik}$ that represents $P(X_{ijk}=1)$. If we define $g(i)=\{j \mid \theta_j$ is a substitution grounding $C_i\}$, then $P(X_{ijk}=1)=\pi_{ik}$ for all $j\in g(i)$. The parameters are obtained from those of the multivalued variables, up to $k=n_i-1$, in this way: $\pi_{i1}=\Pi_{i1}$, ..., $\pi_{ik}=\Pi_{ik}/\prod_{j=1}^{k-1}(1-\pi_{ij})$. Figure 1(b) shows the reduced BDD corresponding to the MDD on the left, with binary variables for each level.
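The conversion from the annotations of a clause to the parameters of its Boolean variables can be sketched as follows (an illustrative snippet, using clause C1 of Example 1 as input):

```python
def to_binary_params(Pi):
    # Pi: annotations of the n_i head atoms (including the implicit null),
    # summing to 1. Returns the n_i - 1 Boolean parameters pi_k.
    pi, remaining = [], 1.0
    for p in Pi[:-1]:
        pi.append(p / remaining)       # pi_k = Pi_k / prod_{j<k}(1 - pi_j)
        remaining *= 1.0 - pi[-1]
    return pi

# Clause C1 of Example 1: epidemic:0.6, pandemic:0.3, null:0.1
print(to_binary_params([0.6, 0.3, 0.1]))   # [0.6, 0.75]
```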
3 EMBLEM
EMBLEM applies the algorithm for performing EM over BDDs, proposed in
[14,6], to the problem of learning the parameters of an LPAD. EMBLEM takes
as input a number of goals that represent the examples and for each one generates
the BDD encoding its explanations. The examples are organized in a set of interpretations (sets of ground facts) each describing a portion of the domain of interest. The queries correspond to ground atoms whose predicate has been indicated
as “target” by the user. The predicates can be treated as closed-world or open-world. In the first case the body of clauses with a target predicate in the head
is resolved only with facts in the interpretation, in the second case it is resolved
both with facts in the interpretation and with clauses in the theory. If the latter
option is set and the theory is cyclic, we use a depth bound on SLD-derivations
to avoid going into infinite loops. Given a program containing the clauses C1 and
C2 from Example 1 and the interpretation {epidemic, f lu(david), f lu(robert)},
we obtain the BDD in Figure 1(b) that represents the query epidemic.
Then EMBLEM enters the EM cycle, in which the steps of expectation and
maximization are repeated until the log-likelihood of the examples reaches a
local maximum. For a single example Q:
– Expectation: computes $E[c_{ik0}|Q]$ and $E[c_{ik1}|Q]$ for all rules $C_i$ and $k=1,\ldots,n_i-1$, where $c_{ikx}$ is the number of times a variable $X_{ijk}$ takes value $x$ for $x\in\{0,1\}$, with $j\in g(i)$. $E[c_{ikx}|Q]$ is given by $\sum_{j\in g(i)}P(X_{ijk}=x|Q)$.
– Maximization: computes $\pi_{ik}$ for all rules $C_i$ and $k=1,\ldots,n_i-1$: $\pi_{ik}=\frac{E[c_{ik1}|Q]}{E[c_{ik0}|Q]+E[c_{ik1}|Q]}$.
If we have more than one example, the contributions of each example simply sum up when computing $E[c_{ikx}]$.
$P(X_{ijk}=x|Q)$ is given by $P(X_{ijk}=x|Q)=\frac{P(X_{ijk}=x,Q)}{P(Q)}$, with
$$P(X_{ijk}=x,Q)=\sum_{\sigma\in E(Q)} P(Q,X_{ijk}=x,\sigma)=\sum_{\sigma\in E(Q)} P(Q|\sigma)P(X_{ijk}=x|\sigma)P(\sigma)=\sum_{\sigma\in E(Q)} P(X_{ijk}=x|\sigma)P(\sigma)$$
where P (Xijk = 1|σ) = 1 if (Ci , θj , k) ∈ σ for k = 1, . . . , ni − 1 and 0 otherwise.
Since there is a one to one correspondence between the worlds where Q is
true and the paths to a 1 leaf in a Binary Decision Tree (a MDT with binary
variables),
$$P(X_{ijk}=x,Q)=\sum_{\rho\in R(Q)} P(X_{ijk}=x|\rho)\prod_{d\in\rho}\pi(d)$$
where ρ is a path and if σ corresponds to ρ then P (Xijk = x|σ)=P (Xijk = x|ρ).
R(Q) is the set of paths in the BDD for query Q that lead to a 1 leaf, d is an
edge of ρ and π(d) is the probability associated to the edge: if d is the 1-branch
from a node associated to a variable Xijk , then π(d) = πik , if d is the 0-branch
from a node associated to a variable Xijk , then π(d) = 1 − πik .
Now consider a BDT in which only the merge rule is applied, fusing together
identical sub-diagrams. The resulting diagram, that we call Complete Binary
Decision Diagram (CBDD), is such that every path contains a node for every
level. For a CBDD, $P(X_{ijk}=x,Q)$ can be further expanded as
$$P(X_{ijk}=x,Q)=\sum_{\rho\in R(Q)\wedge(X_{ijk}=x)\in\rho}\ \prod_{d\in\rho}\pi(d)$$
where (Xijk = x) ∈ ρ means that ρ contains an x-edge from a node associated
to $X_{ijk}$. We can then write
$$P(X_{ijk}=x,Q)=\sum_{n\in N(Q)\wedge v(n)=X_{ijk}\wedge \rho^n\in R^n(Q)\wedge \rho_n\in R_n(Q,x)}\ \prod_{d\in\rho^n}\pi(d)\prod_{d\in\rho_n}\pi(d)$$
where N (Q) is the set of nodes of the BDD, v(n) is the variable associated to
node n, $R^n(Q)$ is the set containing the paths from the root to n and $R_n(Q,x)$
is the set of paths from n to the 1 leaf through its x-child.
$$P(X_{ijk}=x,Q)=\sum_{n\in N(Q)\wedge v(n)=X_{ijk}}\ \sum_{\rho^n\in R^n(Q)}\ \sum_{\rho_n\in R_n(Q,x)}\ \prod_{d\in\rho^n}\pi(d)\prod_{d\in\rho_n}\pi(d)$$
$$=\sum_{n\in N(Q)\wedge v(n)=X_{ijk}}\ \Big(\sum_{\rho^n\in R^n(Q)}\prod_{d\in\rho^n}\pi(d)\Big)\Big(\sum_{\rho_n\in R_n(Q,x)}\prod_{d\in\rho_n}\pi(d)\Big)$$
$$=\sum_{n\in N(Q)\wedge v(n)=X_{ijk}} F(n)B(child_x(n))\pi_{ikx}$$
where πikx is πik if x=1 and (1 − πik ) if x=0. F (n) is the forward probability [6],
the probability mass of the paths from the root to n, while B(n) is the backward
probability [6], the probability mass of paths from n to the 1 leaf. If root is the
root of a tree for a query Q then B(root) = P (Q).
The expression F (n)B(childx (n))πikx represents the sum of the probabilities
of all the paths passing through the x-edge of node n and is indicated with ex (n).
Thus
$$P(X_{ijk}=x,Q)=\sum_{n\in N(Q),\,v(n)=X_{ijk}} e_x(n) \qquad (1)$$
For the case of a BDD, i.e., a diagram obtained by applying also the deletion rule, Formula 1 is no longer valid since also paths where there is no node
associated to Xijk can contribute to P (Xijk = x, Q). These paths might have
been obtained from a BDD having a node m associated to variable Xijk that is
a descendant of n along the 0-branch and whose outgoing edges both point to
child0(n). The correction of formula (1) to take this aspect into account is
applied in the Expectation step.
We now describe EMBLEM in detail. EMBLEM’s main procedure consists of
a cycle in which the procedures Expectation and Maximization are repeatedly called. The first one returns the log likelihood LL of the data that is used in
the stopping criterion: EMBLEM stops when the difference between LL of the
current iteration and the one of the previous iteration drops below a threshold
ǫ or when this difference is below a fraction δ of the current LL.
Procedure Expectation takes as input a list of BDDs, one for each example, and computes the expectation for each one, i.e., $P(Q,X_{ijk}=x)$ for all variables $X_{ijk}$ in the BDD. In the procedure we use $\eta^x(i,k)$ to indicate $\sum_{j\in g(i)}P(Q,X_{ijk}=x)$. Expectation first calls GetForward and GetBackward, which compute the forward and backward probabilities of the nodes and $\eta^x(i,k)$ for non-deleted paths only. Then it updates $\eta^x(i,k)$ to take into account deleted paths. The expectations are updated in this way: for all rules i and $k=1,\ldots,n_i-1$, $E[c_{ikx}] = E[c_{ikx}] + \eta^x(i,k)/P(Q)$, where $P(Q)$ is the backward probability of the root. Procedure Maximization computes the parameter values for the
next EM iteration.
Procedure GetForward traverses the diagram one level at a time starting
from the root level, where F(root)=1, and for each node n computes its contribution to the forward probabilities of its children. Function GetBackward
computes the backward probability of nodes by traversing recursively the tree
from the root to the leaves. More details can be found in [1].
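A simplified sketch of this forward-backward computation is given below (illustrative only: it assumes a complete diagram, so the correction for deleted nodes is omitted, and the data structures are hypothetical, not those of the actual implementation):

```python
def expectations(nodes, root, pi):
    """Forward-backward pass over a BDD (sketch, assuming a complete diagram).
    nodes: {id: (var, child0, child1)}, children are node ids or the leaves '0'/'1'.
    pi:    {var: probability of the 1-branch}. Assumes the query has P(Q) > 0."""
    # Backward probability B(n): mass of the paths from n to the 1 leaf.
    B = {"1": 1.0, "0": 0.0}
    def backward(n):
        if n not in B:
            var, c0, c1 = nodes[n]
            B[n] = pi[var] * backward(c1) + (1.0 - pi[var]) * backward(c0)
        return B[n]
    p_query = backward(root)                     # B(root) = P(Q)

    # Forward probability F(n): mass of the paths from the root to n,
    # propagated parents-first (reverse DFS post-order is a topological order).
    order, seen = [], set()
    def visit(n):
        if n in nodes and n not in seen:
            seen.add(n)
            for c in nodes[n][1:]:
                visit(c)
            order.append(n)
    visit(root)
    F = {n: 0.0 for n in nodes}
    F[root] = 1.0
    for n in reversed(order):
        var, c0, c1 = nodes[n]
        for child, p_edge in ((c1, pi[var]), (c0, 1.0 - pi[var])):
            if child in F:
                F[child] += F[n] * p_edge

    # e_x(n) = F(n) * B(child_x(n)) * pi_ikx, accumulated per variable (eta^x),
    # then normalised by P(Q) to obtain the expected counts for one example.
    eta = {v: [0.0, 0.0] for v in pi}
    for n, (var, c0, c1) in nodes.items():
        eta[var][1] += F[n] * B[c1] * pi[var]
        eta[var][0] += F[n] * B[c0] * (1.0 - pi[var])
    return {v: (e[0] / p_query, e[1] / p_query) for v, e in eta.items()}, p_query

# Tiny usage example: a single Boolean variable X with pi = 0.6 and query "X is true".
counts, pQ = expectations({"n": ("X", "0", "1")}, "n", {"X": 0.6})
# counts == {'X': (0.0, 1.0)}, pQ == 0.6
```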
4 Experiments
EMBLEM has been tested over three real world datasets: IMDB1 , UW-CSE2
and Cora3. We implemented EMBLEM in Yap Prolog4 and we compared it
with RIB [10]; CEM, an implementation of EM based on the cplint inference
library [9]; LeProblog [4], and Alchemy [8]. All experiments were performed on
Linux machines with an Intel Core 2 Duo E6550 (2333 MHz) and 4 GB of RAM.
To compare our results with LeProbLog and Alchemy we exploited the translations of LPADs into ProbLog [2] and MLN [10] respectively.
For the probabilistic logic programming systems (EMBLEM, RIB, CEM and
LeProbLog) we consider various options: associating a distinct random variable
to each grounding of a probabilistic clause or a single random variable to a nonground clause, to express whether the clause is used or not (the latter case makes
the problem easier); putting a limit on the depth of derivations, thus eliminating
explanations associated to derivations exceeding the limit (necessary for problems that contain cyclic clauses, such as transitive closure clauses); setting the
number of restarts for EM based algorithms. All experiments for probabilistic
logic programming systems have been performed using open-world predicates.
IMDB regards movies, actors, directors and movie genres and is divided into
five mega-examples. We performed training on four mega-examples and testing
on the remaining one. Then we drew Precision-Recall and ROC curves and computed the areas under them (AUCPR and AUCROC). We defined 4 different LPADs,
two for predicting the target predicate sameperson/2, and two for predicting
samemovie/2. We had one positive example for each fact that is true in the
data, while we sampled from the complete set of false facts three times the
number of true instances in order to generate negative examples.
For predicting sameperson/2 we used the same LPAD of [10]. We ran EMBLEM on it with the following settings: no depth bound (theory is acyclic),
random variables associated to instantiations of the clauses (learning time is
very low) and a number of restarts chosen to match the execution time of EMBLEM with that of the fastest other algorithm.
The queries that LeProbLog takes as input are obtained by annotating with 1.0 each positive example for sameperson/2 and with 0.0 each negative example for sameperson/2 obtained by random sampling. We ran LeProbLog for a maximum of 100 iterations or until the difference in Mean Squared Error (MSE) between two iterations got smaller than $10^{-5}$; this was done also in all the subsequent experiments. For Alchemy we used the preconditioned rescaled conjugate
1 http://alchemy.cs.washington.edu/data/imdb
2 http://alchemy.cs.washington.edu/data/uw-cse
3 http://alchemy.cs.washington.edu/data/cora
4 http://www.dcc.fc.up.pt/~vsc/Yap
gradient discriminative algorithm for every dataset and in this case we specified
sameperson/2 as the only non-evidence predicate.
A second LPAD, also taken from [10], has been created to evaluate the performance of the algorithms when some atoms are unseen. The settings are the
same as the ones for the previous LPAD. In this experiment Alchemy was run
with the −withEM option that turns on EM learning.
Table 1 shows the AUCPR and AUCROC averaged over the five folds for
EMBLEM, RIB, LeProbLog, CEM and Alchemy. Results for the two LPADs
are shown respectively in the IMDB-SP and IMDBu-SP rows. Table 2 shows the
learning times in hours.
For predicting samemovie/2 we used the LPAD:
samemovie(X,Y):p:- movie(X,M),movie(Y,M),actor(M).
samemovie(X,Y):p:- movie(X,M),movie(Y,M),director(M).
samemovie(X,Y):p:- movie(X,A),movie(Y,B),actor(A),director(B),
workedunder(A,B).
samemovie(X,Y):p:- movie(X,A),movie(Y,B),director(A),director(B),
genre(A,G),genre(B,G).
To test the behaviour when unseen predicates are present, we transformed the
program for samemovie/2 as we did for sameperson/2 [10]. We ran EMBLEM
on them with no depth bound, one variable for each instantiation of a rule and
one random restart. With regard to LeProbLog and Alchemy, we ran them with
the same settings as IMDB-SP and IMDBu-SP, by replacing sameperson with
samemovie. Table 1 shows, in the IMDB-SM and IMDBu-SM rows, the average
AUCPR and AUCROC for EMBLEM, LeProbLog and Alchemy. For RIB and CEM we obtained a memory error (indicated with "me").
The Cora database contains citations to computer science research papers.
For each citation we know the title, authors, venue and the words that appear in
them. The task is to determine which citations are referring to the same paper,
by predicting the predicate samebib(cit1,cit2).
From the MLN proposed in [13]5 we obtained two LPADs. The first contains
559 rules and differs from the direct translation of the MLN because rules involving words are instantiated with the different constants, only positive literals for
the hasword predicates are used and transitive rules are not included. The Cora
dataset comprises five mega-examples each containing facts for the four predicates samebib/2, samevenue/2, sametitle/2 and sameauthor/2, which have
been set as target predicates. We ran EMBLEM on this LPAD with no depth
bound (theory is acyclic), a single variable for each instantiation of a rule (learning time is reasonable) and a number of restarts chosen to match the execution
time of EMBLEM with that of the fastest other algorithm.
The second LPAD adds to the previous one the transitive rules for the predicates samebib/2, samevenue/2, sametitle/2, for a total of 563 rules. In this
case we had to run EMBLEM with a depth bound equal to two (theory becomes
cyclic and with higher values of depth learning time was overlong) and a single
5 Available at http://alchemy.cs.washington.edu/mlns/er.
variable for each non-ground rule (LPAD too complex to be treated with a variable for each instantiation); the number of restarts was one. As for LeProbLog,
we separately learned the four predicates because learning the whole theory at
once would cause a memory error. We annotated with 1.0 each positive
example for samebib/2, sameauthor/2, sametitle/2, samevenue/2 and with
0.0 the negative examples for the same predicates, which were contained in the
dataset provided with the MLN. As for Alchemy, we learned weights with the
four predicates as the non-evidence predicates. Table 1 shows in the Cora and
CoraT (Cora transitive) rows the average AUCPR and AUCROC obtained by
training on four mega-examples and testing on the remaining one. CEM and
Alchemy on CoraT gave a memory error while RIB was not applicable because
it was not possible to split the input examples into smaller independent interpretations as required by RIB.
The UW-CSE dataset contains information about the Computer Science department of the University of Washington through 22 different predicates, such
as yearsInProgram/2, advisedBy/2, taughtBy/3 and is split into five mega-examples. The goal here is to predict the advisedBy/2 predicate, namely the
fact that a person is advised by another person: this was our target predicate.
The negative examples have been generated by applying the closed world assumption to advisedBy/2. The theory used was obtained from the MLN of [12]6
and contains 86 rules. We ran EMBLEM on it with a single variable for each
instantiation of a rule, a depth bound of two (cyclic theory) and one random
restart (to limit time, in comparison with the other faster algorithms).
The annotated queries that LeProbLog takes as input have been created by
annotating with 1.0 each positive example and with 0.0 each negative example
for advisedBy/2. As for Alchemy, we learned weights with advisedBy/2 as the
only non-evidence predicate. Table 1 shows the AUCPR and AUCROC averaged
over the five mega-examples for all the algorithms.
Table 3 shows the p-value of a paired two-tailed t-test at the 5% significance level of the difference in AUCPR and AUCROC between EMBLEM and
RIB/LeProbLog/CEM/Alchemy (significant differences in bold).
From the results we can observe that over IMDB EMBLEM has comparable performances with CEM for IMDB-SP, with similar execution time. On
IMDBu-SP it has better performances than all other systems (see AUCPR), with
a learning time equal to the fastest other algorithm. On IMDB-SM it reaches
the highest area value in less time (only one restart is needed). On IMDBu-SM
it still reaches the highest area with one restart but with a longer execution
time. Over Cora it has comparable performances with the best other system
CEM, but in a significantly lower time, and over CoraT it is one of the few systems able to complete learning, with better performances in terms of area (especially AUCPR) and time. Over UW-CSE it has significantly better performances with respect to all the algorithms. Longer learning times are needed for
EMBLEM on IMDBu-SM and UW-CSE datasets, but in both cases AUCPR
achieves significantly higher values. LeProbLog reveals itself to be the closest
6 Available at http://alchemy.cs.washington.edu/mlns/uw-cse.
Table 1. Results of the experiments on all datasets. IMDBu refers to the IMDB dataset
with the theory containing unseen predicates. CoraT refers to the theory containing
transitive rules. Numbers in parenthesis followed by r mean the number of random
restarts (when different from one) to reach the area specified. “me” means memory
error during learning, “no” means that the algorithm was not applicable. AUCPR is
the area under the Precision-Recall curve, AUCROC is the area under the ROC curve,
both averaged over the five folds. E is EMBLEM, R is RIB, L is LeProbLog, C is CEM,
A is Alchemy.
Dataset    |              AUCPR                       |              AUCROC
           | E            R     L     C     A         | E            R     L     C     A
IMDB-SP    | 0.202(500r)  0.199 0.096 0.202 0.107     | 0.931(500r)  0.929 0.870 0.930 0.907
IMDBu-SP   | 0.175(40r)   0.166 0.134 0.120 0.020     | 0.900(40r)   0.897 0.921 0.885 0.494
IMDB-SM    | 1.000        me    0.933 0.537 0.820     | 1.000        me    0.983 0.709 0.925
IMDBu-SM   | 1.000        me    0.933 0.515 0.338     | 1.000        me    0.983 0.442 0.544
Cora       | 0.995(120r)  0.939 0.905 0.995 0.469     | 1.000(120r)  0.992 0.994 0.999 0.704
CoraT      | 0.991        no    0.970 me    me        | 0.999        no    0.998 me    me
UW-CSE     | 0.883        me    0.270 0.644 0.294     | 0.993        me    0.932 0.873 0.961
Table 2. Execution time in hours of the experiments on all datasets. R is RIB, L is
LeProbLog, C is CEM and A is Alchemy.
Dataset    |                 Time (h)
           | EMBLEM   R       L       C       A
IMDB-SP    | 0.01     0.016   0.35    0.01    1.54
IMDBu-SP   | 0.01     0.0098  0.23    0.012   1.54
IMDB-SM    | 0.00036  me      0.005   0.0051  0.0026
IMDBu-SM   | 3.22     me      0.0121  0.0467  0.0108
Cora       | 2.48     2.49    13.25   11.95   1.30
CoraT      | 0.38     no      4.61    me      me
UW-CSE     | 2.81     me      1.49    0.53    1.95
Table 3. Results of t-test on all datasets, relative to AUCPR and AUCROC. p is the
p-value of a paired two-tailed t-test (significant differences at the 5% level in bold)
between EMBLEM and all the others. R is RIB, L is LeProbLog, C is CEM, A is
Alchemy.
Dataset    |            p - AUCPR                     |            p - AUCROC
           | E-R    E-L        E-C     E-A            | E-R    E-L     E-C     E-A
IMDB-SP    | 0.2167 0.0126     0.3739  0.0134         | 0.3436 0.0012  0.3507  0.015
IMDBu-SP   | 0.1276 0.1995     0.001   4.5234e-5      | 0.2176 0.1402  0.0019  1.01e-5
IMDB-SM    | me     0.3739     0.0241  0.1790         | me     0.3739  0.018   0.2556
IMDBu-SM   | me     0.3739     0.2780  2.2270e-4      | me     0.3739  0.055   6.54e-4
Cora       | 0.011  0.0729     1       0.0068         | 0.0493 0.0686  0.4569  0.0327
CoraT      | no     0.0464     me      me             | no     0.053   me      me
UW-CSE     | me     1.5017e-4  0.0088  4.9921e-4      | me     0.0048  0.2911  0.0048
system to EMBLEM in terms of performance, and it is in addition always able to complete learning, but with longer times (except for IMDBu-SM and UW-CSE). Looking at the overall results, AUCPR and AUCROC are higher for EMBLEM than, or equal to, those of the other systems except on IMDBu-SP, where LeProbLog achieves a higher AUCROC that is not statistically significant. Differences
between EMBLEM and the other systems are statistically significant in 22 out
of 43 cases.
5 Related Work
Our work has close connection with various other works. [6] proposed an EM
algorithm for learning the parameters of Boolean random variables given observations of the values of a Boolean function over them, represented by a BDD.
EMBLEM is an application of that algorithm to probabilistic logic programs. Independently [14] also proposed an EM algorithm over BDD to learn parameters
for the CPT-L language. [5] presented the CoPrEM algorithm that performs
EM for the ProbLog language. We differ from this work in the construction of
BDDs: they build a BDD for an interpretation while we build it for single ground
atoms for the specified target predicate(s), the one(s) for which we are interested
in good predictions. Moreover CoPrEM treats missing nodes as if they were
there and updates the counts accordingly.
Other approaches for learning probabilistic logic programs employ constraint
techniques, or use EM, or adopt gradient descent. Among the approaches that
use EM, [7] first proposed to use it to induce parameters and the Structural
EM algorithm to induce ground LPADs structures. Their EM algorithm however
works on the underlying Bayesian network. RIB [10] performs parameter learning
using the information bottleneck approach, which is an extension of EM targeted
especially towards hidden variables. Among the works that use a gradient descent technique we mention LeProbLog [4], which tries to find the parameters of a ProbLog program that minimize the MSE of the query probability and uses BDDs to compute the gradient.
Alchemy [8] is a state of the art SRL system that offers various tools for inference, weight learning and structure learning of Markov Logic Networks (MLNs).
MLNs significantly differ from the languages under the distribution semantics
since they extend first-order logic by attaching weights to logical formulas, but
do not allow to exploit logic programming techniques.
6 Conclusions
We have proposed a technique which applies an EM algorithm to BDDs for
learning the parameters of Logic Programs with Annotated Disjunctions. It can
be applied to all languages that are based on the distribution semantics and
exploits the BDDs that are built during inference to efficiently compute the expectations for hidden variables. We executed the algorithm over the real datasets
IMDB, UW-CSE and Cora, and evaluated its performances - together with four
other systems - through the AUCPR. These results show that EMBLEM uses
less memory than RIB, CEM and Alchemy, allowing it to solve larger problems.
Moreover, its speed allows it to perform a high number of restarts, helping it escape local maxima. In the future we plan to extend EMBLEM to learning the structure of LPADs.
References
1. Bellodi, E., Riguzzi, F.: EM over binary decision diagrams for probabilistic logic
programs. Tech. Rep. CS-2011-01, ENDIF, Università di Ferrara (2011)
2. De Raedt, L., Demoen, B., Fierens, D., Gutmann, B., Janssens, G., Kimmig, A.,
Landwehr, N., Mantadelis, T., Meert, W., Rocha, R., Santos Costa, V., Thon, I.,
Vennekens, J.: Towards digesting the alphabet-soup of statistical relational learning. In: NIPS Workshop on Probabilistic Programming (2008)
3. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic prolog and
its application in link discovery. In: International Joint Conference on Artificial
Intelligence. pp. 2462–2467 (2007)
4. Gutmann, B., Kimmig, A., Kersting, K., Raedt, L.D.: Parameter learning in probabilistic databases: A least squares approach. In: European Conference on Machine
Learning. LNCS, vol. 5211, pp. 473–488. Springer (2008)
5. Gutmann, B., Thon, I., De Raedt, L.: Learning the parameters of probabilistic
logic programs from interpretations. Tech. Rep. CW 584, Department of Computer
Science, Katholieke Universiteit Leuven, Belgium (June 2010)
6. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the em algorithm
by bdds. Tech. Rep. TR08-0004, CS Dept., Tokyo Institute of Technology (2008)
7. Meert, W., Struyf, J., Blockeel, H.: Learning ground CP-Logic theories by leveraging Bayesian network learning techniques. Fund. Inf. 89(1), 131–160 (2008)
8. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107–
136 (2006)
9. Riguzzi, F.: Extended semantics and inference for the Independent Choice Logic.
Log. J. IGPL 17(6), 589–629 (2009)
10. Riguzzi, F., Mauro, N.D.: Applying the information bottleneck to statistical relational learning. Mach. Learn. (2011), to appear
11. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: International Conference on Logic Programming. pp. 715–729. MIT Press
(1995)
12. Singla, P., Domingos, P.: Discriminative training of Markov logic networks. In:
National Conference on Artificial Intelligence. pp. 868–873. AAAI Press/The MIT
Press (2005)
13. Singla, P., Domingos, P.: Entity resolution with Markov logic. In: International
Conference on Data Mining. pp. 572–582. IEEE Computer Society (2006)
14. Thon, I., Landwehr, N., Raedt, L.D.: A simple model for sequences of relational
state descriptions. In: European conference on Machine Learning. LNCS, vol. 5212,
pp. 506–521. Springer (2008)
15. Vennekens, J., Verbaeten, S., Bruynooghe, M.: Logic programs with annotated
disjunctions. In: International Conference on Logic Programming. LNCS, vol. 3131,
pp. 195–209. Springer (2004)
Clustering XML Documents by Structure:
a Hierarchical Approach
(Extended Abstract)
G. Costa, G. Manco, R. Ortale, and E. Ritacco
ICAR-CNR
Via Bucci 41c
87036 Rende (CS) - Italy
Abstract. A new parameter-free approach to clustering XML documents by structure is proposed. The idea is to consider various forms
of structural patterns occurring in the XML documents to form a hierarchy of nested clusters. At any level in the hierarchy, clusters explain how the XML documents can be grouped on the basis of common
structural patterns of the form considered at that level. The resulting
explanation is progressively refined at the subsequent level, where another type of structural patterns is used to divide the individual clusters
from the above level into subgroups, revealing meaningful and previously
uncaught structural differences. Each cluster in the hierarchy is summarized through a novel technique into a corresponding representative, that
provides a clear and differentiated understanding of the structural information within the cluster.
1 Introduction
The problem of clustering XML documents by structure has been extensively
investigated, with the consequent development of several approaches, such as [5,
7–10]. XML trees can share various forms of common structural components,
ranging from simple node/edge and pairwise tags [11], to more complex substructures such as groups of siblings, paths (either root-to-node [11] or root-to-leaf [7]), as well as subtrees or even summaries [9, 10]. Therefore, if the addressed
form of structural patterns does not accord with the underlying properties of
XML data, valuable relationships of structural resemblance between the XML
documents can be missed, with a consequent degrade of clustering effectiveness.
Moreover, judging differences only in terms of one type of structural components
may not suffice to effectively separate the available XML documents.
This paper proposes a new hierarchical approach to clustering that considers
various forms of structural patterns in the XML documents to progressively derive a hierarchy of nested clusters. In addition, the characterization of each cluster is accomplished by means of a new summarization method, aimed at subsuming
the structural properties within each cluster in terms of strongly representative
substructures.
2 Partitioning XML Trees
We introduce the notation used throughout the paper as well as some basic
concepts. The structure of XML documents without references can be modeled in
terms of rooted ordered labeled trees, that represent the hierarchical relationships
among the document elements (i.e., nodes).
Definition 1. XML Tree. An XML tree is a rooted, labeled, ordered tree, represented as a tuple t = (rt , Vt , Et , λt ), whose individual components have the
following meaning. Vt is a set of nodes and rt ∈ Vt is the root node of t, i.e. the
only node with no entering edges. Et ⊆ Vt × Vt is a set of edges, catching the
parent-child relationships between nodes of t. Finally, λt : Vt 7→ Σ is a node
labeling function and Σ is an alphabet of node tags (i.e., labels).
Parent-child relationship in t is denoted by ni ≺ nj , where ni , nj ∈ Vt such
that ∃(ni , nj ) ∈ Et , ni is the parent while nj is the child. Ancestor-descendant
relationship is indicated as ni ≺p nj , where p is the distance (in nodes) between
the ancestor and the descendant (ni ≺1 nj is equivalent to ni ≺ nj ).
Tree-like structures are also used to represent generic structural patterns
occurring across a collection of XML trees.
Definition 2. Substructure. Let t and s be two XML trees. s is said to be a
substructure of t, if there exists a total function ϕ : Vs → Vt , that satisfies the
following conditions for each n, ni , nj ∈ Vs . Firstly, (ni , nj ) ∈ Es iff ϕ(ni ) ≺p
ϕ(nj ) in t with p ≥ 1. Secondly, λs (n) = λt [ϕ(n)].
The mapping ϕ preserves node labels and hierarchical relationships. In this
latter regard, depending on the value of p, two definitions of substructures can
be distinguished. In the simplest case p = 1 and a substructure s is simply an
induced tree pattern that matches a contiguous portion of t; this is indicated as s ⊑ t. When p ≥ 1 [6, 12], s matches not necessarily contiguous portions of t; this is denoted as s ⊆ t and s is also said to be an embedded tree pattern of t.
Our clustering is based on structural similarity. Two documents are similar if they share some elements, which can be nodes, edges (parent-child relationships), paths (ancestor-descendant relationships), etc. For this reason we choose to cluster the documents in a multi-stage way, considering one by one a set of elements belonging to a specific feature space (nodes, edges, paths, etc.).
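As an illustration, such feature spaces can be extracted from an XML document with the standard xml.etree.ElementTree module (a sketch; the actual representation used by the system may differ):

```python
import xml.etree.ElementTree as ET

def feature_spaces(xml_string):
    """Extract three feature spaces from an XML document:
    node labels, parent-child edges, and root-to-leaf paths."""
    root = ET.fromstring(xml_string)
    nodes, edges, paths = set(), set(), set()

    def visit(elem, path):
        path = path + (elem.tag,)
        nodes.add(elem.tag)
        children = list(elem)
        if not children:
            paths.add(path)               # root-to-leaf path
        for child in children:
            edges.add((elem.tag, child.tag))
            visit(child, path)

    visit(root, ())
    return nodes, edges, paths

# Example on a tiny document:
n, e, p = feature_spaces("<a><b><c/></b><d/></a>")
# n = {'a','b','c','d'}; e = {('a','b'),('b','c'),('a','d')}; p = {('a','b','c'),('a','d')}
```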
At a given stage i, finding clusters in the high-dimensional feature space S (i)
is a challenging issue for various reasons [3]. The XML trees are partitioned by
the AT-DC algorithm [3], which is an effective hierarchical and parameter-free
method for transactional clustering.
The main clustering procedure is reported in fig. 1. It consists of m stages
of clustering (line 1). The end user incorporates (at line 1) valuable domain
knowledge and application semantics into the clustering process, by establishing
the most appropriate set of structural features S (i) for each stage as well as the
overall number m of stages.
Generate-Hierarchy(D)
Input: a set D = {t1, . . . , tN} of XML trees;
Output: a set ∪i P(i) of multiple cluster partitions;
 1: let S(i) be the set of features at stage i, with i = 1, . . . , m;
 2: let i ← 1;
 3: let P ← {D};
 4: while i ≤ m do
 5:   while P ≠ ∅ do
 6:     let C be a cluster in P;
 7:     P ← P − C;
 8:     R ← Generate-Clusters(C, S(i));
 9:     for each C′ ∈ R do
10:       let C̄′ ← R − {C′} be the set of siblings of C′;
11:     end for
12:     P(i) ← P(i) ∪ R;
13:   end while
14:   for each C ∈ P(i) do
15:     Rep(C) ← MineRep(C, C̄, α);
16:   end for
17:   P ← P(i);
18:   i ← i + 1;
19: end while
20: RETURN ∪i P(i);
Fig. 1. The hierarchical clustering process
The generic stage i (lines 4-19) consists of two phases: cluster separation
and summarization. Cluster separation exploits AT-DC to divide the individual
clusters belonging to the current partition P with respect to the feature space
S (i) (lines 5 - 13). At the beginning, i.e. when i = 1, the current partition
P includes a single cluster, which coincides with the whole dataset D of XML
trees (line 3). The partition P (i) resulting at the end of stage i (line 13) is itself a
collection of partitions. More precisely, at the current stage i, each parent cluster
C from P (i−1) is divided into an appropriate number of child clusters (line 8),
which together form the partition R of the aforesaid C. At this point, each child cluster C′ in R is associated (lines 9-11) with its siblings C̄′ = R − {C′} (for the cluster summarization purpose) and R is then added to the ongoing P(i).
Cluster summarization (lines 14-16) is applied to each cluster C from the
obtained P (i) . It consists of a procedure, discussed in section 3, which associates
C with a set Rep(C) of representative substructures, that subsume the structural
information within C. P (i) becomes (at line 17) the current partition P for the
subsequent stage i + 1. At this stage, AT-DC is re-applied to further divide every
cluster C ∈ P (i) with respect to another set of structural features, i.e., S (i+1) .
The choice of a distinct feature space at each stage guarantees a progressively increasing degree of structural homogeneity. Moreover, at each distinct
stage, representatives provide a summarization of the tree structures within the
corresponding clusters in terms of (a combination of) the structural features
considered at that particular stage. Hence, the representative of a subcluster
highlights local patterns of structural homogeneity, that are not caught by the
representative of the parent cluster.
3 Cluster Summarization
The representative of a cluster of XML trees is modeled as a set of highly representative tree patterns, which provide an intelligible summarization of the most
relevant structural properties in the cluster. Notice that, as mentioned before, a
cluster is already characterized by a set of relevant features. However, features
can be combined further, and they do not necessarily allow to distinguish among
different clusters.
A set of tree patterns is actually viewed as the representative of a cluster of
XML trees if its frequency in the cluster is much higher than elsewhere, and it exhibits a strong degree of correlation with the documents of the cluster.
A representative can be computed by merging patterns. To avoid combinatorial explosion, we consider only two types of tree pattern composition, namely parent-child and sibling tree pattern composition.
Definition 3. Parent-child tree pattern. A parent-child tree pattern is an
arrangement of two basic tree patterns, in which one of the two tree patterns is
rooted at some leaf node of the other tree pattern. Let si and sj be two generic
tree patterns. Also, assume that l is some leaf node of si . The operator si /l sj
defines a new parent-child tree pattern s, such that |Vs | = |Vsi | + |Vsj | and
|Es | = |Esi | + |Esj | + 1, wherein the root rsj of sj is a child of l.
Given any two tree patterns si and sj , the set of all possible parent-child tree
patterns in which the root of sj is a child of the individual leaves of si is denoted
as
$$s_i \,/\, s_j = \bigcup_{l\in L_{s_i}} \{s_i \,/_l\, s_j\}$$
where Lsi represents the set of leaves of si .
A parent-child tree pattern is a vertical arrangement of two component tree
patterns. Instead, a sibling tree pattern follows from an horizontal arrangement
of its components.
Definition 4. Sibling tree pattern. Given two tree patterns with the same label
at their roots, a sibling tree pattern is a composite structure, whose root-to-leaf
paths are the union of the root-to-leaf paths in the two component patterns. Let
si and sj be two tree patterns such that λsi (rsi ) = λsj (rsj ).
The Representative Discovery Procedure is an Apriori-based technique whose
candidate generation phase is performed through these compositions.
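The two compositions can be sketched as follows over tree patterns encoded as (label, children) pairs (an illustrative encoding, not the paper's data structures; the sibling composition below simply concatenates the children lists rather than merging common root-to-leaf paths):

```python
# A tree pattern is encoded as (label, [children]).
def parent_child(si, sj):
    """All parent-child compositions si /_l sj: attach sj under each leaf l of si."""
    def leaves(node, path=()):
        label, children = node
        if not children:
            yield path                      # path of child indices from the root
        for i, c in enumerate(children):
            yield from leaves(c, path + (i,))
    def attach(node, leaf_path):
        label, children = node
        if not leaf_path:
            return (label, children + [sj])
        i, rest = leaf_path[0], leaf_path[1:]
        return (label, [attach(c, rest) if k == i else c for k, c in enumerate(children)])
    return [attach(si, p) for p in leaves(si)]

def sibling(si, sj):
    """Sibling composition: defined only when the two roots share the same label."""
    (li, ci), (lj, cj) = si, sj
    if li != lj:
        return None
    return (li, ci + cj)

# parent_child(('a', [('b', []), ('c', [])]), ('d', []))
#   -> [('a', [('b', [('d', [])]), ('c', [])]), ('a', [('b', []), ('c', [('d', [])])])]
```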
4 Evaluation
The behavior of the devised clustering approach is now investigated through an
empirical evaluation with three objectives: the assessment of clustering quality,
the evaluation of cluster-summarization and a performance comparison.
All experiments were conducted on a Windows machine, with an Intel Itanium processor, 2Gb of memory and 2Ghz of clock speed. Standard benchmark
data sets were employed for a direct comparison against the competitors. Real-world data, named Real, encompasses the following collections: Astronomy (217
documents), Forum (264 messages), News (64 documents), Sigmod (51 documents), Wrapper (53 documents). The distribution of tags within the above documents is quite heterogeneous, due to the complexity of the DTDs associated
with the classes, and to the documents’ semantics. Three further synthetic data
sets were generated from as many collections of DTDs reported in [5]. The first
synthesized data set, referred to as Synth1, comprises 1000 XML documents produced from a collection of 10 heterogeneous DTDs (illustrated in fig. 6 of [5]),
that were individually used to generate 100 XML documents. These DTDs exhibit strong structural differences and, hence, most clustering algorithms can
produce high-quality results. A finer evaluation can be obtained by investigating
the behavior of the compared algorithms on a collection of XML documents,
that are very similar to one another from a structural point of view. To perform
such a test, a second synthesized data set, referred to as Synth2 and consisting
of 3000 XML documents, was assembled from 3 homogeneous DTDs (illustrated
in fig. 7 of [5]), individually used to generate 1000 XML documents. Experiments over Synth2 clearly highlight the ability of the competitors to operate in extremely challenging applicative settings, wherein the XML documents share
multiple forms of structural patterns. Additionally, Synth3 is a collection consisting of the synthesized documents in [7], which exhibit a 30% degree of overlap.
Synth3 allows us to compare the effectiveness of the devised approach to the approach proposed in [7]. Clustering effectiveness is evaluated over each partition
Pi = {C1 , . . . , Ck } and it is measured in terms of average precision and recall [1].
Table 1 shows the results of clustering on such collections. As we can see,
precision and recall are optimal, even for the collection Synth2 of homogeneous
documents.
Collection  N. of Docs  Classes  Clusters  Avg Precision  Avg Recall  Avg Γ    Time
Real        649         5        5         1              1           0.9558   20.48s
Synth1      1000        10       10        1              1           0.9455   13.32s
Synth2      3000        3        3         1              1           0.3833   7.5s
Synth3      1400        7        7         1              1           0.7875   2.68s
Synth4      800         8        10        1              0.8         0.7127   3.68s
Table 1. Evaluation of separability and homogeneity
To investigate the effectiveness of the Generate-Hierarchy procedure more deeply, we produced a new data set, Synth4, which requires multi-layer clustering over all the features we consider. It is composed of 800 documents whose schema is
shown in fig. 2.
The DTDs capture substantial similarities and differences. In particular, all
datasets exhibit different paths (but they can share some edges). The documents
in DTD4 can be further split, since they can exhibit trees with paths ending
Fig. 2. DTDs for the Synth3 dataset
in the node A6. Also, node frequencies in DTD4 can substantially differ, thus
differentiating this DTD from the others even at a node level. This situation
is fully captured by the clustering algorithm, as shown in fig. 3(a). DTD4 is
separated from DTD3 at the node level and further split into two subclusters at the
edge level, according to whether or not the edge (A9, A6) is present. Also, the
trees containing such an edge can be further split according to whether or not
they contain the path from A10 to A6. Notice that, on the contrary, DTD8 does
not behave similarly, since there is no node, like A6, that differentiates the
trees in the class.
Fig. 3. (a) Cluster hierarchy for Synth4; (b) cluster hierarchy for Sigmod
The evaluation of the multi-stage clustering is further confirmed by experimenting on Sigmod. As already mentioned, this dataset consists of documents
complying with three different DTDs. In particular, the distribution of the documents is unbalanced, since one of the DTDs, named IndexTermsPage, contains
many more documents than the other ones. Figure 3(b) shows that Generate-Hierarchy separates all documents complying with different DTDs and further splits the documents in the class related to IndexTermsPage, according
to whether or not these documents contain the optional elements described in
the DTD (mainly, categoryAndSubjectDescriptorsTuple, category, content
and term). In particular, the separation of such a class leads to two subclasses C1
and C2, which can be described by two DTDs, both subsumed by IndexTermsPage.
The difference between C1 and C2 is the absence (in C1) and the presence (in C2)
of the optional elements of IndexTermsPage.
The evaluation of the accuracy of cluster summarization is inspired by an idea
originally proposed in [5] for a different purpose, i.e., measuring the structural
homogeneity of a set of intermediate clusters obtained while partitioning a collection of XML documents. Let t be an XML tree and R a set of substructures.
The representativeness γ(R, t) of R with respect to t is the fraction of nodes in
t matched by the embedded substructures of R:

    γ(R, t) = | ∪_{s∈R, s⊆t} {n ∈ V : V_s ↦ V ⊆ V_t} | / |V_t|
where V_t and V_s are the sets of nodes of, respectively, the XML tree t and
the generic substructure s. V is instead the subset of the nodes in t matched by
the nodes of s (which is the meaning of the notation V_s ↦ V). Representativeness
can be easily generalized to clusters: the representativeness Γ[Rep(C)] of the
representative Rep(C) with respect to a cluster C can be defined as its average
representativeness over the documents in the cluster.
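As a concrete illustration of these definitions, the following minimal Python sketch (ours, not part of the original formulation) computes γ(R, t) and Γ[Rep(C)] under the simplifying assumption that each tree is given as a set of node identifiers and that a hypothetical helper embeddings(s, tree_nodes) returns the node identifiers of t covered by an embedding of substructure s.

    def representativeness(rep, tree_nodes, embeddings):
        """gamma(R, t): fraction of the nodes of tree t covered by the
        substructures in the representative R.

        rep        -- iterable of substructures (Rep(C))
        tree_nodes -- set of node identifiers of the XML tree t (V_t)
        embeddings -- hypothetical helper: embeddings(s, tree_nodes) returns
                      the subset of tree_nodes matched by an embedding of s
                      (the empty set if s is not embedded in t)
        """
        covered = set()
        for s in rep:
            covered |= embeddings(s, tree_nodes)
        return len(covered) / len(tree_nodes)

    def cluster_representativeness(rep, cluster_trees, embeddings):
        # Gamma[Rep(C)]: average representativeness over the documents of C
        return sum(representativeness(rep, t, embeddings)
                   for t in cluster_trees) / len(cluster_trees)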
A connection between cluster representativeness and structural homogeneity
explains unexpectedly low Γ values. Representativeness is high when the representative frequently occurs in a cluster but not in the other ones. In the case
of homogeneous documents the erasure of candidate structures is very frequent:
the only structures that survive are very specific, and hence infrequent in the
cluster (as happens, for example, in Synth2).
Table 1 shows the average Γ value exhibited in each experiment. In order
to evaluate the scalability of the algorithm, we used the DTDs for Synth1 and
produced, respectively, 100, 1000, 10,000 and 100,000 documents with 2, 4, and
8 clusters. The results are shown in fig. 4. As we can see, the algorithm is linear
both in the number of documents and in the number of clusters.
Fig. 4. Performance in milliseconds for data sets of different size
At the end of this intensive empirical evaluation, the devised approach can be
compared against a state-of-the-art competitor, namely the XProj approach [5].
By looking at the performance of XProj reported in [5], it can be observed that
our clustering approach attains the same quality. However, two strong advantages of the proposed approach are: the development of a hierarchy of nested
clusters, explaining multiple forms of structural relationships in the data, and the
summarization of a cluster of XML documents, which provides an intelligible
subsumption of its structural properties. Also, notice that the scalability of our
approach is orders of magnitude higher than that of XProj. Finally, the
devised approach is fully automatic (i.e., parameter-free), whereas the optimal
performance of XProj, on each data set, is the consequence of a complex
parameter-setting process.
5 Conclusions
A new approach to the clustering of XML documents was proposed, which produces
a hierarchy of nested clusters. Along the paths from the root to the leaves of
the hierarchy, the approach progressively separates the XML data by looking
at the occurrence of different types of structural patterns in their structures.
Also, each cluster in the hierarchy is subsumed, through a novel summarization
method, by a set of representative substructures that provide an understanding
of the structural properties considered in the cluster. A comparative evaluation
proved that the devised approach is on a par with, and even better than, established
competitors in terms of effectiveness, scalability and cluster summarization.
References
1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
2. R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In Procs. VLDB'01 Conf., pages 119–128, 2001.
3. E. Cesario, G. Manco, and R. Ortale. Top-down parameter-free clustering of high-dimensional categorical data. IEEE TKDE, 19(12):1607–1624, 2007.
4. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In Procs. VLDB'01 Conf., pages 109–118, 2001.
5. C. C. Aggarwal et al. XProj: A framework for projected structural clustering of XML documents. In Procs. SIGKDD'07 Conf., pages 46–55, 2007.
6. C. Wang et al. Efficient pattern-growth methods for frequent tree pattern mining. In Procs. PAKDD'04 Conf., pages 441–451, 2004.
7. G. Costa et al. A tree-based approach to clustering XML documents by structure. In Procs. PKDD'04 Conf., pages 137–148, 2004.
8. M. L. Lee et al. XClust: Clustering XML schemas for effective integration. In Procs. CIKM'02 Conf., pages 292–299, 2002.
9. T. Dalamagas et al. A methodology for clustering XML documents by structure. Information Systems, 31(3):187–228, 2006.
10. W. Lian et al. An efficient and scalable algorithm for clustering XML documents by structure. IEEE TKDE, 16(1):82–96, 2004.
11. S. Helmer. Measuring the structural similarity of semistructured documents using entropy. In Procs. VLDB'07 Conf., pages 1022–1032, 2007.
12. M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE TKDE, 17(8):1021–1035, 2005.
Outlier Detection For XML Documents
(Extended Abstract)
Giuseppe Manco and Elio Masciari
ICAR-CNR
{manco,masciari}@icar.cnr.it
Abstract. XML (eXtensible Markup Language) has become in recent years
the new standard for data representation and exchange on the WWW.
This has resulted in a great need for data cleaning techniques that can
identify outlying data. In this paper, we present a technique for outlier
detection that singles out anomalies with respect to a relevant group of
objects. We exploit a suitable encoding of XML documents as fixed-frequency
signals that can be transformed using the Fourier Transform.
Outliers are identified by simply looking at the signal spectra.
The results show the effectiveness of our approach.
1 Introduction
An outlier is an observation that differs so much from other observations as to
arouse suspicion that it was generated by a different mechanism [8]. There exist
several approaches to the identification of outliers, namely, statistical-based [5],
deviation-based [4], distance-based [3], density-based [6], projection-based [1],
MDEF-based [12], and others. Abstracting from the specific method being exploited, the general outlier detection task is the problem of identifying deviations
from the general patterns characterizing a data set. Detecting outliers is important in many application scenarios: as an example, it can be used for improving
data cleaning approaches, where outliers are often data noise or errors diminishing the accuracy of data mining. Outlier detection is also the core of applications
such as fraud detection, stock market analysis, intrusion detection, marketing,
network sensors, and email spam detection, where irregular patterns entail special attention. Due to the increasing usage of semi-structured data models like
XML (eXtensible Markup Language), which is the new standard for data representation and exchange on the Web, there is a great need for outlier detection
strategies tailored to such data. Although outlier detection methods are well
established for relational data, adapting them directly to XML data is unfeasible because the XML and relational data models differ in several aspects. First,
XML data contain multiple levels of nested elements (or attributes) organized in
a tree-based structure, whereas relational data models have a flat tabular structure. Indeed, the hierarchical structure of XML data induces an ordering lacking
in relational data. Also, the modeling objectives for XML and relational data are
different, and therefore different relationships are represented. In relational data
models, the primary-foreign key relationships between entities form the basis for
data normalization and referential integrity. On the contrary, relationships between the XML elements are encoded in hierarchies, often with direct semantic
correspondence to real-world relations such as containment and composition.
Despite its importance, XML outlier detection has not been paid the attention it deserves. There exist few works addressing structural and attribute
outlier detection for XML. The main distinction is between class outliers and attribute outliers, i.e., respectively, outliers based on the overall structure of the document and outliers based on univariate points that exhibit deviating correlation
behavior with respect to other attributes [10]. In [10] an approach is presented
for correlation-based attribute outlier detection, while the main approaches for class
outlier detection try to adapt the techniques defined for the relational setting
to the semistructured one. They have been mainly proposed for data cleaning
purposes, as in [15, 14, 16].
Our approach. In this paper we tackle the class outlier detection problem. The basic intuition exploited in this paper is that an XML document has
a “natural” interpretation as a time series (namely, a discrete-time signal), in
which numeric values summarize some relevant features of the elements enclosed
within the document. We can get evidence of this observation by
simply indenting all the tags in a given document according to their nesting level.
Indeed, the sequence of indentation marks (as they appear within the document
rotated by 90 degrees) can be looked at as a time series, whose shape roughly describes the document's structure. Hence, a key tool in the analysis of time-series
data is the use of the Discrete Fourier Transform (DFT): some useful properties
of the Fourier transform, such as energy concentration or invariance under shifts,
make it possible to analyze and manipulate signals in a very powerful way. The choice of
comparing the frequency spectra is driven by both effectiveness and efficiency considerations.
Indeed, the exploitation of the DFT makes it possible to abstract from structural details which,
in most application contexts, should not affect the similarity estimation (such as,
e.g., different numbers of occurrences of a shared element or small shifts in the
actual positions where it appears). This eventually makes the comparison less
sensitive to minor mismatches. Moreover, a frequency-based approach allows the
similarity to be estimated through simple measures (e.g., vector distances) which
are computationally less expensive than techniques based on the direct comparison of the original document structures. To summarize, we propose to represent
the structure of an XML document as a time series, in which each occurrence of
a tag in a given context corresponds to an impulse. By analyzing the frequency
spectra of the signals, we can hence state the degree of (structural) similarity
between documents. It is worth noticing that the overall cost of the proposed
approach is only O(N log N), where N is the maximum number of tags in the
documents to be compared. Once an effective distance measure has been defined, we
exploit a distance-based outlier detection algorithm in order to single out the
outlying documents.
2 Problem Statement and Overview of the Proposal
We begin by presenting the basic notation for XML documents that will be
used hereafter. An XML document is characterized by tags, i.e., terms enclosed
between angled brackets. Tags define the structure of an XML document and
provide the semantics of the information enclosed. A tag is associated with a
tag name (representing the term enclosed between angled brackets), and can
appear in two forms: either as a start-tag (e.g., <author>), or as an end-tag
(characterized by the / symbol, like, e.g., in </author>). Finally, a tag instance
denotes the occurrence of a tag within a certain document. It is required that,
in a well-formed XML document, tags are properly nested, i.e., each start-tag
has a corresponding end-tag at the same level. Therefore, an XML document
can be considered as an ordered tree, where each node (an element) represents a
portion of the document, delimited by a pair of start-tag and end-tag instances,
and denoted by the tag name associated with the instances. The structure of
an XML document corresponds to the shape of the associated tree. In a tree,
several types of structural information can be detected, which correspond to
different refinement levels: for example, attribute/element labels, edges, paths,
subtrees, etc. Defining the similarity between two documents essentially means
choosing an appropriate refinement level and comparing the documents according to the features they exhibit at the chosen level. Different choices may result
in rather dissimilar behaviors: in particular, comparing simple structural components (such as, e.g., labels or edges) allows for an efficient computation of
the similarity, but typically produces coarse-grained similarity values. On the other
hand, complex structural components would make the computation of similarity
inefficient, and hence impractical.
Consider, for example, the documents represented in Fig. 1. If a comparison
of nodes or edges is exploited, documents book1 and book2 appear to be similar, even though the subtrees rooted at the book element appear with different
frequencies. Accounting for frequencies does not always help: for example, if the
order of appearance of the subtrees of the xml element in book3 were changed,
the resulting tree would still have the same number of nodes, edges and even
paths.
In principle, approaches based on tree-edit distance [11] can better quantify
the difference between XML trees; however, they turn out to be too expensive
in many application contexts, as they are generally quadratic w.r.t. document
sizes. Finally, notice that solutions based on detecting local substructures [9]
to be used as features may be even harder to handle, since they exhibit two main
disadvantages: first, they may imply ineffective representations of the trees in
high-dimensional spaces, and second, costly feature extraction algorithms are
required.
Fig. 1. book1 and book2 have the same elements, but with different cardinality. By
contrast, book3 induces a different structure for the author element.
In our opinion, an effective notion of structural similarity should take into
account a number of issues. First of all, it is important to notice that each
document may induce a definition of the elements involved. Thus, an appropriate comparison between two documents should rely on the comparison of such
definitions: the more different they are, the more dissimilar the documents are,
and this information has to be exploited for signaling candidate outliers. Our
main objective is the development of an efficient method which is able to approximate the above features at best. Thus, we can state the problem of finding
the structural similarity in a set of XML documents as follows. Given a set
D = {d1, ..., dn} of XML documents, we aim at building a similarity matrix S,
i.e., a matrix representing, for each pair (di, dj) of documents in D, an optimal
measure of similarity sij. Here, optimality refers to the capability of reflecting
the above described differences. Observe that we do not address here the problem of finding which parts of two documents are similar or not, as, e.g., tree-edit
based techniques do. We propose a technique which is essentially based on the
idea of associating each document with a time series representing, in a suitable
way, both its basic elements and their relationships within the document. More
precisely, we can assume a preorder visit of the tree structure of an XML document. As soon as we visit a node of the tree, we emit an impulse encoding the
information corresponding to the tag. The resulting signal represents the
original document as a time series, from which relevant features characterizing a
document can be efficiently extracted. As a consequence, the comparison of two
documents can be accomplished by looking at their corresponding signals.
The main features of the approach can be summarized as follows: 1) each
element is encoded as a real value, hence the differences in the values of the
sequence provide an evaluation of the differences in the elements contained
in the documents; 2) the substructures in the documents are encoded using
different signal shapes, so the analysis of the shapes in the sequences realizes
the comparison of the definitions of the elements; 3) context
information can be used to encode both basic elements and substructures, so
that the analysis can be tuned to handle in a different way mismatches which
occur at different hierarchical levels.
In a sense, the analysis of the way the signal shapes differ can be interpreted
as the detection of different definitions for the elements involved in the documents. Moreover, the analysis of the frequencies of common signal shapes can be
seen as the detection of the differences between the occurrences associated with
a repetition marker. In this context, the proposed approach can be seen as an
efficient technique, which can satisfactorily evaluate how similar two documents
are w.r.t. the structural features previously discussed. Notably, the use of
time series for representing complex XML structures, combined with an efficient
frequency-based distance function, is the key to quickly evaluating structural
similarities: if N is the maximum number of tags in two documents, they can be
compared in only O(N log N) time. In particular, the use of the DFT supports the
above described notion of similarity: if two documents share many elements having a similar definition, they will be recognized as similar, even when there are
repeated and/or optional sub-elements. Indeed, working on frequency spectra
makes the comparison less sensitive to differences in the number of occurrences of a given element and to small shifts in the actual positions where it
occurs in the documents. The details on representing an XML document as a
signal are omitted here due to space limitations; a complete explanation can be
found in [7].
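To make the intuition concrete, the following Python sketch shows one possible, much simplified, encoding: a preorder visit of the document emits one impulse per tag occurrence, with the impulse value given by a plain enumeration of tag names. This is only a stand-in for the encoding function enc of [7]; the function names are ours.

    import xml.etree.ElementTree as ET

    def tag_values(root):
        # assign each distinct tag name a real value; a simple enumeration is
        # used here as a placeholder for the richer encoding described in [7]
        names = sorted({el.tag for el in root.iter()})
        return {name: float(i + 1) for i, name in enumerate(names)}

    def encode(root, values=None):
        # preorder visit of the tree: one impulse per tag occurrence
        values = values if values is not None else tag_values(root)
        return [values[el.tag] for el in root.iter()]

    doc = ET.fromstring(
        "<xml><book><title/><author/></book><book><title/></book></xml>")
    signal = encode(doc)   # e.g. [4.0, 2.0, 3.0, 1.0, 2.0, 3.0]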
3 Comparing Documents using DFT
Having defined a proper document encoding, we can now detail the similarity
measures for XML documents sketched in Section 1. As already mentioned, we
can assume that we are visiting the tree structure of an XML document d (using
a preorder visit) starting from an initial time t0. We also assume that each
tag instance occurs after a fixed time interval ∆. The total time spent to visit
the document is N∆, where N is the size of tags(d). During the visit, as we
find a tag, we produce an impulse which depends on a particular tag encoding
function e and on the overall structure of the document (i.e., the document
encoding function enc). As a result of the above physical simulation, the visit of
the document produces a signal h_d(t), which usually changes its intensity in the
time interval [t0, t0 + N∆). The intensity variations are directly related to the
opening/closure of tags:

    h_d(t) = [enc(d)](k)   if t0 + k∆ ≤ t < t0 + (k + 1)∆
    h_d(t) = 0             if t < t0 or t ≥ t0 + N∆
Comparing such signals, however, might be as difficult as comparing the original documents. Indeed, comparing documents having different lengths requires
the combination of both resizing and alignment operations. Moreover, the intensity of a signal strongly depends on the encoding scheme adopted, which can
in turn depend on the context (as in the case, e.g., of the multilevel encoding
scheme).
In order to compare two documents di and dj, hence, we can exploit the
properties of the corresponding transforms. In particular [2, 13], a possibility is
to exploit the fact that, by Parseval's theorem, energy (total power) is invariant under
the transformation (and hence the information provided by the encoding remains
unchanged in the transform). However, a more effective discrimination can exploit the difference in the magnitude of the frequency components: in a sense, we are
interested (i) in abstracting from the length of the document, and (ii) in knowing whether a given subsequence (representing a subtree in the XML document)
exhibits a certain regularity, no matter where the subsequence is located within
the signal. In particular, we aim at considering as (almost) similar documents
exhibiting the same subtrees, even if they appear at different positions. Now,
as the encoding guarantees that each relevant subsequence is associated with a
group of frequency components, the comparison of their magnitudes allows the
detection of similarities and differences between documents. Observe that measuring the energy of the difference signal would result in a low similarity value.
On the other hand, if the phases of the documents' transforms are disregarded,
documents are more likely to be considered as similar.
A viable approximation can be the interpolation of the missing coefficients
starting from the available ones. It is worth noticing that the approximation
error due to interpolation is inversely proportional to min(N_{d1}, N_{d2}): the more
elements are available in a document d, the better the DFT approximates the
(continuous) Fourier Transform of the signal h_d(t), and consequently the higher
is the degree of reliability of the interpolation. As a practical consequence, the approach is expected to exhibit good results with large documents, while providing poorer
performance with small documents.
Definition 1. Let d1, d2 be two XML documents, and enc a document encoding
function, such that h1 = enc(d1) and h2 = enc(d2). Let DFT be the Discrete
Fourier Transform of the (normalized) signals. We define the Discrete Fourier
Transform distance of the documents as the approximation of the difference of
the magnitudes of the DFTs of the two encoded documents:

    dist(d1, d2) = [ Σ_{k=1}^{M/2} ( |[DFT~(h1)](k)| − |[DFT~(h2)](k)| )^2 ]^{1/2}

where DFT~ is an interpolation of DFT to the frequencies appearing in both d1
and d2, and M is the total number of points appearing in the interpolation, i.e.,
M = N_{d1} + N_{d2} − 1 points.
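A minimal NumPy sketch of this distance is given below; zero-padding both signals to a common length M is used as a simple stand-in for the interpolation of Definition 1, and the function names are ours.

    import numpy as np

    def dft_magnitudes(signal, n_points):
        # magnitude spectrum of the signal, zero-padded to n_points
        return np.abs(np.fft.fft(np.asarray(signal, dtype=float), n=n_points))

    def dft_distance(h1, h2):
        m = len(h1) + len(h2) - 1          # M = N_d1 + N_d2 - 1
        m1 = dft_magnitudes(h1, m)
        m2 = dft_magnitudes(h2, m)
        half = m // 2                      # compare components k = 1 .. M/2
        diff = m1[1:half + 1] - m2[1:half + 1]
        return float(np.sqrt(np.sum(diff ** 2)))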
3.1 Outlier Identification
Having defined a technique to state the similarity between two XML documents, we
need to define a strategy that, exploiting such a technique, identifies anomalies
in the data set.
Definition 2 (Fourier Based XML Outlier). Given a set of XML documents S, a positive integer k, and a positive real number R, a document d ∈ S
is a DB(k, R)-outlier, or a distance-based outlier with respect to parameters k
and R, if fewer than k objects in S lie within distance R from d w.r.t. our distance
metric.
The threshold values R and k have to be chosen depending on the scenario
being monitored. Having defined our notion of outlier, we can design an effective
method for outlier detection.
Algorithm 1 Function Compute_Outlier
INPUT: a set of XML documents S = {d1, ..., dn}, a pair of threshold values R and k, an XML document dnew;
OUTPUT: Yes if dnew is an outlier, No otherwise;
begin
  temp = 0
  for each di ∈ S do
    dist = computeDFTDistance(di, dnew)
    if dist > R then
      temp = temp + 1
    if temp > k then return Yes
  return No
end
The function computeDFTDistance evaluates the DFT distance between the
XML document being analyzed and the XML documents previously collected.
If the computed distance is greater than the threshold distance set by the user,
we increase an auxiliary counter temp. If temp exceeds the threshold
value k, the document is marked as outlying.
Proposition 1. Algorithm 1 works in time O(|S| · N log(N)).
The running time can be trivially computed by observing that, for each document being analyzed, we have to compute the Fourier Transform, and this
operation is performed in O(N log(N)) time, which is the dominant operation of
the algorithm.
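A compact Python rendering of Algorithm 1 is sketched below; distance can be any document distance, e.g. the dft_distance sketch given earlier, and the helper names are ours.

    def is_outlier(d_new, collection, distance, R, k):
        # d_new is flagged as soon as more than k documents of the collection
        # lie farther than R from it, mirroring the early exit of Algorithm 1
        far = 0
        for d in collection:
            if distance(d, d_new) > R:
                far += 1
                if far > k:
                    return True
        return False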
4 Experimental Results
In this section, we present the experiments we performed to assess the effectiveness of the proposed approach in detecting outliers. To this purpose, a collection
of tests is performed, and for each test some relevant groups of homogeneous
documents (document classes) are considered. The direct result of each test is a
similarity matrix representing the degree of similarity for each pair of documents
in the data set, together with the number of detected outliers. The evaluation of the results
relies on some a priori knowledge about the document classes being used,
obtained from domain experts or available from the dataset providers. We
performed several experiments on a wide variety of real datasets. Due to space
limitations, we report here the results on the following data.
The documents used belong to two main classes: 1) Astronomy, a data set containing 217 documents extracted from an XML-based metadata repository that
describes an archive of publications owned by the Astronomical Data Center at
NASA/GSFC (http://adc.gsfc.nasa.gov/); 2) Sigmod, a data set composed
of 51 XML documents containing issues of SIGMOD Record. Such documents
were obtained from the XML version of the ACM SIGMOD Web site.
For each class we added some outlying documents by perturbing the original DTDs. We compared our approach with the one proposed in [16], which we refer
to as Noise. In order to perform a simple quantitative analysis, we produce for
each test a similarity matrix, aimed at evaluating the resulting neighbor similarities (i.e., the average of the values computed for documents belonging to the
same class) and at comparing them with the outer similarities (i.e., the similarity
computed by considering only documents belonging to different classes). To this
purpose, the values inside the matrix can be aggregated according to the class of
membership of the related elements: given a set of documents belonging to n
prior classes, a similarity matrix S about these documents can be summarized
by an n × n matrix CS, where the generic element CS(i, j) represents the average
similarity between class i and class j.
    CS(i, j) = Σ_{x,y∈Ci, x≠y} DIST(x, y) / ( |Ci| × (|Ci| − 1) )    if i = j
    CS(i, j) = Σ_{x∈Ci, y∈Cj} DIST(x, y) / ( |Ci| × |Cj| )           otherwise
where DIST(x, y) is the chosen distance metric (the Noise metric or our Fourier
metric).
The above definition is significant because the metric values are normalized, so that
different approaches can be plugged in and their performance compared in the ideal
setting for each approach.
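The aggregation above can be computed as in the following sketch (ours), which assumes a precomputed n × n matrix of pairwise DIST values and a class label per document, with at least two documents per class.

    import numpy as np

    def class_summary(dist, labels):
        # dist: n x n matrix of pairwise DIST values; labels: class of each document
        classes = sorted(set(labels))
        cs = np.zeros((len(classes), len(classes)))
        members = {c: [i for i, l in enumerate(labels) if l == c] for c in classes}
        for a, ci in enumerate(classes):
            for b, cj in enumerate(classes):
                if a == b:
                    vals = [dist[x][y] for x in members[ci]
                            for y in members[ci] if x != y]
                    cs[a, b] = sum(vals) / (len(members[ci]) * (len(members[ci]) - 1))
                else:
                    vals = [dist[x][y] for x in members[ci] for y in members[cj]]
                    cs[a, b] = sum(vals) / (len(members[ci]) * len(members[cj]))
        return cs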
The higher the values on the diagonal of the corresponding CS matrix are
w.r.t. those outside the diagonal, the higher is the ability of the similarity measure to separate different classes. In the following we report a similarity matrix
for each dataset being considered; as will be clear, the reported results show
that our technique is quite effective for outlier detection. In particular, the similarity matrix gives an intuition about the ability of the approach to catch
the neighboring documents of each class, while the number of outliers detected
for each dataset is reported in a separate table. We used the following parameter
values for the experiments: as k, the maximum number of documents
supposed to belong to each class, and as R, the average distance between
documents belonging to the same class.
Measuring Effectiveness for Astronomy. For this dataset our prior
knowledge is the partition of the documents into two classes. As is easy to
see in Figure 2(a) and (b), Fourier outperforms Noise by allowing a perfect
assignment of the proper neighboring class to each document.
(a) Noise
          Class 1  Class 2
Class 1   0.6250   1
Class 2   1        0.6250

(b) Fourier
          Class 1  Class 2
Class 1   0.9790   0.8528
Class 2   0.8528   0.9915

Fig. 2. Noise and Fourier similarity matrices for the Astronomy dataset
In Figure 3 the number of detected outliers is reported. The actual number of
outliers for each class is 7, so, as is easy to see, Fourier detected exactly all the
outliers in the dataset; such a result is quite understandable considering that the
similarity matrix for Fourier exactly recognized the neighboring documents.
Method    Class 1  Class 2
Noise     5        4
Fourier   7        7

Fig. 3. Noise and Fourier number of detected outliers for the Astronomy dataset
Measuring Effectiveness for Sigmod. In this case there were 3 main
classes, as shown in Figure 4(a) and (b). Also in this case Fourier outperforms Noise. As we can see, differences among the various classes are marked
with higher precision by Fourier. This is mainly due to the fact that our approach is quite discriminative, since it takes into account all the document features. For this dataset the number of actual outliers was 8 for each class; as is
easy to see in Figure 5, Fourier still outperforms Noise on this dataset.
(a) Noise
          Class 1  Class 2  Class 3
Class 1   0.9986   0.7759   0.7055
Class 2   0.7759   0.9889   0.7566
Class 3   0.7055   0.7566   0.9920

(b) Fourier
          Class 1  Class 2  Class 3
Class 1   0.9885   0.7439   0.7108
Class 2   0.7439   0.9899   0.7223
Class 3   0.7108   0.7223   0.9874

Fig. 4. Noise and Fourier similarity matrices for the Sigmod dataset
Method    Class 1  Class 2  Class 3
Noise     6        4        5
Fourier   7        8        8

Fig. 5. Noise and Fourier number of detected outliers for the Sigmod dataset
5 Conclusion
In this paper we addressed the problem of detecting outliers in XML data.
The technique we have proposed is mainly based on the idea of representing
a document as a signal. Thereby, the similarity between two documents can be
computed by analyzing their Fourier transforms, thus defining a distance measure
that can be exploited to define distance-based outliers. Experimental results
showed the effectiveness of the approach in detecting outlying XML documents.
References
1. C. C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In SIGMOD'01, 2001.
2. R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In FODO'93, pages 69–84, 1993.
3. F. Angiulli and F. Fassetti. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. TKDD, 3(1), 2009.
4. A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD'96, pages 164–169, 1996.
5. V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
6. M. M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In SIGMOD'00, 2000.
7. S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese. Fast detection of XML structural similarity. IEEE TKDE, 17(2):160–175, 2005.
8. D. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall, 1980.
9. H. Kashima and T. Koyanagi. Kernels for semi-structured data. In Procs. Int. Conf. on Machine Learning (ICML'02), pages 291–298, 2002.
10. J. L. Y. Koh, M. L. Lee, W. Hsu, and W. T. Ang. Correlation-based attribute outlier detection in XML. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 1522–1524, Washington, DC, USA, 2008. IEEE Computer Society.
11. A. Nierman and H. V. Jagadish. Evaluating structural similarity in XML documents. In Procs. 5th Int. Workshop on the Web and Databases (WebDB 2002), 2002.
12. S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE'03, pages 315–326, 2003.
13. D. Rafiei and A. Mendelzon. Efficient retrieval of similar time series. In FODO'98, 1998.
14. C. M. Teng. Polishing blemishes: Issues in data correction. IEEE Intelligent Systems, 19:34–39, 2004.
15. M. Weis and F. Naumann. DogmatiX tracks down duplicates in XML. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 431–442, New York, NY, USA, 2005. ACM.
16. X. Zhu and X. Wu. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev., 22:177–210, November 2004.
P2P support for OWL-S discovery
Domenico Redavid†, Stefano Ferilli⋆, and Floriana Esposito⋆

† Artificial Brain S.r.l., Bari, Italy
⋆ Computer Science Department, University of Bari “Aldo Moro”, Italy
[email protected]
{ferilli, esposito}@di.uniba.it
Abstract. The discovery of Web services is often constrained by the rigid
structure of the registries containing their XML descriptions. In recent years,
some methods that replace the traditional UDDI registry with Peer-to-Peer
networks for the creation of catalogs of Web services have been
proposed in order to make this structure flexible and usable. This paper
proposes a different view by placing the semantic description of services
as the content of P2P networks and showing that all the information needed
for efficient Web service discovery is already contained in the OWL-S
description of a service.
1 Introduction
The discovery of Web services (WS)1 is achieved through Universal Description,
Discovery and Integration (UDDI)2, which provides a standard mechanism to
register and search WS descriptions. A UDDI registry is an indexed database
that contains instances of the Web Services Description Language (WSDL)3, in turn
based on the eXtensible Markup Language (XML)4 and independent from hardware
platforms. A requester needing to use a service queries the UDDI registry to
find the one that best meets its needs. The registry returns an access point and
a WSDL description, which are then used by the requester to build the SOAP5
messages needed to communicate with the actual service.
The UDDI registry is supported by a worldwide network of nodes connected
in a federation. When a client sends information to a registry, this is propagated
to the other nodes. In this way it implements data redundancy, providing a certain
degree of reliability. However, data replication implies lower consistency and is
not a scalable approach. Another limitation of UDDI is the search mechanism:
it can focus only on a single search criterion such as name, location, category
of business, etc.

1 W3C Web of Services - http://www.w3.org/standards/webofservices/
2 Universal Description, Discovery and Integration v3.0.2 (UDDI), OASIS Specification - http://uddi.org/pubs/uddi_v3.htm
3 Web Services Description Language (WSDL) Version 2.0 Part 0: Primer, W3C Recommendation 26 June 2007 - http://www.w3.org/TR/wsdl20-primer/
4 W3C XML Technology - http://www.w3.org/standards/xml/
5 SOAP Version 1.2 Part 0: Primer (Second Edition), W3C Recommendation 27 April 2007 - http://www.w3.org/TR/soap12-part0/
Within the Service-Oriented Architecture (SOA) [3], the registry has a role similar to yellow pages, where a list of services can be found.
To fully exploit the potential of this type of architecture, the registry should be
consultable not only by humans but also by software systems that need to find,
select and compose services in an automatic way. In recent years research has
been focusing on Peer-to-Peer (P2P) technologies [2] that offer Distributed Hash
Table (DHT) [11] functionalities. A P2P network provides a typical distributed,
decentralized approach where multiple computers are interconnected and communicate by exchanging messages. A DHT partitions the items of a key set among
the participating nodes, and can send messages to the owner of a given key in an
efficient manner. A P2P network with DHT support is scalable and solves
the problem of data redundancy, but it supports only exact matching of keywords.
The inclusion in the P2P network of references to service semantics could be a
turning point, because such information could be exploited for the automatic
discovery of services with the help of semantic matchmaking techniques. In this
paper we discuss how this vision can be realized. Section 2 introduces the basic
concepts related to WS registries and catalogs based on P2P protocols, and the
OWL-S language for the representation of the semantics associated with services.
Section 3 describes an implementation of the P2P network created by means of
the Open Chord API and OWL-S. Finally, Section 4 presents an analysis of the
potential of the proposed approach.
2 Background

2.1 Web service registries
Web services are software systems identified by means of a Web address and designed to support interoperability between computers on a network. They have
public interfaces defined and described as XML documents in a format, such as
WSDL, that can be processed by a machine in an automatic way. Their definitions can be sought by other software systems, which can directly interact
with the Web service operations described in the interface by activating the appropriate messages enclosed in a SOAP envelope. These messages are usually
transported via the Hypertext Transfer Protocol (HTTP) and formatted according to the XML standard. For the purposes of this discussion it is important to
point out the current approaches to the organization of Web services.
These approaches can be broadly classified as centralized or decentralized. The
traditional centralized approach includes UDDI, where a central registry is used
to store descriptions of Web services. The current UDDI approach attempts to
mitigate the disadvantages of centralization by replicating the entire information
on different sites. Replication, however, may temporarily improve performance
only if the number of UDDI users is limited: as the number of replicated sites grows,
the consistency of the duplicated data decreases. The replication of UDDI data
is not a scalable approach. For this reason, different approaches based on decentralized
registries have been proposed in order to connect individual customers through
the P2P network. Since this technology organizes peers into a hypercube, the
management becomes inefficient in the presence of a large amount of data. A
solution to this problem is given in [14], where a method is presented for reducing
the size of the indexing scheme that maps the multidimensional information space
to physical peers. However, this method does not use semantic descriptions. Web service discovery based on P2P is also discussed in [18] and [5],
where ad hoc model frameworks are proposed for this purpose. As a starting
point, in the next section we will analyze a proposed approach that combines
ontologies and DHT-based P2P as a sophisticated solution to these issues.
2.2 Web Service catalog system based on DHT
Without a central registry, the easiest way to find out the location of a service in
a distributed system is to send the query to each participant (service provider).
While this approach might work for a small number of service providers, it is
certainly not scalable in a large distributed system. When a system includes
thousands of nodes, facilities are needed that allow the selection of a subset of nodes
fitted with the functionality exposed in the catalogs. The
new generation of P2P systems includes complete DHT features [11] for decentralized applications. Some groups have proposed innovative approaches such
as CAN [11], Pastry [13] and Chord [17], which eliminate the defects of the
first P2P systems like Gnutella6 and Napster7. Although they are implemented
in different ways, all these systems have interfaces to support access to the DHT.
These interfaces permit requesting shared information. In contrast to UDDI,
the P2P network content is usually described and indexed locally within each
peer, while search queries are propagated through the network. A central index
that spans the whole network is not required. Given a key, the corresponding
data items can be efficiently located using up to O(log(n)) network messages,
where n is the total number of nodes in the system [17]. In addition, distributed
systems evolve while remaining scalable to a large number of nodes. Current
efforts are directed towards this functionality in order to provide a catalog of
services that is fully distributed and scalable. The approach chosen for the purposes of
this paper is Chord, because it proposes an original approach to the problem of
efficient location and is able to keep the bandwidth close to optimal when
managing the arrival and departure of concurrent nodes [8].
Chord uses routed queries to locate a key, minimizing the number of nodes visited even in
large networks. What distinguishes Chord from other P2P applications
is its ease of use and its provable performance and correctness. In essence, Chord
supports one operation: given a key, it maps the key onto a node. Data location
can be implemented by associating each key with a datum. In detail, it routes
a key through a sequence of O(log(n)) other nodes toward the destination. This
requires that a Chord node maintains information about O(log(n)) other nodes for
efficient routing. When this information is out of date, performance is proved to
degrade gracefully. This is important in practice because nodes will
join and leave arbitrarily, and consistency about O(log(n)) nodes may be hard
to maintain.

6 Gnutella Web site - http://rfc-gnutella.sourceforge.net/
7 Napster Web site - http://free.napster.com/
Only one piece of information per node needs to be correct in order
to guarantee correct routing of queries. The Chord protocol uses the SHA-1
hash function to assign an m-bit identifier to each node and key. Furthermore, it
uses consistent hashing, which allows nodes to leave and enter the network
with minimal disruption [6]. The integer m is chosen to be large enough to make
negligible the probability that two nodes (or keys) receive the same identifier.
The node identifier is computed by hashing the node's IP address, while a key
identifier is obtained by hashing the key. The identifiers are arranged on a circular
ring of size 2^m, called the Chord ring. The identifiers on the Chord ring are
numbered from 0 to 2^m − 1. A key is assigned to the node whose identifier is
equal to or greater than the key identifier; this node is called the successor of
the key k and is the first node at or after k on the circle in clockwise direction. When
a node n wants to find a certain key k, it uses a lookup function that returns
the successor of n if k lies between n and its successor; otherwise it forwards the
query along the circle. Furthermore, in order to provide more efficient lookups, parts
of the routing information are stored in the nodes. In particular, each node n
maintains a routing table with at most m entries (where m is the number of
bits in the identifiers), called the finger table. Stabilization primitives are used to
keep the finger tables, as well as the Chord ring itself, up to date.
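The identifier assignment and the successor rule can be illustrated with the small Python sketch below (ours, not Open Chord code); an identifier length of 16 bits is assumed only for the sake of the example, and the peer names are invented.

    import hashlib

    M_BITS = 16  # assumed identifier length (2^m positions on the ring)

    def chord_id(value: str, m: int = M_BITS) -> int:
        # SHA-1 hash of the value, reduced to an m-bit identifier
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % (2 ** m)

    def successor(key_id: int, node_ids: list) -> int:
        # first node whose identifier is equal to or greater than the key,
        # wrapping around the ring if necessary
        ring = sorted(node_ids)
        for n in ring:
            if n >= key_id:
                return n
        return ring[0]

    # a key is stored at the successor of its identifier
    nodes = [chord_id(url) for url in ("peer-a", "peer-b", "peer-c")]
    home = successor(chord_id("BookPriceService"), nodes)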
This structure can be used to create Web Service catalogs. For example, in
[19] each node in the system is a service provider or requester, so that both these
actors are connected together in the Chord ring. When a service provider
Ni wants to publish a service, it creates the service catalog item, i.e., the tuple
C = (Key, Summary). The Chord protocol routes the catalog information to the
corresponding node of the system in accordance with the key in the catalog.
Thus, each node in the system contains part of the catalog information, and all
the nodes together constitute the global catalog system, implementing the functionality of a traditional UDDI registry. With WSDL, a Web service can be
expressed as a set of operations, each of which implements a certain amount
of functionality. An operation is specified by its name and the types of its input and
output messages. The service name is used as the key of the catalog information for
the DHT hashing algorithm. In line with this, the operations included in the service and the messages associated with these operations are used as the summary. The
catalog for a Web service WSi has the structure CWSi = (Key, Summary, N),
where:
– Key is the name of WSi,
– Summary contains the operations and their messages included in WSi,
– N is the node that publishes WSi.
In the same paper [19] a mapping is proposed between the information contained
in the nodes (related to the WSDL) and the ontology classes that represent it
in OWL-S services. In contrast, our approach foresees that the information contained in the catalogs is taken directly from the OWL-S descriptions already
available online.
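A minimal sketch of the catalog tuple and of its publication under the hash of the service name is given below; this is our illustration (with invented example values), and in our approach the key and summary would be drawn from the OWL-S description rather than from WSDL.

    from dataclasses import dataclass

    @dataclass
    class CatalogEntry:
        key: str       # name of WS_i (or, in our approach, an OWL-S profile output)
        summary: dict  # operations of WS_i and the messages they use
        node: str      # URL of the peer N that publishes WS_i

    # toy in-memory stand-in for the DHT: identifier -> list of entries
    dht = {}

    def publish(entry: CatalogEntry, hash_fn) -> None:
        # route the entry to the bucket of its key identifier (hash_fn could be
        # the chord_id sketch shown earlier)
        dht.setdefault(hash_fn(entry.key), []).append(entry)

    publish(CatalogEntry("BookPriceService",
                         {"getPrice": ["BookRequest", "PriceResponse"]},
                         "http://peer-a.example/"),
            hash_fn=hash)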
2.3 Web Ontology Language for Services (OWL-S)
Semantic Web Services [9] provide an ontological framework for describing services, messages, and concepts in a machine-readable format, enabling logical
reasoning on service descriptions. The Web Ontology Language for Services
(OWL-S) provides a Semantic Web Services framework on which an abstract
description of a service can be formalised. It is an upper ontology described with
OWL8 whose root class is Service; therefore, every described service maps onto
an instance of this concept. The upper-level Service class is associated with three
other classes:
Service Profile. The service profile specifies the functionality of a service.
This concept is the top-level starting point for the customizations of the
OWL-S model that supports the retrieval of suitable services based on their
semantic description. It describes the service by providing several types of
information:
• Human readable information: such as the service description, service
name, contact information, etc.;
• Functionalities: i.e. parameter type identifiers, identifiers for the input
and output, parameters of service methods, preconditions, results, etc.;
• Service parameters: which include parameter identifiers (e.g. name, value)
used by the service;
• Service categories: these include identifiers for defining the category of
service, i.e. category name, taxonomy, value, code;
Service Model. The service model exposes to clients how to use the service, by detailing the semantic content of requests, the conditions under
which particular outcomes will occur, and, where necessary, the step by step
processes leading to those outcomes. In other words, it describes how to
ask for (invoke) the service and what happens when the service is carried
out. From the point of view of the processes, the service model defines the
concept Process that describes the composition of one or more services in
terms of their constituent processes. A Process can be atomic, composite
or simple: an atomic process is a description of a non-decomposable service
that expects one message and returns one message in response. A composite
process consists of a set of processes within some control structures defining
a workflow. Whereas a simple process provides a service abstraction that
allows to view a composite service as an atomic one. Each process can have
any number of inputs, a set of preconditions, all of which must hold for
the process to be successfully invoked, and any number of results (outputs
and/or effects) that come from a successful execution of the service.
Service Grounding. A grounding is a mapping from an abstract to a concrete specification of those service description elements that are required for
interacting with the service. In general, a grounding indicates a communication protocol, a message format and other service-specific details (e.g., port
numbers, the serialization techniques of inputs and outputs, etc.). From the
point of view of processes, a service grounding enables the transformation
of the inputs and outputs of an atomic process into concrete atomic process
grounding constructs.

8 OWL Web Ontology Language, W3C Recommendation 10 February 2004 - http://www.w3.org/TR/owl-features/
Fig. 1. Schema Mapping WSDL-OWL-S
As we can see from Figure 1, the OWL-S grounding maps the semantic description of the service to the corresponding WSDL. This means that the information contained in the Summary of the catalog shown in the previous section can
be directly obtained from OWL-S. Since each OWL-S instance, as well as all its
constituent parts, has its own URI, such information is always available online.
3 OWL-S discovery with P2P

3.1 Open Chord
Open Chord9 is an open source implementation of Chord. Its architecture consists of three levels (Figure 2). At the lowest level is the implementation of
the communication protocol used (Communication Layer), based on a network
protocol (such as Java sockets). Currently, two implementations are provided:
a local communication protocol, developed for testing purposes,
and a socket-based protocol, which provides reliable communication between Open
Chord peers based on TCP/IP sockets.
9 Open Chord Web site - http://open-chord.sourceforge.net/
Fig. 2. Open CHORD architecture
The abstraction level (Communication abstraction layer) provides two abstract classes that must be implemented to realize the communication protocol:
– Proxy, which represents a reference to remote peers in the Open Chord overlay
network;
– Endpoint, which provides a connection point for remote peers conforming to a
specific communication protocol.
Concrete implementations of a communication protocol are determined with
the help of the URL of a peer.
The Chord logic level, which implements the functionality of Chord, offers
two interfaces for Java applications that abstract from the implementation of
the Chord DHT routing. Both interfaces (i.e., Chord and AsynChord), which
can be used by an application built on top of Open Chord to retrieve, remove,
and store data in a synchronous or asynchronous way from/to the underlying
DHT, provide some common methods that are important to create, join, and
leave an Open Chord DHT. The Chord logic level is also responsible for data
replication and for maintaining the properties necessary to keep the
DHT running, as described in [17].
3.2 A prototype implementation
To simulate an OWL-S P2P network using the Open Chord API, a simple graphical application has been developed that displays a drop-down menu consisting of the items File, Edit, and View.

Fig. 3. Screenshot of the prototype

By selecting 'Create Network' from the 'File' menu, a single peer is created, consisting simply of a new URL. Only the first node
has the ability to create a new network; to add other nodes, the join method of
the Chord interface is invoked. This method works similarly to the method
used to create the network, but in addition to the node to be added,
an existing URL that is already part of the network is required. This is called
the bootstrap peer. To test the operations of the P2P network, we took a set
of services from the OWLS-TC10 dataset, placed them in a local folder, and inserted them in the network using the function 'Create nodes with' from the 'File'
menu, which automatically creates a number of nodes equal to the number of files
in the local folder. Subsequently, by selecting 'Insert' from the 'Edit' menu, a dialog
appears that allows new nodes to be inserted in the network individually, specifying
the URL of the bootstrap node and the URL of the new node, which must be
different from those used for the other nodes in the network. The next step is
the population of the network. To work with a DHT, the choice of the key is a
fundamental step. Our key was the output value of the OWL-S profile of the
selected services. The output value is extracted by parsing the service profile available online. The value we associated with the key is the URI of the OWL-S
service itself. Selecting 'Populate the Network' from the 'File' menu, this procedure
is executed automatically for all services contained in the local folder.
The synchronous retrieval of the value associated with the key is carried out
by invoking the Open Chord method retrieve(Key). The result is an array of strings
containing the URLs of zero or more OWL-S services, depending on whether the
searched key is the output of one or more services inserted in the network. Figure
4 shows the results obtained for the key BOOK using the 'Search...' pop-up
opened from the 'Modify' menu.

10 OWLS-TC service retrieval test collection - http://projects.semwebcentral.org/projects/owls-tc

Fig. 4. An example of results, setting BOOK as the key
By applying methods that use lexical ontologies (e.g., WordNet11), the synonyms
of the key can be obtained. By invoking the method retrieve(Key) on the synonyms,
those services that do not have the initial key as output are also found,
providing a solution to the problem of exact matching between the searched key
and the output of the service.
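The following sketch (ours) illustrates the idea: the DHT lookup is exact, but expanding the key with synonyms obtained from a lexical ontology such as WordNet widens the set of retrieved OWL-S service URIs. The synonym source is a hypothetical callable, the catalog is a simple dictionary stand-in for the Open Chord retrieve(Key) call, and the example URIs are invented.

    def retrieve(catalog, key):
        # exact-match lookup: URIs of the OWL-S services whose profile output is `key`
        return list(catalog.get(key.upper(), []))

    def retrieve_with_synonyms(catalog, key, synonyms_of):
        # synonyms_of(key) is assumed to query a lexical ontology (e.g. WordNet)
        results = retrieve(catalog, key)
        for syn in synonyms_of(key):
            for uri in retrieve(catalog, syn):
                if uri not in results:
                    results.append(uri)
        return results

    # usage with a toy catalog and synonym table
    catalog = {"BOOK": ["http://example.org/BookPriceService.owls"],
               "VOLUME": ["http://example.org/VolumeFinderService.owls"]}
    hits = retrieve_with_synonyms(catalog, "book", lambda k: ["volume"])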
4 Discussion
The use of P2P networks for SWS discovery opens the way for the application
of intelligent methods to satisfy the requests of Web users. The availability of
semantic descriptions of services allows the realization of new scenarios in which
the weight of the reasoning for the attainment of a goal moves increasingly towards software systems. In the scenario shown in Figure 5, a user queries a
software agent, capable of interpreting natural language, asking it to find a
service that returns a certain result (Goal). The agent uses the P2P network for
the discovery of services that may be suitable to meet the demand. Since the
P2P network returns the semantic descriptions of the services, the agent can apply
automated reasoning methods to select and compose the most appropriate services with respect to the available user inputs. If the services are described
using different ontologies, it will use semantic alignment tools12 and approaches
from the literature [4, 1] during the execution of these operations.

11 WordNet, a lexical database for English - http://wordnet.princeton.edu/wordnet/
12 Alignment API and Alignment Server - http://alignapi.gforge.inria.fr/
Fig. 5. Scenario
This scenario describes the automatic orchestration of SWS and is particularly suitable for
use with OWL-S [12]. Looking in particular at the discovery use case, there are
various matchmaking techniques that exploit the OWL-S description to determine which services are best suited to fulfil the request. Srinivasan et al. [16]
propose an enhancement of the CODE OWL-S IDE [15] where the matching
procedure used is based on the algorithm described in [10]. It defines a flexible matching
mechanism based on subsumption in Description Logics. More sophisticated
solutions are provided by OWLS-MX [7], a hybrid Semantic Web Service matchmaker that retrieves services for a given query written in OWL-S itself. In other
words, for every OWL-S service representing the description of the desired service (the query), it returns an ordered set of relevant services ranked according to
their degree of (syntactic and/or semantic) matching with the query. This approach complements logic-based reasoning with approximate matching relying
on Information Retrieval metrics.
Figure 6 illustrates the graphical user interface (GUI) of a software component that we have developed as a support for the test of matchmaking on
retrieved services. At the top there are two text fields designed to insert user
inputs and the searched output of the service, respectively. By pressing the Ok
button the request is processed. The results will vary depending on the chosen sort order (syntactic, semantic or hybrid) that reflect those available with
OWLS-MX API13 . Finally, clicking on a listed service the following information
13
OWLS-MX Semantic Web Service Matchmaker API - http://www.semwebcentral.
org/projects/owls-mx/
63
Fig. 6. Service discovery GUI
will be displayed: name, URI, inputs and outputs. The work presented in this
paper represents only a starting point towards a SWS discovery system based
solely on semantic descriptions of services. Future work includes extensive use of
the annotations included in the OWL-S profile for the selection of services that
best meet the user's needs. For this purpose, domain-ontology matchmaking methods will be combined with lexical-ontology-based approaches in order to analyze the textual descriptions of the services during the discovery process.
References
[1] David, J., Euzenat, J., Scharffe, F., dos Santos, C.T.: The alignment api 4.0.
Semantic Web 2(1), 3–10 (2011)
[2] Doyle, J.F.: Peer-to-peer: harnessing the power of disruptive technologies. Ubiquity 2001 (May 2001)
[3] Erl, T.: Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall PTR, Upper Saddle River, NJ, USA (2005)
[4] Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag, Heidelberg (DE)
(2007)
[5] Gharzouli, M., Boufaida, M.: Pm4sws: A p2p model for semantic web services
discovery and composition. Journal of Advances in Information Technology 2(1)
(2011)
[6] Karger, D.R., Lehman, E., Leighton, F.T., Panigrahy, R., Levine, M.S., Lewin, D.:
Consistent hashing and random trees: Distributed caching protocols for relieving
hot spots on the world wide web. In: STOC. pp. 654–663 (1997)
[7] Klusch, M., Fries, B., Sycara, K.: Automated semantic web service discovery with OWLS-MX. In: AAMAS '06: Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems. pp. 915–922. ACM Press, New York, NY, USA (2006)
[8] Liben-Nowell, D., Balakrishnan, H., Karger, D.R.: Observations on the dynamic evolution of peer-to-peer networks. In: Druschel, P., Kaashoek, M.F., Rowstron, A.I.T. (eds.) IPTPS. Lecture Notes in Computer Science, vol. 2429, pp. 22–33. Springer (2002)
[9] McIlraith, S.A., Son, T.C., Zeng, H.: Semantic Web Services. IEEE Intelligent Systems 16(2), 46–53 (2001)
[10] Paolucci, M., Kawamura, T., Payne, T.R., Sycara, K.P.: Semantic Matching of Web Services Capabilities. In: ISWC '02: Proceedings of the First International Semantic Web Conference on The Semantic Web. pp. 333–347. Springer-Verlag, London, UK (2002)
[11] Ratnasamy, S., Francis, P., Handley, M., Karp, R.M., Shenker, S.: A scalable content-addressable network. In: SIGCOMM. pp. 161–172 (2001)
[12] Redavid, D., Esposito, F., Iannone, L.: A comparative study on semantic web services frameworks from the dynamic orchestration perspective. In: Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD). pp. 355–359 (October 2010)
[13] Rowstron, A.I.T., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware. Lecture Notes in Computer Science, vol. 2218, pp. 329–350. Springer (2001)
[14] Schmidt, C., Parashar, M.: Flexible information discovery in decentralized distributed systems. In: HPDC. pp. 226–235. IEEE Computer Society (2003)
[15] Srinivasan, N., Paolucci, M., Sycara, K.: CODE: A Development Environment for OWL-S Web services. Tech. Rep. CMU-RI-TR-05-48, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (October 2005)
[16] Srinivasan, N., Paolucci, M., Sycara, K.: Semantic Web Service Discovery in the OWL-S IDE. In: HICSS '06: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. p. 109.2. IEEE Computer Society, Washington, DC, USA (2006)
[17] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw. 11(1), 17–32 (2003)
[18] Xu, B., Chen, D.: Semantic web services discovery in p2p environment. In: ICPP Workshops. p. 60. IEEE Computer Society (2007)
[19] Yu, S., Zhu, Q., Xia, X., Le, J.: A novel web service catalog system supporting distributed service publication and discovery. In: Ni, J., Dongarra, J. (eds.) IMSCCS (1). pp. 595–602. IEEE Computer Society (2006)
Marine Traffic Engineering through Relational Data Mining

Antonio Bruno1 and Annalisa Appice1,2

1 Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, via Orabona, 4 - 70126 Bari - Italy
2 CILA (Centro Interdipartimentale per la ricerca in Logica e Applicazioni)
[email protected], [email protected]
Abstract. The automatic discovery of maritime traffic models can provide useful information for the identification, tracking and monitoring of vessels. Frequent patterns represent a means to build human-understandable representations of the maritime traffic models. This paper describes the application of a multi-relational method of frequent pattern discovery to marine traffic investigation. Multi-relational data mining is required here because of the variety of the data and the multiplicity of the vessel positions (latitude-longitude) continuously transmitted by the AIS (Automatic Identification System) installed on shipboard. This variety of information leads to a relational (or complex) representation of the vessels which, in addition, naturally models the total temporal order over consecutive AIS transmissions of a vessel. The viability of relational frequent patterns as a model of the maritime traffic is assessed on navigation data actually collected in the gulf of Taranto.
1 Introduction
Marine traffic engineering is a research field originally defined in the 1970s [10] with the aim of investigating marine traffic data and building a human-interpretable model of the maritime traffic. Through the understanding of this model, the Vessel Traffic Service (VTS) can improve the port and fairway facilities as well as the traffic regulation. Intuitively, the complexity of building a significant maritime traffic model resides in the requirement of a model able to reflect both the spatial distribution and the temporal characteristics of the traffic flow.
Although marine traffic engineering was a popular research field between the 1970s and the 1980s, after the 1990s relevant literature and research projects in this field appeared less frequently. This decline in research interest was due to the difficulty of collecting traffic data: the required observation time was long, and several technological limitations constrained the observation. Today, the data collection problem is definitely overcome. The widespread use of the Automatic Identification System (AIS) has had a significant impact on maritime technology, and any VTS is now able to obtain a large volume of traffic information which comprises the timestamped latitude and longitude of the monitored vessels. On the other hand, the rapid developments in data mining research have paved the way to automatically analyzing this large volume of traffic data in order to extract the knowledge required to feed the marine traffic management service and the VTS decision-making systems. Both these factors, traffic data availability and data mining techniques, have boosted the recent renewed scientific interest in marine traffic engineering. Clustering [9], classification [5] and association rule discovery [11] techniques have been employed to analyze AIS data and discover characteristics and/or rules for marine traffic flow forecasting and for the development and programming of marine traffic engineering. Although these studies have proved that data mining techniques are able to provide an extra aid for situational awareness in maritime traffic control, no marine traffic model described in these works is able to capture the truly temporal characteristics of each AIS transmission. In fact, AIS transmissions are timestamped, but a traditional data mining technique loses the time label of the AIS data and represents a navigation trajectory as a set, rather than a sequence, of consecutive latitude-longitude vessel positions.
In this paper, we resort to multi-relational data mining to address the task of learning a human-interpretable model of the maritime traffic in sea ports, where several vessels are entering and leaving the port. The innovative contribution of this work is that, to the best of our knowledge, this is the first study in maritime traffic engineering which spans the traffic data over several data tables (or relations) of a relational database and discovers relational patterns (i.e. patterns which may involve several relations at once) to describe the maritime traffic model. In this multi-relational representation, we are able to model vessel data and AIS data as distinct relational data tables (one for each data type). This allows us to distinguish between the reference objects of analysis (vessel data) and the task-relevant objects (AIS data), and to represent their interactions. The modeled interactions also include the total temporal order over the AIS transmissions of the same vessel. SPADA [6] is a multi-relational data mining method that discovers relational patterns and association rules. Relational patterns extracted by SPADA have proved to be effective for capturing the behavioral model underlying census data [1] and workflow data [12]. In the case of traffic data, we use SPADA to discover interesting associations between a vessel (reference object) and a navigation trajectory. Each navigation trajectory represents a spatio-temporal pattern obtained by tracing subsequent AIS transmissions (task-relevant objects) of a vessel. This kind of spatio-temporal rule automatically identifies the well-traveled navigation courses. This information can be employed in several ways: to appropriately arrange the navigation traffic entering a gulf in order to avoid collisions or traffic jams, or to detect vessels which suspiciously deviate from the planned navigation course. The main limitation of SPADA in this application is its high computational complexity, which makes the analysis of large databases practically unfeasible. To overcome this limitation, we run the distributed version of SPADA described in [2].
In order to prove the viability of the multi-relational approach in marine traffic engineering, we describe a relational representation of the traffic data derived from monitoring the vessels that entered and left the gulf of Taranto (South of Italy) between September 1, 2010 (00:04:23) and October 9, 2010 (23:58:52) (Section 2), and we briefly illustrate the multi-relational method for relational pattern discovery (Section 3). Finally, we comment on the significance of the navigation traffic model we have extracted and on its viability for marine traffic engineering (Section 4), and draw some conclusions.
2 Marine Traffic Data
For this study, we consider the navigation traffic data collected for 106 vessels
entering and/or leaving the gulf of Taranto between September 1, 2010 (00:04:23)
and October 9, 2010 (23:58:52). The traffic data are obtained from [13]. As in
[11], the area of the gulf is converted into a geographic grid of 0.005◦ × 0.005◦
square cells. Each cell of the grid is then enumerated by a progressive number.
For each vessel, the following data are collected:
– the name of the vessel,
– the MMSI, that is, a numeric code that unambiguously identifies the vessel,
– the vessel category, that is, wing, pleasure craft, tug, law enforcement, cargo,
tanker or other, and
– the sequence of AIS messages which were sent by the transceiver installed
on shipboard.
The AIS transceiver sends dynamic messages every two to thirty seconds
depending on the vessel speed, and every three minutes while the vessel is at
anchor. As we are interested in describing the observable change of the
vessel position within the geographic grid, we decide to consider only those AIS
transmissions which reflect a change of the cell occupied by the vessel. Each AIS
message includes the following data:
– the vessel MMSI;
– the received time (day-month-year hour-minutes-seconds);
– the latitude and longitude of the vessel;
– the course over ground;
– the vessel speed.
The latitude and longitude coordinates of each AIS transmission are transformed into the identifier of the cell containing them. Following the suggestion reported in [11], the course over ground is discretized every 45°, thus obtaining N, E, W, S, NE, NW, SE and SW, while the speed is discretized into low, medium and high (a procedural sketch of this preprocessing is given at the end of this section). After this transformation, the properties of vessels (name and category), the data of the AIS transmissions (cell, speed, direction) and the interactions between vessels and transmitted AIS data are stored as ground atoms in the extensional part of a deductive database. An example of the data stored in the database for the vessel named ALIDA S is reported below.
mmsi(247205900).
name(247205900, alida s).
category(247205900, cargo).
ais(247205900, 2010-10-07 19:51:30).
ais(247205900, 2010-10-07 20:45:26).
ais(247205900, 2010-10-07 21:50:19).
ais(247205900, 2010-10-07 21:55:23).
cell(247205900, 2010-10-07 19:51:30, 312).
cell(247205900, 2010-10-07 20:45:26, 313).
cell(247205900, 2010-10-07 21:50:19, 312).
cell(247205900, 2010-10-07 21:55:23, 311).
direction(247205900, 2010-10-07 19:51:30, northwest).
direction(247205900, 2010-10-07 20:45:26, northwest).
direction(247205900, 2010-10-07 21:50:19, northwest).
direction(247205900, 2010-10-07 21:55:23, northwest).
speed(247205900, 2010-10-07 19:51:30, medium).
speed(247205900, 2010-10-07 20:45:26, medium).
speed(247205900, 2010-10-07 21:50:19, low).
speed(247205900, 2010-10-07 21:55:23, low).
The key predicate mmsi() identifies the reference object (the vessel) of the unit of analysis. The property predicates name(), category(), cell(), direction() and speed() define the value taken by an attribute of an object (the reference object for name() and category(), a task-relevant object for cell(), direction() and speed()). Finally, the structural predicate ais() relates reference objects (vessels) with task-relevant objects (AIS transmissions). This way, the extensional part of the deductive database for SPADA is fed with 19,137 atoms partitioned among 106 units of analysis.
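A procedural sketch of the preprocessing described in this section is reported below. The exact cell enumeration scheme of the grid and the speed thresholds are not detailed above, so the row-major numbering and the two speed thresholds used here are only illustrative assumptions.

# Sketch of the preprocessing step (assumptions: row-major cell numbering of the
# 0.005 x 0.005 degree grid and illustrative speed thresholds).
CELL_SIZE = 0.005

def cell_id(lat, lon, lat0, lon0, n_cols):
    # Map a latitude/longitude pair to a progressive cell number of a grid whose
    # south-west corner is (lat0, lon0) and which has n_cols columns.
    row = int((lat - lat0) / CELL_SIZE)
    col = int((lon - lon0) / CELL_SIZE)
    return row * n_cols + col

def discretize_course(cog_degrees):
    # Discretize the course over ground every 45 degrees.
    labels = ["north", "north_east", "east", "south_east",
              "south", "south_west", "west", "north_west"]
    return labels[int(((cog_degrees + 22.5) % 360) // 45)]

def discretize_speed(knots, low=5.0, high=15.0):
    # Three-level discretization; the thresholds are assumptions.
    return "low" if knots < low else ("medium" if knots < high else "high")

def ground_atoms(mmsi, messages, lat0, lon0, n_cols):
    # messages: iterable of (timestamp, lat, lon, course, speed). Only the
    # transmissions that change the occupied cell are kept, and the ground
    # atoms fed to the extensional database are emitted as strings.
    atoms, last_cell = ["mmsi({}).".format(mmsi)], None
    for t, lat, lon, cog, speed in sorted(messages):
        c = cell_id(lat, lon, lat0, lon0, n_cols)
        if c == last_cell:
            continue
        last_cell = c
        atoms += ["ais({}, {}).".format(mmsi, t),
                  "cell({}, {}, {}).".format(mmsi, t, c),
                  "direction({}, {}, {}).".format(mmsi, t, discretize_course(cog)),
                  "speed({}, {}, {}).".format(mmsi, t, discretize_speed(speed))]
    return atoms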
3 Maritime Traffic Model Discovery
Studies on association rule discovery in Multi-Relational Data Mining [6] are rooted in the field of Inductive Logic Programming (ILP) [8]. In ILP, both relational data and relational patterns are expressed in first-order logic, and the logical notions of a generality order and of downward/upward refinement operators on the space of patterns are used to define both the search space and the search strategy. In the specific case of SPADA, the properties of both reference and task-relevant objects are represented as the extensional part DE of a deductive database D [4], while the domain knowledge is represented as a normal logic program which defines the intensional part DI of the deductive database D. In the application of SPADA to marine traffic engineering, the extensional database stores information on the traffic data (e.g., vessel and AIS data) as reported in Section 2, while the intensional database includes the definition of relations which are implicit in the data but useful for capturing the model underlying them. In this study, the intensional part of the database includes a definition of a relation next, which makes explicit the temporal order over the AIS transmissions that is implicit in the timestamp of each transmission.
A possible definition of the relation next is the following:
next(V, A1, A2) ←
ais(V, T1), ais(V, T2),
cell(V, T1, A1), cell(V, T2, A2),
A1 ≠ A2, T1 < T2,
not(ais(V,T), T1<T, T<T2)
which defines the direct sequence relation between two consecutive AIS transmissions of the same vessel.
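The intended meaning of this rule can also be read procedurally. The sketch below, written under the assumption that a vessel's cell-changing transmissions are available as a time-ordered list of (timestamp, cell) pairs, pairs each transmission with the immediately following one; it is only an illustration of the relation's meaning, not SPADA's own machinery.

# Procedural reading of the next/3 relation (a sketch, not SPADA's machinery):
# pair each cell-changing AIS transmission of a vessel with the following one.
def next_pairs(observations):
    # observations: list of (timestamp, cell_id) for one vessel.
    ordered = sorted(observations)   # total temporal order over transmissions
    return [(a, b) for (_, a), (_, b) in zip(ordered, ordered[1:]) if a != b]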
In SPADA, the set of ground atoms in DE is partitioned into a number of non-intersecting subsets D[e] (units of analysis), each of which includes the facts concerning the AIS transmissions involved in a specific vessel trip e. The partitioning of DE is coherent with the individual-centered representation of training data [3], which has both theoretical (PAC-learnability) and computational advantages (smaller hypothesis space and more efficient search). The discovery process is performed by resorting to the classical levelwise method described in [7], with the variant that the syntactic ordering between patterns is based on θ-subsumption. With SPADA, fragments of the traffic model underlying the navigation of the traced vessels can be expressed as relational navigation rules of the form:
mmsi(V) ⇒ µ(V)   [s, c]
where mmsi(V) is the atom that identifies a vessel, while µ(V) is a conjunction of atoms which provides a description of a fragment of the navigation trajectory traced for V. Each atom in µ(V) describes either the next relation between AIS transmissions, a property of the vessel (type or length), or a datum included in an AIS transmission (the identifier of the crossed geographical cell, the navigation direction, the velocity). An example of a discovered association rule is the following:
vessel(V)⇒
cell(V,T,123), next(V,123,124), next(V,124,125)
[s=63%, c=100%]
The support s estimates the probability p(vessel(V) ∧ µ(V)) on D. This means that s% of the units of analysis D[e] are covered by vessel(V) ∧ µ(V), that is, a substitution θ = {V ← e} · θ1 exists such that (vessel(V) ∧ µ(V))θ ⊆ D[e]. The confidence c estimates the probability p(µ(V) | vessel(V)).
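For rules whose bodies are ground in the cell identifiers, as in the example above, checking whether a unit of analysis covers the rule reduces to atom containment (the general case in SPADA relies on θ-subsumption). The sketch below estimates support and confidence under that simplifying assumption, with atoms represented as plain strings; it is an illustration of the definitions, not SPADA's implementation.

# Simplified support/confidence estimation (assumption: ground rule bodies, so
# coverage is atom containment rather than full theta-subsumption).
def covered_units(units, atoms):
    # units: {vessel_id: set of ground atom strings}; count the units D[e]
    # whose atoms contain every atom of the given set.
    return sum(1 for unit in units.values() if atoms <= unit)

def support_confidence(units, head_atoms, body_atoms):
    # s estimates p(head AND body) over the units of analysis,
    # c estimates p(body | head).
    both = covered_units(units, head_atoms | body_atoms)
    head = covered_units(units, head_atoms)
    return both / len(units), (both / head if head else 0.0)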
Our proposal is to employ SPADA to process a large volume of traffic data and to collect the navigation rules discovered by SPADA in order to obtain an interpretable description of the model underlying the maritime traffic. As the navigation rules describe fragments of the trajectories frequently crossed by the monitored vessels, they are then visualized in a GIS environment for human interpretation. For the purpose of this study, we have further extended SPADA by integrating a rule post-processing module which filters out uninteresting rules and ranks the output of the filtering phase on the basis of rule significance; the top-k rules then compose the maritime traffic model. Interesting rules correspond to non-redundant rules. Formally, let R be the navigation rule set output by SPADA. A rule r ∈ R is labeled as redundant in R iff there exist a rule r′ ∈ R and a substitution θ such that rθ ⊂ r′. For example, let us consider the set of navigation rules which comprises:
r1: vessel(V)⇒ cell(V,T,123).
r2: vessel(V)⇒ cell(V,T,123), next(V,123,124).
r3: vessel(V)⇒ cell(V,T,123), next(V,123,124), next(V,124,125).
Both r1 and r2 are redundant in R due to the presence of r3.
Redundant rules are implicit in non-redundant rules (although they may have a different support, they are always frequent rules), hence we can filter out the redundant navigation rules without losing any knowledge in the maritime traffic model that is finally built. The filtered rules are ranked on the basis of a significance criterion expressed by the pattern length (number of atoms in the rule) and the support value. By decreasing k, we prune less significant knowledge from the model.
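A sketch of this post-processing step is given below, again under the simplifying assumption that rule bodies are ground, so that the θ-subsumption test of redundancy reduces to strict containment between the body atom sets; the representation of a rule as a (body atoms, support) pair is also an assumption.

# Sketch of redundancy filtering and top-k ranking (assumption: ground rule
# bodies represented as frozensets of atom strings, paired with their support).
def filter_redundant(rules):
    # A rule is redundant if some other rule's body strictly contains its body.
    return [(body, support) for body, support in rules
            if not any(body < other_body for other_body, _ in rules)]

def top_k(rules, k):
    # Rank the non-redundant rules by pattern length, then by support.
    ranked = sorted(filter_redundant(rules),
                    key=lambda rule: (len(rule[0]), rule[1]), reverse=True)
    return ranked[:k]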
4 Maritime Traffic Models
A relational model of the maritime traffic in the gulf of Taranto (South of Italy)
was extracted by considering two experimental settings, denoted as S1 and S2.
In the former setting (S1), the intensional part is populated with the definition
of the ternary "next" predicate formulated as in Section 3. In the latter setting
(S2), the intensional part is populated with an intensional definition of both a
new “cell” predicate and a “next” predicate which incorporate the information
on the speed and direction of navigation as follows:
cell(V, T, C, S, D) ←
cell(V, T, C), speed(V, T, S), direction(V, T, D).
next(V, A1, A2, S, D) ←
ais(V, T1), ais(V, T2),
cell(V, T1, A1), cell(V, T2, A2),
speed(V, T2, S), direction(V, T2, D),
A1 ≠ A2, T1 < T2,
not(ais(V, T), T1<T, T<T2).
In both settings, SPADA is run to discover relational rules with 0.1 as the minimum support and 3 as the minimum pattern length. In the first setting, SPADA outputs the geometrical description of fragments of the navigation trajectories entering and leaving the gulf of Taranto. The number of discovered rules is 126. After filtering out the redundant rules, 41 rules are ranked according to the significance criterion. The top-ranked navigation rule is reported below:
vessel(V)⇒
category(V,cargo), cell(V, T, 903),
next(V, 903, 904), next(V, 904, 944),
next(V, 944, 945), next(V, 945, 946), next(V, 946, 986),
next(V, 986, 987).
[s=10.3%, c=100%]
This rule states that 10.3% of the vessels monitored in the gulf of Taranto in the period under study are cargo vessels which follow a navigation trajectory crossing the cells identified by 903, 904, 944, 945, 946, 986 and 987, in this order. The maritime traffic model obtained by selecting the top-5 navigation rules is plotted in Figure 1. By visualizing this model we are able to see the geometrical representation of the maritime trajectories which may be busy in the gulf of Taranto. This information may be employed by the maritime traffic management service in order to appropriately program the maritime traffic in the gulf of Taranto and avoid gridlocks or vessel accidents.
In the second setting, SPADA discovers a more detailed description of the
navigation trajectories frequently crossed in the gulf. In fact, the description
mined for each navigation trajectory now comprises both direction and velocity
of the vessel at each crossed cell in the trajectory. With this setting, SPADA
discovers 11 navigation rules. After filtering out redundant rules, 8 rules are
ranked according to the significance criterion. The top-ranked navigation rule is
reported below:
vessel(V)⇒
category(V,cargo), cell(V, T, 945, low, north east),
next(V, 945, 946, low, north east), next(V, 946, 986, low, north east).
[s=11.3%, c=100%]
This rule states that 11.3% of the vessels in this study move across the cells 945, 946 and 986 maintaining a low velocity and a north-east navigation direction. Although this navigation rule describes a shorter trajectory than the top-ranked rule of the first setting, it provides a deeper insight into the navigation behaviour (velocity and direction) of the vessels crossing these cells, which was ignored before.
Fig. 1. The top-5 relational models of the incoming and outgoing navigation trajectories frequently crossed in the gulf of Taranto: (a) visualization, (b) ranking.

5 Conclusions
In this paper, we presented a preliminary study of the application of relational data mining to marine traffic engineering. Relational data mining is required here to represent the multiplicity and variety of the data continuously transmitted from a vessel during the navigation time. In particular, we prove the viability of a multi-relational approach to obtain human-interpretable models of the maritime traffic by considering the AIS data transmitted from vessels in the gulf of Taranto. The results are encouraging and open appealing and novel directions of research in the field of marine traffic engineering.
As future work, we plan to explore the task of discovering relational rules which include a disjunction of atoms in the rule body, in order to describe those trajectories which include one or more ramifications in the path. Additionally, we intend to use the discovered navigation trajectories to obtain a prediction model that permits predicting the position of a vessel at any future time. This task requires the consideration of both geographical constraints, such as the presence of the mainland (or, in general, physical obstacles), and navigation constraints, such as velocity, direction, timetable and so on.
Acknowledgment
This work is in partial fulfillment of the research objectives of the ATENEO-2010 project entitled "Modelli e Metodi Computazionali per la Scoperta di Conoscenza in Dati Spazio-Temporali" ("Computational Models and Methods for Knowledge Discovery in Spatio-Temporal Data").
References
1. A. Appice, M. Ceci, A. Lanza, F. A. Lisi, and D. Malerba. Discovery of spatial association rules in geo-referenced census data: A relational mining approach.
Intelligent Data Analysis, 7(6):541–566, 2003.
2. A. Appice, M. Ceci, A. Turi, and D. Malerba. A parallel, distributed algorithm for
relational frequent pattern discovery from very large data sets. Intell. Data Anal.,
15(1):69–88, 2011.
3. H. Blockeel and M. Sebag. Scalability and efficiency in multi-relational data mining.
SIGKDD Explorations, 5(1):17–30, 2003.
4. S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases. Springer-Verlag New York, Inc., New York, NY, USA, 1990.
5. R. Lagerweij. Learning a Model of Ship Movements. Thesis for Bachelor of Science
- Artificial Intelligence, University of Amsterdam, 2009.
6. F. A. Lisi and D. Malerba. Inducing multi-level association rules from multiple
relations. Machine Learning, 55(2):175–210, 2004.
7. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge
discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
8. S. Muggleton. Inductive Logic Programming. Academic Press, London, 1992.
9. C. Tang and Z. Shao. Modelling urban land use change using geographically
weighted regression and the implications for sustainable environmental planning. In
Q. Peng, K. C. P. Wang, Y. Qiu, Y. Pu, X. Luo, and B. Shuai, editors, Proceedings
of the 2nd International Conference on Transportation Engineering, pages 4465–
4470. ASCE, American Society of Civil Engineering, 2009.
10. S. Toyoda and Y. Fujii. Marine traffic engineering. The Journal of Navigation,
24:24–34, 1971.
11. M.-C. Tsou. Discovering knowledge from ais database for application in vts. The
Journal of Navigation, 63:449–469, 2010.
12. A. Turi, A. Appice, M. Ceci, and D. Malerba. A grid-based multi-relational approach to process mining. In S. S. Bhowmick, J. Küng, and R. Wagner, editors,
Proceedings of the 19th International Conference on Database and Expert Systems
Applications, DEXA 2008, volume 5181 of Lecture Notes in Computer Science,
pages 701–709. Springer, 2008.
13. Web URL: http://www.marinetraffic.com/ais.
Author Index
Annalisa Appice, 66
Elena Baralis, 14
Elena Bellodi, 26
Antonio Bruno, 66
Luca Cagliero, 14
Gianni Costa, 38
Sašo Džeroski, 1
Floriana Esposito, 54
Stefano Ferilli, 2, 54
Alessandro Fiori, 14
Saima Jabeen, 14
Fabio Leuzzi, 2
Giuseppe Manco, 38, 46
Elio Masciari, 46
Riccardo Ortale, 38
Domenico Redavid, 54
Fabrizio Riguzzi, 26
Ettore Ritacco, 38
Fulvio Rotella, 2