Proceedings of the Workshop on Mining Complex Patterns

Editors: Annalisa Appice (University of Bari, Italy), Michelangelo Ceci (University of Bari, Italy), Corrado Loglisci (University of Bari, Italy), Giuseppe Manco (ICAR-CNR, Italy)

Preface

The International Workshop on Mining Complex Patterns (MCP 2011) was held in Mondello (Palermo), Italy, on September 17th, 2011, in conjunction with AI*IA 2011, the 12th International Conference of the Italian Association for Artificial Intelligence.

During the last two decades, studies in Machine Learning have paved the way for the definition of efficient and stable data mining and knowledge discovery algorithms. Data mining and knowledge discovery can today be considered stable fields, with numerous efficient algorithms and studies proposed to extract knowledge in different forms from data. Although most existing data mining approaches look for patterns in tabular data (typically obtained from relational databases), algorithmic extensions have recently been investigated for massive datasets representing complex interactions between several entities from a variety of sources. These interactions may span multiple levels of granularity as well as spatial and temporal dimensions.

Our purpose in this workshop was to bring together researchers and practitioners of data mining interested in methods and applications where complex patterns in expressive languages are extracted from text/hypertext data, networks and graphs, event or log data, biological sequences, spatio-temporal data, sensor data and streams, and so on. Twelve contributions were originally submitted, of which seven were accepted for oral presentation. Each submission was evaluated by three independent referees. Besides paper presentations, the scientific programme also featured an invited talk by Sašo Džeroski (Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia).
We would like to thank the invited speaker, all the authors who submitted papers, and all the workshop participants. We are also grateful to the members of the program committee for their thorough work in reviewing the submitted contributions with expertise and patience, and to the members of AI*IA.

Mondello, September 2011
Annalisa Appice, Michelangelo Ceci, Corrado Loglisci, Giuseppe Manco

Organization

Workshop Chairs
Annalisa Appice, Università degli Studi di Bari Aldo Moro (webpage: http://www.di.uniba.it/~appice/; mailto: [email protected])
Michelangelo Ceci, Università degli Studi di Bari Aldo Moro (webpage: http://www.di.uniba.it/~ceci/; mailto: [email protected])
Corrado Loglisci, Università degli Studi di Bari Aldo Moro (webpage: http://www.di.uniba.it/~loglisci/; mailto: [email protected])
Giuseppe Manco, Institute for High Performance Computing and Networks, Italian National Research Council, Rende (CS) (webpage: http://www.icar.cnr.it/manco; mailto: [email protected])

Program Committee
Fabrizio Angiulli (Università della Calabria)
Tania Cerquitelli (Politecnico di Torino)
Sašo Džeroski (Jožef Stefan Institute)
Nicola Fanizzi (Università degli Studi di Bari "Aldo Moro")
Stefano Ferilli (Università degli Studi di Bari "Aldo Moro")
João Gama (University of Porto)
Elio Masciari (Institute for High Performance Computing and Networks, Italian National Research Council, Rende (CS))
Rosa Meo (Università degli Studi di Torino)
Andrea Passerini (Università di Trento)
Zbigniew W. Ras (University of North Carolina and Warsaw University of Technology)
Chiara Renso (KDD Lab, Pisa)
Fabrizio Riguzzi (Università di Ferrara)
Alessandro Sperduti (Università degli Studi di Padova)
Franco Turini (Università di Pisa)
Alfonso Urso (Institute for High Performance Computing and Networks, Italian National Research Council, Palermo)

Table of Contents

An Ontology of Data Mining .....................................................................................
Sašo Džeroski 1

Cooperating Techniques for Extracting Conceptual Taxonomies from Text.............
Stefano Ferilli, Fabio Leuzzi and Fulvio Rotella 2

PatTexSum: A Pattern-based Text Summarizer.........................................................
Elena Baralis, Luca Cagliero, Alessandro Fiori and Saima Jabeen 14

An Expectation Maximization Algorithm for Probabilistic Logic Programs.............
Elena Bellodi and Fabrizio Riguzzi 26

Clustering XML Documents by Structure: a Hierarchical Approach.........................
Gianni Costa, Giuseppe Manco, Riccardo Ortale and Ettore Ritacco 38

Outlier Detection For XML Documents.....................................................................
Giuseppe Manco and Elio Masciari 46

P2P support for OWL-S discovery.............................................................................
Domenico Redavid, Stefano Ferilli and Floriana Esposito 54

Marine Traffic Engineering through Relational Data Mining....................................
Antonio Bruno and Annalisa Appice 66

An Ontology of Data Mining

Sašo Džeroski
Jožef Stefan Institute, Department of Knowledge Technologies, Ljubljana, Slovenia
[email protected]

Abstract. We have developed OntoDM [2][1][3], an ontology of the scientific domain of data mining, aimed at describing data mining investigations. It represents entities such as data, data mining tasks and algorithms, and generalizations (output by the latter). In contrast to other ontologies of data mining, OntoDM is a deep ontology, general-purpose rather than tailor-made, and compliant with best practices in ontology engineering. OntoDM allows us to cover a large part of the diversity in data mining research, including recently developed approaches to mining structured data and constraint-based data mining. The talk will describe the OntoDM ontology and how standard and more recent data mining approaches are represented within it.
Two use cases will be described, one of which concerns QSAR modeling in drug design investigations.

References
1. P. Panov, S. Džeroski, and L. Soldatova. OntoDM: An ontology of data mining. In Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, pages 752–760, Washington, DC, USA, 2008. IEEE Computer Society.
2. P. Panov, L. Soldatova, and S. Džeroski. Representing entities in the OntoDM data mining ontology. In S. Džeroski, B. Goethals, and P. Panov, editors, Inductive Databases and Constraint-Based Data Mining, pages 27–55. Springer, 2010.
3. P. Panov, L. N. Soldatova, and S. Džeroski. Towards an ontology of data mining investigations. In J. Gama, V. S. Costa, A. M. Jorge, and P. Brazdil, editors, Discovery Science, volume 5808 of Lecture Notes in Computer Science, pages 257–271. Springer, 2009.

Cooperating Techniques for Extracting Conceptual Taxonomies from Text

S. Ferilli¹,², F. Leuzzi¹ and F. Rotella¹
¹ Dipartimento di Informatica – Università di Bari
² Centro Interdipartimentale per la Logica e sue Applicazioni – Università di Bari
[email protected], {fabio.leuzzi, rotella.fulvio}@gmail.com

Abstract. The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, it is important to have conceptual taxonomies that express common sense and implicit relationships among concepts. This work proposes a mix of several techniques that are brought to cooperation for learning such taxonomies automatically. Although the work is at a preliminary stage, interesting initial results suggest continuing to extend and improve the approach.

1 Introduction

The spread of electronic documents and document repositories has generated the need for automatic techniques to understand and handle document content, in order to help users satisfy their information needs without being overwhelmed by the huge amount of available data.
Since most of these data are in textual form, and since text explicitly refers to concepts, most work has focussed on Natural Language Processing (NLP). Automatically obtaining full text understanding is not trivial, due to the intrinsic ambiguity of natural language and to the huge amount of common sense and linguistic/conceptual background knowledge required. Nevertheless, even small portions of such knowledge may significantly improve understanding performance, at least in limited domains. Although standard tools, techniques and representation formalisms are still missing, lexical and/or conceptual taxonomies can provide useful support to many NLP tasks. Unfortunately, manually building this kind of resource is very costly and error-prone, which is a strong motivation for the automatic construction of conceptual networks by mining large amounts of documents in natural language.

This work aims at partially simulating some human abilities in this field, such as extracting the concepts expressed in given texts and assessing their relevance; obtaining a practical description of the concepts underlying the terms, which in turn would allow generalizing concepts having similar descriptions; and applying some kind of reasoning 'by association', which looks for possible indirect connections between two identified concepts. The system takes as input texts in natural language, and processes them to build a conceptual network that supports the above objectives. The resulting network can be visualized by means of a suitable interface and translated into a First-Order Logic (FOL) formalism, to allow the subsequent exploitation of logic inference engines in applications that use that knowledge. Our proposal consists of a mix of existing tools and techniques that are brought to cooperation in order to reach the above objectives, extended and supported by novel techniques when needed. The next section briefly recalls related work.
Then, Section 3 describes the mixed approach and discusses the novel parts in more detail. A preliminary evaluation of the proposal is reported in Section 4, while Section 5 concludes the paper and outlines future work issues.

2 Related Work

Many works aim at building taxonomies and ontologies from text. A few examples: [10, 9] build ontologies from natural language text by labelling only the taxonomical relations, while we also label non-taxonomical ones with actions (verbs); [14] builds a taxonomy considering only concepts that are present in a domain but do not appear in others, while we are interested in all recognized concepts independently of their being generic or domain-specific; [13] defines a language to build formal ontologies, while we are interested in the lexical level.

As regards our proposal, a first functionality that we needed is syntactic analysis of the input text. We exploited the Stanford Parser and Stanford Dependencies [7, 1], two very effective tools that can identify the most likely syntactic structure of sentences (including active and passive forms), and label their components as 'subject' or '(direct/indirect) object'. Moreover, they normalize the words in the input text using lemmatization instead of stemming, which preserves their grammatical role and is easier for humans to read. We also exploited the Weka project [5], which provides a set of tools to carry out several learning and Data Mining (DM) tasks, including clustering, classification, regression, discovery of association rules, and visualization.

Another technique that inspired our work is the one described in [8] to semi-automatically extract a domain-specific ontology from free text, without using external resources but focussing the analysis on Hub Words (i.e., words having high frequency).
After building the ontology, each Hub Word t is ranked according to its 'Hub Weight':

    W(t) = α · w0 + β · n + γ · Σ_{i=1..n} w(t_i)

where w0 is a given initial weight, n is the number of relationships in which t is involved, w(t_i) is the tf·idf weight of the i-th word related to t, and α + β + γ = 1.

A task aimed at identifying the most important words in a text, to be used as main concepts for inclusion in the taxonomy, is Keyword Extraction (KE). Among the several proposals available in the literature, we selected two techniques that can work on single documents (rather than requiring a whole corpus) and are based on different and complementary approaches, so that together they can provide an added value. The quantitative approach in [12] is based on the assumption that the relevance of a term in a document is proportional to how frequently it co-occurs with a subset of the most frequent terms in that document. The χ² statistic is exploited to check whether the co-occurrences establish a significant deviation from chance. To improve orthogonality, the reference frequent terms are preliminarily grouped exploiting similarity-based clustering (using similar distribution of co-occurrence with other terms) and pairwise clustering (based on frequent co-occurrences). The qualitative approach in [3], based on WordNet [2] and its extension WordNet Domains [11], focusses on the meaning of terms instead of their frequency and determines keywords as terms associated to the concepts referring to the main subject domain discussed in the text. It exploits a density measure that determines how much a term is related to different concepts (in case of polysemy), how much a concept is associated to a given domain, and how relevant a domain is for a text.

Lastly, some steps of our technique need to assess the similarity among concepts in a given conceptual taxonomy.
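As a concrete illustration, the Hub Weight above can be sketched as follows. This is a minimal sketch: the function name and the particular values of α, β, γ are hypothetical, since the original work only requires that they sum to 1.

```python
def hub_weight(w0, related_weights, alpha=0.5, beta=0.3, gamma=0.2):
    """Hub Weight W(t) = alpha*w0 + beta*n + gamma*sum(w(t_i)), where n is the
    number of relationships involving t and related_weights holds the tf*idf
    weights w(t_i) of the words related to t. alpha + beta + gamma must be 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    n = len(related_weights)  # number of relationships in which t is involved
    return alpha * w0 + beta * n + gamma * sum(related_weights)
```

A hub word involved in many relationships with heavily weighted words thus accumulates weight through both the β and γ terms.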
A classical, general measure is the Hamming distance [6], which works on pairs of equal-length vectorial descriptions and counts the minimum number of changes required to turn one into the other. Other measures, specific to conceptual taxonomies, are sf_Fa [4] (which adopts a global approach based on the whole set of hypernyms) and sf_WP [16] (which focuses on a particular path between the nodes to be compared).

3 Proposed Approach

In the following, we will assume that each term in the text corresponds to an underlying concept (phrases can be preliminarily extracted using suitable techniques, and handled as single terms). A concept is described by a set of characterizing attributes and/or by the concepts that interact with it in the world described by the corpus. The outcome is a graph, where nodes are the concepts recognized in the text, and edges represent the relationships among these nodes, expressed by verbs in the text (whose direction denotes their role in the relationship). This can be interpreted as a semantic network.

3.1 Identification of Relevant Concepts

The input text is preliminarily processed by the Stanford Parser in order to extract the syntactic structure of the sentences that make it up. In particular, we are interested only in (active or passive) sentences of the form subject–verb–(direct/indirect) complement, from which we extract the corresponding triples ⟨subject, verb, complement⟩ that will provide the concepts (the subjects and complements) and attributes (verbs) for the taxonomy. Indirect complements are treated as direct ones, by embedding the corresponding preposition into the verb: e.g., to put, to put on and to put across are considered as three different verbs, and the sentence "John puts on a hat" returns the triple ⟨John, put on, hat⟩, in which John and hat are concepts associated to the attribute put on, indicating that John can put on something, while a hat can be put on.
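The preposition-embedding convention above can be sketched as follows; the function name and its arguments are illustrative assumptions, not part of the original system:

```python
def make_triple(subject, verb, complement, preposition=None):
    """Build a <subject, verb, complement> triple; an indirect complement is
    treated as a direct one by embedding its preposition into the verb, so
    'put' and 'put on' become two distinct attributes."""
    if preposition is not None:
        verb = f"{verb} {preposition}"
    return (subject, verb, complement)
```

For the sentence "John puts on a hat", `make_triple("John", "put", "hat", preposition="on")` yields `("John", "put on", "hat")`.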
Triples/sentences involving the verb 'to be' or nouns with adjectives provide immediate hints to build the subclass structure in the taxonomy: for instance, "The dog is a domestic animal..." yields the relationships is_a(dog, animal) and is_a(domestic animal, animal).

The whole set of triples is represented in a Concepts×Attributes matrix V that recalls the classical Terms×Documents Vector Space Model (VSM) [15]. The matrix is filled according to the following scheme (resembling tf · idf):

    V_{i,j} = ( f_{i,j} / Σ_k f_{k,j} ) · log( |A| / |{j : c_i ∈ a_j}| )

where:
– f_{i,j} is the frequency of the i-th concept co-occurring with the j-th attribute;
– Σ_k f_{k,j} is the sum of the co-occurrences of all concepts with the j-th attribute;
– A is the entire set of attributes;
– |{j : c_i ∈ a_j}| is the number of attributes with which the concept c_i co-occurs (i.e., for which f_{i,j} ≠ 0).

Its values represent the term frequency tf, as an indicator of the relevance of the term in the text at hand (no idf is considered, to allow the incremental addition of new texts without the need of recomputing this statistic). A clustering step (typical in Text Mining) can be performed on V to identify groups of elements having similar features (i.e., involved in the same verbal relationships). The underlying idea is that concepts belonging to the same cluster should share some semantics. For instance, if the concepts dog, John, bear, meal, cow all share the attributes eat, sleep, drink, run, they might be sufficiently close to each other to fall in the same cluster, indicating a possible underlying semantics (indeed, they are all animals). Since the number of clusters to be found is not known in advance, we exploit the EM clustering approach provided by Weka, based on the Euclidean distance applied to the row vectors representing concepts in V.
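The construction of V from the extracted triples can be sketched as follows. This is a minimal sketch under the assumption that both the subject and the complement of each triple co-occur once with its verb; the function name is hypothetical.

```python
import math
from collections import defaultdict

def build_concept_attribute_matrix(triples):
    """Concepts x Attributes matrix V from <subject, verb, complement> triples:
    V[i,j] = (f_ij / sum_k f_kj) * log(|A| / #attributes concept i occurs with)."""
    freq = defaultdict(lambda: defaultdict(int))  # freq[concept][attribute]
    for subj, verb, compl in triples:
        freq[subj][verb] += 1   # the subject co-occurs with the verb...
        freq[compl][verb] += 1  # ...and so does the complement
    attributes = {verb for _, verb, _ in triples}
    col_sum = {a: sum(freq[c][a] for c in freq) for a in attributes}
    n_attrs = {c: sum(1 for a in attributes if freq[c][a] > 0) for c in freq}
    V = {}
    for c in list(freq):
        for a in attributes:
            if freq[c][a] > 0:
                V[(c, a)] = (freq[c][a] / col_sum[a]) * math.log(len(attributes) / n_attrs[c])
    return V
```

A sparse dictionary keyed by (concept, attribute) stands in for the dense matrix here; any matrix library would do equally well.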
Then, the application to the input texts of various Keyword Extraction techniques, based on different (and complementary) aspects, perspectives and theoretical principles, allows identifying relevant concepts. We use the quantitative approach based on co-occurrences k_c [12], the qualitative one based on WordNet k_w [3], and a psychological one based on word positions k_p. The psychological approach is novel, and is based on the consideration that humans tend to place relevant terms/concepts toward the start and end of sentences and discourses, where the attention of the reader/listener is higher. In our approach, the chance of a term being a keyword is assigned simply according to its position in the sentence/discourse, using a mixture model obtained by mixing two Gaussian curves whose peaks are placed around the extremes of the portion of text to be examined.

The information about concepts and attributes is exploited to compute a Relevance Weight W(·) for each node in the network. Then, nodes are ranked by decreasing Relevance Weight, and a suitable cutpoint in the ranking is determined to distinguish relevant concepts from irrelevant ones. We cut the list at the first item c_k in the ranking such that:

    W(c_k) − W(c_{k+1}) ≥ p · max_{i=0,...,n−1} ( W(c_i) − W(c_{i+1}) )

i.e., the difference in relevance weight from the next item is greater than or equal to the maximum difference between all pairs of adjacent items, smoothed by a user-defined parameter p ∈ [0, 1].

Computation of Relevance Weight. Identifying key concepts in a text is more complex than just identifying keywords.
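The cutpoint rule described above can be sketched as follows; this minimal sketch assumes the weights have already been sorted in decreasing order.

```python
def cut_ranking(weights, p):
    """Given relevance weights sorted in decreasing order, keep the items up to
    (and including) the first position whose gap to the next item is at least
    p times the maximum gap between adjacent items (p in [0, 1])."""
    gaps = [weights[i] - weights[i + 1] for i in range(len(weights) - 1)]
    if not gaps:
        return list(weights)
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return list(weights[:k + 1])  # cut after the first qualifying gap
    return list(weights)
```

With p = 1.0 the list is cut at the (first occurrence of the) largest gap; lower values of p allow an earlier, smaller gap to trigger the cut, keeping fewer concepts.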
Inspired by the Hub Words approach, we compute for each extracted concept a Relevance Weight expressing its importance in the extracted network, by combining different values associated to different perspectives: given a node/concept c,

    W(c) = α · w(c) / max_c̄ w(c̄) + β · e(c) / max_c̄ e(c̄) + γ · ( Σ_{(c,c̄)} w(c̄) ) / e(c) + δ · ( d_M − d(c) ) / d_M + ε · k(c) / max_c̄ k(c̄)

where α, β, γ, δ, ε are weights summing up to 1, and:
– w(c) is an initial weight assigned to node c;
– e(c) is the number of edges of any kind involving node c;
– (c, c̄) denotes an edge involving node c;
– d_M is the largest distance between any two nodes in the whole vector space;
– d(c) is the distance of node c from the center of the corresponding cluster;
– k(c) is the keyword weight associated to node c.

The first term represents the initial weight provided by V, normalized by the maximum initial weight among all nodes. The second term considers the number of connections (edges) of any category (verbal or taxonomic relationships) in which c is involved, normalized by the maximum number of connections of any node in the network. The third term (Neighborhood Weight Summary) considers the average initial weight of all neighbors of c (by just summing up the weights, the final value would be proportional to the number of neighbors, which is already considered in the previous term). The fourth term represents the Closeness to Center of the cluster, i.e. the distance of c from the center of its cluster, normalized by the maximum distance between any two instances in the whole vector space. The last term takes into account the outcome of the three KE techniques on the given text, suitably weighted:

    k(c) = ζ · k_c(c) + η · k_w(c) + θ · k_p(c)

where ζ, η and θ are weights ranging in [0, 1] and summing up to 1. These terms were designed to be independent of each other. A partial interaction is present only between the second and the third ones, but it is significantly smoothed by the applied normalizations.
3.2 Generalization of Similar Concepts

To generalize two or more concepts (G generalizes A if anything that can be labeled as A can be labeled as G as well, but not vice versa), we propose to exploit WordNet and use the set of connections of each concept with its direct neighbors as a description of the underlying concept. Three steps are involved in this procedure:
1. Grouping similar concepts, in which all concepts are grossly partitioned to obtain subsets of similar concepts;
2. Word Sense Disambiguation, which associates a single synset to each term by solving possible ambiguities using the domain of discourse (Algorithm 1);
3. Computation of taxonomic similarity, in which WordNet is exploited to confirm the validity of the groups found in step 1 (Algorithm 2).

As to step 1, we build a Concepts×Concepts matrix C where C_{i,j} = 1 if there is at least one relationship between concepts i and j, and C_{i,j} = 0 otherwise. Each row in C can be interpreted as a description of the associated concept in terms of its relationships to other concepts, and exploited for applying a pairwise clustering procedure based on the Hamming distance. In detail, for each possible pair of different row and column items whose corresponding row and column are not null and whose similarity passes a given threshold: if neither is in a cluster yet, a new cluster containing those objects is created; otherwise, if either item is already in a cluster, the other is added to the same cluster; otherwise (both already belong to different clusters) their clusters are merged. Items whose similarity with all other items does not pass the threshold result in singleton clusters.

This clustering procedure alone might not be reliable, because terms that occur seldom in the corpus have few connections (which would affect their cluster assignment due to underspecification) and because the expressive power of this formalism is too low to represent complex contexts (which would affect even more important concepts).
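The pairwise clustering step can be sketched as follows. One assumption is made explicit here: "similarity passes a given threshold" is rendered as the normalized Hamming distance not exceeding a distance bound, since the paper does not fix the exact similarity criterion.

```python
def pairwise_cluster(C, concepts, threshold):
    """Agglomerate concepts whose binary relationship vectors (rows of C) lie
    within a normalized Hamming-distance threshold; the rest stay singletons."""
    cluster_of = {}  # concept -> the (mutable) cluster set it belongs to
    clusters = []
    for i, a in enumerate(concepts):
        for b in concepts[i + 1:]:
            if not any(C[a]) or not any(C[b]):
                continue  # skip null descriptions
            dist = sum(x != y for x, y in zip(C[a], C[b])) / len(C[a])
            if dist > threshold:
                continue
            if a not in cluster_of and b not in cluster_of:
                cluster = {a, b}            # neither clustered: new cluster
                clusters.append(cluster)
                cluster_of[a] = cluster_of[b] = cluster
            elif a in cluster_of and b not in cluster_of:
                cluster_of[a].add(b)        # b joins a's cluster
                cluster_of[b] = cluster_of[a]
            elif b in cluster_of and a not in cluster_of:
                cluster_of[b].add(a)        # a joins b's cluster
                cluster_of[a] = cluster_of[b]
            elif cluster_of[a] is not cluster_of[b]:
                merged = cluster_of[a] | cluster_of[b]  # merge both clusters
                clusters.remove(cluster_of[a])
                clusters.remove(cluster_of[b])
                clusters.append(merged)
                for x in merged:
                    cluster_of[x] = merged
    return clusters + [{c} for c in concepts if c not in cluster_of]
```

Items that never pass the threshold against any other item come back as singleton clusters, matching the behavior described above.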
For this reason, the support of an external resource might be desirable. We consider WordNet a sensible candidate for this, and try to map each concept in the network to the corresponding synset (a non-trivial problem due to the typical polysemy of many words) using the one domain per discourse assumption as a simple criterion for Word Sense Disambiguation: the meanings of close words in a text tend to refer to the same domain, and such a domain is probably the dominant one among the words in that portion of text. Thus, WordNet allows checking and confirming/rejecting the similarity of concepts belonging to the same cluster, by considering all possible pairs of words whose similarity is above a given threshold. The pair (say {A, B}) with the largest similarity value is generalized with their most specific common subsumer (hypernym) G in WordNet; then the other pairs in the same cluster that share at least one of the currently generalized terms, and whose least common hypernym is again G, are progressively added to the generalization. Similarity is determined using a mix of the measures proposed in [4] and in [16], to consider both the global similarity and the actual viability of the specific candidate generalization:

    sf(A, B) = sf_Fa(A, B) · sf_WP(A, B)

Algorithm 1: Find the "best synset" for a word
Input: word t, list of domains with weights.
Output: best synset for word t.

  best_synset ← empty
  best_domain ← empty
  for all synsets s_t of t do
    max_weight ← −∞
    optimal_domain ← empty
    for all domains d_s of s_t do
      if weight(d_s) > max_weight then
        max_weight ← weight(d_s)
        optimal_domain ← d_s
      end if
    end for
    if max_weight > weight(best_domain) then
      best_synset ← s_t
      best_domain ← optimal_domain
    end if
  end for

3.3 Reasoning 'by association'

Reasoning 'by association' means finding a path of pairwise related concepts that establishes an indirect interaction between two concepts c′ and c′′ in the semantic network.
We propose to look for such a path using a Breadth-First Search (BFS) technique, applied to both concepts under consideration. The expansion steps of the two processes are interleaved, checking at each step whether the new set of concepts just introduced has a non-empty intersection with the set of concepts of the other process. When this happens, all the concepts in such an intersection identify one or more shortest paths connecting c′ and c′′, which can be retrieved by tracing back the parent nodes at each level in both directions up to the roots c′ and c′′. Since this path is made up of concepts only, to obtain a more sensible 'reasoning' it must be filled with the specific kind of interaction represented by the labels of the edges (verbs) that connect adjacent concepts in the chain.

4 Evaluation

The proposed approach was evaluated using ad-hoc tests that may indicate its strengths and weaknesses. Due to lack of space, only a few selected outcomes will be reported here. Although preliminary, these results seem enough to suggest that the approach is promising.

Algorithm 2: Effective generalization search
Input: the set C of clusters returned by pairwise clustering; similarity threshold T.
Output: set of candidate generalizations.

  generalizations ← empty set
  for all c ∈ C do
    good_pairs ← empty set
    for all pair(O_i, O_j) with i, j ∈ c do
      if similarity_score(pair(O_i, O_j)) > T then
        good_pairs.add(pair(O_i, O_j), wordnet_hypernym(pair(O_i, O_j)))
      end if
      if good_pairs ≠ empty set then
        new_set ← {good_pairs.getBestPair, good_pairs.getSimilarPairs}
        generalizations.add(new_set)
      end if
    end for
  end for

where:
– good_pairs: all pairs that passed T, together with the most specific common hypernym discovered in WordNet;
– good_pairs.getBestPair: the pair that has the best similarity score;
– good_pairs.getSimilarPairs: the pairs that involve one of the two objects of the best pair, that have satisfied the similarity score and have the same hypernym as the best pair;
– wordnet_hypernym: the most specific common hypernym discovered in WordNet for the two passed objects.

The following default weights for the Relevance Weight components were empirically adopted:
– α = 0.1 to increase the impact of the most frequent concepts (according to tf);
– β = 0.1 to keep low the impact of co-occurrences between nodes;
– γ = 0.3 to increase the impact of less frequent nodes if they are linked to relevant nodes;
– δ = 0.25 to increase the impact of the clustering outcome;
– ε = 0.25 as for δ, to increase the impact of keywords;
while those for the KE techniques were taken as ζ = 0.45, η = 0.45 and θ = 0.1 (to reduce the impact of the psychological perspective, which is more naive compared to the others).

4.1 Recognition of relevant concepts

We exploited a dataset made up of documents concerning social networks, on socio-political and economic subjects. Table 1 shows on the top the settings used for three different runs, concerning the Relevance Weight components W = A + B + C + D + E and the cutpoint value p for selecting relevant concepts. The corresponding outcomes (at the bottom) show that the default set of parameter values yields 3 relevant concepts, having very close weights. Component D determines the inclusion of the very unfrequent concepts (see column A) access and subset (0.001 and 6.32E-4, respectively) as relevant ones. They benefit from the large initial weight of network, to which they are connected. Using the second set of parameter values, the predominance of component A in the overall computation,

Table 1. Three parameter choices and corresponding outcome of relevant concepts.
Test # | α    | β    | γ    | δ    | ε    | p
1      | 0.10 | 0.10 | 0.30 | 0.25 | 0.25 | 1.0
2      | 0.20 | 0.15 | 0.15 | 0.25 | 0.25 | 0.7
3      | 0.15 | 0.25 | 0.30 | 0.15 | 0.15 | 1.0

Test # | Concept    | A       | B     | C      | D     | E     | W
1      | network    | 0.100   | 0.100 | 0.021  | 0.178 | 0.250 | 0.649
1      | access     | 0.001   | 0.001 | 0.154  | 0.239 | 0.250 | 0.646
1      | subset     | 6.32E-4 | 0.001 | 0.150  | 0.239 | 0.250 | 0.641
2      | network    | 0.200   | 0.150 | 0.0105 | 0.178 | 0.250 | 0.789
3      | network    | 0.150   | 0.25  | 0.021  | 0.146 | 0.150 | 0.717
3      | user       | 0.127   | 0.195 | 0.022  | 0.146 | 0.150 | 0.641
3      | number     | 0.113   | 0.187 | 0.022  | 0.146 | 0.150 | 0.619
3      | individual | 0.103   | 0.174 | 0.020  | 0.146 | 0.150 | 0.594

and the cutpoint threshold lowered to 70%, cause the frequency-based approach associated to the initial weight to give clear predominance to the first concept in the ranking. Using the third set of parameter values, the threshold is again 100% and the other weights are such that the frequency-based approach expressed by component A is balanced by the number of links affecting the node and by the weight of its neighbors. Thus, both nodes with highest frequency and nodes that are central in the network are considered relevant. Overall, the concept network is always present, while the other concepts significantly vary depending on the parameter values.

4.2 Concept Generalization

Two toy experiments are reported for concept generalization. The maximum threshold for the Hamming distance was set to 0.001 and 0.0001, respectively, while the minimum threshold of taxonomic similarity was fixed at 0.4 in both. Two datasets on social networks were exploited: a book (B) and a collection of scientific papers (P) concerning socio-political and economic discussions. Observing the outcome, three aspects can be emphasized: the level of detail of the concept descriptions that satisfy the criterion in pairwise clustering, the intuitiveness of the generalizations supported by WordNet Domains, and the values of the single conceptual similarity measures applied to synsets in WordNet.

Table 2. Pairwise clustering statistics.

Dataset | MNC (0.001) | MNC (0.0001) | Vector size
B       | 3           | 2            | 1838
P       | 3           | 1            | 1599
B+P     | 5           | 1            | 3070
In Table 2, MNC is the Max Number of Connections detected among all concept descriptions that have been agglomerated at least once in the pairwise clustering. Note that all descriptions which have never been agglomerated are considered as single instances in separate clusters. Hence, the concepts recognized as similar have very few neighbors, suggesting that concepts become ungeneralizable as their number of connections grows. Although in general this is a limitation, such a cautious behavior is to be preferred until an effective generalization technique is provided that ensures the quality of its outcomes (wrong generalizations might spoil subsequent results in cascade).

Table 3. Generalizations for different pairwise clustering thresholds (Thr.) and minimum similarity threshold 0.4 (top), and corresponding conceptual similarity scores (bottom).

Thr.   | Dataset | Subsumer (Subs. Domain)                      | Concepts (Conc. Domain)
0.001  | B       | parent [110399491] (person)                  | adopter [109772448] (factotum); dad [109988063] (person)
0.001  | P       | human action [100030358] (factotum)          | discussion [107138085] (factotum); judgement [100874067] (law)
0.001  | B+P     | dr. [110020890] (medicine)                   | psychiatrist [110488016] (medicine); abortionist [109757175] (medicine); specialist [110632576] (medicine)
0.0001 | B       | physiological state [114034177] (physiology) | dependence [114062725] (physiology); affliction [114213199] (medicine)
0.0001 | P       | mental attitude [106193203] (psychology)     | marxism [106215618] (politics); standpoint [106210363] (factotum)
0.0001 | B+P     | feeling [100026192] (psychological features) | dislike [107501545] (psychological features); satisfaction [107531255] (psychological features)

# | Pairs                     | Fa score | WP score | Score
1 | adopter, dad              | 0.733    | 0.857    | 0.628
2 | discussion, judgement     | 0.731    | 0.769    | 0.562
3 | psychiatrist, abortionist | 0.739    | 0.889    | 0.657
3 | psychiatrist, specialist  | 0.728    | 0.889    | 0.647
4 | dependence, affliction    | 0.687    | 0.750    | 0.516
5 | marxism, standpoint       | 0.661    | 0.625    | 0.413
6 | dislike, satisfaction     | 0.678    | 0.714    | 0.485
It is worth emphasizing that not only are sensible generalizations returned, but their domain is also consistent with those of the generalized concepts. This happens with both thresholds (0.001 and 0.0001), which return 23 and 30 candidate generalizations, respectively (due to space limitations, Table 3 reports only a representative sample, including a generalization for each dataset used). Analyzing the two conceptual similarity measures used for generalization reveals that, for almost all pairs, both yield very high values, leading to final scores that clearly exceed the 0.4 threshold, and sf_WP is always greater than sf_Fa. Since the former is more related to a specific path, and hence to the goodness of the chosen subsumer, this confirms the previous outcomes (suggesting that the chosen subsumer is close to the generalized concepts). In the sample reported in Table 3, only case 5 disagrees with these considerations.

4.3 Reasoning by association

Table 4 shows a sample of outcomes of reasoning by association. E.g., case 5 explains the relationship between freedom and internet as follows: the adult writes about freedom and uses platform, which is recognized as a technology, as is internet.

Table 4. Examples of reasoning by association (start and target nodes in emphasis): subject-verb-complement chains for five cases; e.g., case 5 includes adult-write-freedom, adult-use-platform, and internet-acknowledge-technology.

5 Conclusions

This work proposed an approach to automatic conceptual taxonomy extraction from natural language texts.
It works by mixing different techniques in order to identify relevant terms/concepts in the text, group them by similarity, and generalize them to identify portions of a hierarchy. Preliminary experiments show that the approach can be viable, although extensions and refinements are needed to improve its effectiveness. In particular, a study on how to set suitable standard weights for concept relevance assessment is needed. A reliable outcome might help users understand the text content and enable machines to automatically perform some kind of reasoning on the resulting taxonomy.

References

[1] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure trees. In LREC, 2006.
[2] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
[3] S. Ferilli, M. Biba, T.M. Basile, and F. Esposito. Combining qualitative and quantitative keyword extraction methods with document layout analysis. In Post-proceedings of the 5th Italian Research Conference on Digital Library Management Systems (IRCDL-2009), pages 22–33, 2009.
[4] S. Ferilli, M. Biba, N. Di Mauro, T.M. Basile, and F. Esposito. Plugging taxonomic similarity in first-order logic horn clauses comparison. In Emergent Perspectives in Artificial Intelligence, Lecture Notes in Artificial Intelligence, pages 131–140. Springer, 2009.
[5] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
[6] R.W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.
[7] Dan Klein and Christopher D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.
[8] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based on hub words.
In ISMIS'03, pages 93–97, 2003.
[9] A. Maedche and S. Staab. Mining ontologies from text. In EKAW, pages 189–202, 2000.
[10] A. Maedche and S. Staab. The Text-To-Onto ontology learning environment. In ICCS-2000 – Eighth International Conference on Conceptual Structures, Software Demonstration, 2000.
[11] Bernardo Magnini and Gabriela Cavaglià. Integrating subject field codes into WordNet. In LREC, pages 1413–1418, 2000.
[12] Yutaka Matsuo and Mitsuru Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(1):157–169, 2004.
[13] N. Ogata. A formal ontology discovery from web documents. In Web Intelligence: Research and Development, First Asia-Pacific Conference (WI 2001), number 2198 in Lecture Notes in Artificial Intelligence, pages 514–519. Springer-Verlag, 2001.
[14] Paola Velardi, Roberto Navigli, Alessandro Cucchiarelli, and Francesca Neri. Evaluation of OntoLearn, a methodology for automatic population of domain ontologies. In Paul Buitelaar, Philipp Cimiano, and Bernardo Magnini, editors, Ontology Learning from Text: Methods, Applications and Evaluation. IOS Press, 2006.
[15] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18:613–620, November 1975.
[16] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, Morristown, NJ, USA, 1994. Association for Computational Linguistics.

PatTexSum: A pattern-based text summarizer

Elena Baralis, Luca Cagliero, Alessandro Fiori, and Saima Jabeen
elena.baralis,luca.cagliero,alessandro.fiori,[email protected]
Politecnico di Torino. Corso Duca degli Abruzzi 24, 10129 Torino, Italy. Tel: +390110907194 Fax: +390110907099

Abstract. In the last decade the growth of the Internet has made a huge amount of textual documents available in electronic form.
Text summarization is commonly based on clustering or graph-based methods and usually considers the bag-of-word sentence representation. Frequent itemset mining is a widely used exploratory technique to discover relevant correlations among data. The well-established application of frequent itemsets to large transactional datasets prompts their usage in the context of document summarization as well. This paper proposes a novel multi-document summarizer, namely PatTexSum (Pattern-based Text Summarizer), that is mainly based on a pattern-based model, i.e., a model composed of frequent itemsets. Unlike previously proposed approaches, PatTexSum selects the most representative and non-redundant sentences to include in the summary by considering both (i) the most informative and non-redundant itemsets extracted from document collections tailored to the transactional data format, and (ii) a sentence score based on the tf-idf statistics. Experiments conducted on a collection of real news articles show the effectiveness of the proposed approach.

1 Introduction

From the birth of the Internet on, analysts have been able to access and analyze progressively larger data collections. Since the large majority of the information is available in textual form, a challenging task is to convey the most relevant information provided by textual documents into short and concise summaries. Many document summarization approaches have been proposed in the literature. Most of them select the most representative sentences to include in the summaries by means of the following approaches: (i) clustering (e.g., [13, 20]), (ii) graph-based methods (e.g., [12]), and (iii) linear programming (e.g., [15]). Clustering-based approaches exploit clustering algorithms to group sentences and select representatives among each group.
For instance, MEAD [13] evaluates the similarity between the document sentences and the centroids and selects, similarly to [6], the most relevant sentences among each document cluster based on the tf-idf statistical measure [16]. Differently, in [20] an incremental hierarchical clustering algorithm is exploited to update summaries over time. The graph-based approaches try to represent correlations among sentences by means of a graph-based model. According to this model, sentences are represented by graph nodes, while the edges weigh the strength of the correlation between pairs of sentences. The most representative sentences are selected according to graph-based indexing strategies. For instance, [12] proposes to rank sentences based on the eigenvector centrality computed by means of the well-known PageRank algorithm [5]. Finally, the linear programming methods identify the most representative sentences by maximizing ad hoc objective functions. For instance, in [15] the authors formalized the extractive summarization task as a maximum coverage problem with knapsack constraints based on the bag-of-word sentence representation and enforce additional constraints based on sentence relevance within each document. Most of the aforementioned approaches rely on the bag-of-word sentence representation and make use of well-founded statistical measures (e.g., the tf-idf measure [16]).

Frequent itemset mining is a widely used exploratory technique, first introduced in [1] in the context of market basket analysis, to discover correlations that frequently occur in the analyzed data. A number of approaches focus on discovering frequent itemsets from transactional data and then selecting their most informative yet non-redundant subset by means of postpruning. To address this issue, static approaches (e.g., [4, 8]) compare the observed frequency (i.e., the support) of each itemset in the source transactional data against some null hypothesis (i.e., their expected frequency).
Differently, dynamic approaches (e.g., [9, 18]) often make use of the maximum entropy model to take previously selected patterns into account and, thus, reduce model redundancy. Although the discovery and selection of valuable frequent itemsets from transactional data is well-established, to the best of our knowledge their usage in document summarization has never been investigated yet.

PatTexSum (Pattern-based Text Summarizer) is a novel multi-document summarization approach that exploits a pattern-based model to select the most representative and non-redundant sentences belonging to the document collection. It focuses on combining the effectiveness of pattern-based models, composed of highly informative and non-redundant itemsets, to represent correlations among data with the discriminating power of a sentence evaluation measure based on the tf-idf statistics. Pattern-based model generation focuses on extracting and selecting valuable frequent itemsets from a transactional representation of the document collection. To this aim, an efficient and effective approach, recently proposed in [11] in the context of transactional data, is adopted. [11] succinctly summarizes transactional data by adopting a heuristic to solve the maximum entropy model that allows itemsets to be evaluated on the fly during their extraction. This feature makes this approach particularly appealing for its application in text summarization. To effectively discriminate among sentences, an evaluation score, computed from their bag-of-word representation and based on the well-founded tf-idf statistic [16], is also considered. PatTexSum combines the information discovered from both transactional and bag-of-word data representations and adopts an effective greedy approach, first proposed in [2], to solve the problem of selecting sentences that best cover the pattern-based model. To evaluate the PatTexSum performance, a suite of experiments on a collection of news articles has been performed.
Results, reported in Section 3, show that PatTexSum significantly outperforms widely used summarizers in terms of precision, recall, and F-measure.

This paper is organized as follows. Section 2 presents the proposed method and thoroughly describes its main steps. Section 3 assesses the effectiveness of the PatTexSum framework in summarizing textual documents, while Section 4 draws conclusions and presents future developments of this work.

2 The PatTexSum method

PatTexSum focuses on summarizing collections of textual documents by exploiting a two-way data representation. Pattern-based model generation relies on a transactional representation of the document sentences, while the relevance score evaluation, based on the tf-idf statistic, relies on the bag-of-word sentence representation. A greedy approach is used to effectively combine knowledge discovered from both data representations and select the most representative sentences to include in the summary. Figure 1 shows the main steps behind the proposed approach, which will be thoroughly described in the following.

Fig. 1. The PatTexSum method

2.1 Document representation

PatTexSum exploits two different document/sentence representations: (i) the traditional bag-of-word (BOW) representation and (ii) the transactional data format. The raw document content is first preprocessed to make it suitable for the data mining and knowledge discovery process. Stopwords, numbers, and website URLs are removed to avoid noisy information, while the WordNet stemming algorithm [3] is applied to reduce document words to their base or root form (i.e., the stem). Let D = {d1, ..., dn} be a document collection, where each document dk is composed of a set of sentences Sk = {s1k, ..., szk}. Documents are composed of a sequence of sentences, each one composed of a set of words.
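The preprocessing just described can be sketched as follows; the stopword list is a hand-picked stand-in and WordNet stemming is omitted to keep the snippet self-contained:

```python
import re

# Hypothetical stopword list (the real pipeline uses a full list plus
# WordNet-based stemming [3], not reproduced here).
STOPWORDS = {'the', 'a', 'of', 'in', 'is', 'and', 'to'}

def to_transaction(sentence):
    """Tokenize, drop URLs, numbers, and stopwords, and return the set of
    distinct terms: the transactional view tr_jk of a sentence s_jk."""
    sentence = re.sub(r'https?://\S+', ' ', sentence)  # remove website URLs
    tokens = re.findall(r'[a-z]+', sentence.lower())   # drops numbers/punctuation
    return {t for t in tokens if t not in STOPWORDS}

print(to_transaction("The growth of the Internet in 2011 http://example.org"))
```

The resulting set of distinct terms serves both as the BOW vocabulary of the sentence and as the transaction items used in the next section.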
The BOW representation of the j-th sentence s_jk belonging to the k-th document d_k of the collection D is the set of all word stems (i.e., terms) occurring in s_jk. Consider now the set tr_jk = {w_1, ..., w_l}, where tr_jk ⊆ s_jk and w_q ≠ w_r for all q ≠ r. It includes the subset of distinct terms occurring in the sentence s_jk. To tailor document sentences to the transactional data format, we consider each document sentence as a transaction whose items are distinct terms taken from its BOW representation, i.e., tr_jk is the transaction that corresponds to the document sentence s_jk. A transactional representation T of the document collection D is the union of all transactions tr_jk corresponding to each sentence s_jk belonging to any document d_k ∈ D.

The document collection is associated with the statistical measure of the term frequency-inverse document frequency (tf-idf) that evaluates the relevance of a word in the whole collection. A more detailed description of the tf-idf statistic follows. The whole document content can be represented in matrix form TC, in which each row represents a distinct term of the document collection while each column corresponds to a document. Each element tc_ik of the matrix TC is the tf-idf value associated with a term w_i in the document d_k belonging to the whole collection D. It is computed as follows:

    tc_ik = ( n_ik / Σ_{r ∈ {q : w_q ∈ d_k}} n_rk ) · log( |D| / |{d_k ∈ D : w_i ∈ d_k}| )    (1)

where n_ik is the number of occurrences of the i-th term w_i in the k-th document d_k, D is the collection of documents, Σ_{r ∈ {q : w_q ∈ d_k}} n_rk is the sum of the numbers of occurrences of all terms in the k-th document d_k, and log( |D| / |{d_k ∈ D : w_i ∈ d_k}| ) represents the inverse document frequency of term w_i.

2.2 The pattern-based model generation

Frequent itemset mining is a well-established data mining approach that focuses on discovering recurrences, i.e., itemsets, that frequently occur in the source data.
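For concreteness, frequent-itemset discovery over a toy transactional representation can be sketched with a brute-force enumerator (illustrative only; the transactions are invented, and practical miners prune the candidate space rather than enumerate it):

```python
from itertools import combinations

def support(itemset, T):
    """sup(I) = |{tr in T : I ⊆ tr}| / |T|"""
    return sum(1 for tr in T if itemset <= tr) / len(T)

def frequent_itemsets(T, min_sup):
    """Exhaustively enumerate all itemsets whose support reaches min_sup."""
    items = sorted(set().union(*T))
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = support(set(cand), T)
            if s >= min_sup:
                freq[cand] = s
    return freq

# Toy transactional view of three sentences
T = [{'network', 'access'}, {'network', 'user'}, {'user', 'policy'}]
print(frequent_itemsets(T, min_sup=2/3))
```

With min_sup = 2/3, only the singletons ('network',) and ('user',) survive, each with support 2/3.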
An itemset I of length k, i.e., a k-itemset, is a set of k distinct items. Let T be the document collection in the transactional data format. We denote as D(I) the set of transactions supported by I, i.e., D(I) = {tr_jk ∈ T | I ⊆ tr_jk}. The support of an itemset I is the observed frequency of occurrence of I in T, i.e., sup(I) = |D(I)| / |T|. Since the problem of discovering all itemsets in a transactional dataset is computationally intractable [1], itemset mining is commonly driven by a minimum support threshold min_sup.

Given a minimum support threshold min_sup and a model size p, PatTexSum generates a pattern-based model that includes the most informative yet non-redundant set of p frequent itemsets discovered from the document collection T tailored to the transactional data format (Cf. Section 2.1). Among the large set of previously proposed approaches focused on succinctly representing transactional data by means of itemsets [8, 17, 18], we adopt a method recently proposed in [11]. Unlike previous approaches, it exploits an entropy-based heuristic to drive the mining process and select the most informative yet non-redundant itemsets without the need for postpruning. Its efficiency and effectiveness in discovering succinct transactional data summaries makes it particularly suitable for the application to text summarization.

2.3 Sentence evaluation and selection

The PatTexSum method exploits the pattern-based model to evaluate and select the most relevant sentences to include in the summary. Sentence evaluation and selection steps consider (i) a sentence relevance score that combines the tf-idf statistics [16] associated with each sentence term, and (ii) the sentence coverage with respect to the generated pattern-based model (Cf. Section 2.2). In the following we formalize both sentence coverage and relevance.

Sentence relevance score. The relevance score of a sentence is evaluated by using the bag-of-word document representation.
It is computed as the sum of the tf-idf values (Cf. Formula 1) of each term belonging to the sentence in the document collection. Formula 2 reports the score expression for a generic sentence s_jk belonging to the document collection D:

    SR(s_jk) = ( Σ_{i : w_i ∈ s_jk} tc_ik ) / |t_jk|    (2)

where |t_jk| is the number of distinct terms occurring in s_jk, and Σ_{i : w_i ∈ s_jk} tc_ik is the sum of the tf-idf values associated with the terms (i.e., word stems) in s_jk (Cf. Formula 1).

Sentence model coverage. The sentence coverage measures the pertinence of each sentence to the generated pattern-based model. To this aim, it considers document sentences tailored to the transactional data format. Let D be the collection of documents, i.e., a set of sentences. We first associate with each sentence s_jk ∈ D a binary vector, denoted in the following as sentence coverage vector (SC), SC_jk = {sc_1, ..., sc_p}, where p is the number of itemsets belonging to the model and sc_i = 1_{tr_jk}(I_i) indicates whether itemset I_i is included in tr_jk or not.
More formally, 1_{tr_jk} is an indicator function defined as follows:

    1_{tr_jk}(I_i) = 1 if I_i ⊆ tr_jk, 0 otherwise    (3)

Algorithm 1 Sentence selection – Greedy approach
Input: set of sentence relevance scores SR, set of sentence coverage vectors SC, tf-idf matrix TC
Output: summary S
1: {Initializations}
2: S = ∅
3: ESC = ∅ {set of eligible sentence coverage vectors}
4: SC* = all_zeros() {summary coverage vector with only 0s}
5: {Cycle until either SC* contains only 1s or all the SC vectors contain only zeros}
6: while not (summary_coverage_vector_all_ones() or sentence_coverage_vectors_only_zeros()) do
7:   {Determine the sentences with the highest number of ones}
8:   ESC = max_ones_sentences()
9:   if ESC != ∅ then
10:    {Select the sentence with maximum relevance score}
11:    SC_best = ESC[1]
12:    for all t ∈ ESC[2:] do
13:      if SR_t > SR_best then
14:        SC_best = SC_t
15:      end if
16:    end for
17:    {Update sets and summary coverage vector}
18:    S = S ∪ SC_best
19:    SC* = SC* OR SC_best
20:    ESC = ESC \ SC_best
21:    {Update the sentence coverage vectors belonging to SC}
22:    for all SC_i in SC do
23:      SC_i = SC_i AND SC*
24:    end for
25:  else
26:    break
27:  end if
28: end while
29: return S

The coverage of a sentence s_jk with respect to the pattern-based model is defined as the number of 1's that occur in the corresponding coverage vector SC_jk. We formalize the problem of selecting the most informative and non-redundant sentences according to the pattern-based model as a set covering problem.

The set covering problem. A set covering algorithm focuses on selecting the minimum set of sentences, of arbitrary size l, whose logical OR of coverage vectors, i.e., SC* = SC_1 ∨ ... ∨ SC_l, generates a binary vector composed of all 1's. This implies that each itemset belonging to the model is covered by at least one selected sentence. The SC* vector will be denoted as the summary coverage vector throughout the paper. The set covering problem is known to be NP-hard.
To solve the problem, we adopt a greedy strategy that we already proved to be effective for the summarization of biological microarray data [2]. In order to build an accurate yet concise summary, the sentence coverage with respect to the pattern-based model is considered as the most discriminative feature, i.e., sentences that cover the maximum number of itemsets belonging to the model are selected first. In case of ties, the maximal-coverage sentence characterized by the highest relevance score SR is preferred. The adopted algorithm identifies, at each step, the sentence s_jk with the best complementary vector SC_jk with respect to the current summary coverage vector SC*.

The pseudo-code of the greedy approach is reported in Algorithm 1. It takes as input the set of sentence relevance scores SR, the set of sentence coverage vectors SC, and the tf-idf matrix TC. It produces the summary S, i.e., the minimal subset of most representative sentences. The first step is the variable initialization and the sentence coverage vector computation (lines 1-4). Next, the sentence with maximum coverage, i.e., the one whose coverage vector contains the highest number of ones, is iteratively selected (line 7). In case of ties, the sentence with maximum relevance score (Cf. Formula 2) is preferred (lines 12-16). Finally, the selected sentence is included in the summary S while the summary and sentence coverage vectors are updated (lines 18-24). The procedure iterates until either the summary coverage vector contains only ones, i.e., the model is fully covered by the summary, or the remaining sentences are not covered by any itemset, i.e., the remaining sentences are not pertinent to the model (line 6). Experimental results, reported in Section 3, show that the proposed summarization method performs better than exclusively considering either sentence coverage or sentence relevance.
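A simplified, self-contained sketch of this greedy selection follows (not a line-by-line transcription of Algorithm 1; the sentence names and scores are invented):

```python
def greedy_select(coverage, relevance):
    """coverage[j]: binary coverage vector of sentence j over the p model
    itemsets; relevance[j]: its SR score. Repeatedly pick the sentence that
    covers the most still-uncovered itemsets, breaking ties by SR, until the
    model is fully covered or no remaining sentence adds coverage."""
    p = len(next(iter(coverage.values())))
    covered = [0] * p
    summary = []
    remaining = dict(coverage)
    while remaining and not all(covered):
        def gain(j):  # number of itemsets j would newly cover
            return sum(1 for c, s in zip(remaining[j], covered) if c and not s)
        best = max(remaining, key=lambda j: (gain(j), relevance[j]))
        if gain(best) == 0:
            break  # remaining sentences are not pertinent to the model
        covered = [c or s for c, s in zip(remaining[best], covered)]
        summary.append(best)
        del remaining[best]
    return summary

cov = {'s1': [1, 1, 0], 's2': [0, 1, 1], 's3': [1, 0, 0]}
sr = {'s1': 0.4, 's2': 0.6, 's3': 0.9}
print(greedy_select(cov, sr))  # ['s2', 's3']
```

Here s1 and s2 both cover two itemsets, so the tie is broken by relevance in favor of s2; s3 then completes the coverage, and s1 is never needed.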
3 Experimental results

We conducted a set of experiments to address the following issues: (i) the effectiveness of the proposed summarization approach against two widely used summarizers, i.e., the Open Text Summarizer (OTS) [14] and TexLexAn [19] (Section 3.1), and (ii) the impact of the pattern-based model size and the support threshold on the performance of PatTexSum (Section 3.2). We evaluated all the summarization approaches on a collection of real-life news articles. To this aim, the 10 top-ranked news documents, provided by the Google web search engine (http://www.google.com), that concern the following recent news topics have been selected:

– Natural Disaster: Earthquake in Spain 2011
– Royal Wedding: Prince William and Kate Middleton wedding
– Technology: Microsoft purchased Skype
– Education: Wealthy parents could buy their children places at elite universities
– Sport: Australia defeat Pakistan in Azlan Shah Hockey

The datasets relative to the above news categories are made available for research purposes upon request to the authors. To compare the results of PatTexSum with OTS [14] and TexLexAn [19], we used the ROUGE [10] toolkit (version 1.5.5), which is widely applied by the Document Understanding Conference (DUC) for document summarization performance evaluation (the command used is: ROUGE-1.5.5.pl -e data -x -m -2 4 -u -c 95 -r 1000 -n 4 -f A -p 0.5 -t 0 -d -a). It measures the quality of a summary by counting the unit overlaps between the candidate summary and a set of reference summaries. Intuitively, the summarizer that achieves the highest ROUGE scores can be considered the most effective one. Several automatic evaluation scores are implemented in ROUGE. For the sake of brevity, we report only ROUGE-2 and ROUGE-4 as representative scores. Analogous results have been obtained for the other scores.
Since a "golden summary" (i.e., the optimal document collection summary) is not available for web news documents, we performed a leave-one-out cross validation. More specifically, for each category we summarized nine out of ten news documents and compared the resulting summary with the remaining (not yet considered) document, which was selected as the golden summary at this stage. Next, we tested all other possible combinations by varying the golden summary and computed the average performance results, in terms of precision, recall, and F-measure, achieved by each summarizer for both ROUGE-2 and ROUGE-4.

3.1 Performance comparison and validation

We evaluated the performance, in terms of ROUGE-2 and ROUGE-4 precision (Pr), recall (R), and F-measure (F), of PatTexSum against OTS and TexLexAn. For both OTS and TexLexAn we adopted the configuration suggested by the respective authors. For PatTexSum we enforced a minimum support threshold min_sup = 1.5% and tuned the value of the pattern-based model size p to its best value for each considered dataset. A more detailed discussion on the impact of both min_sup and p on the performance of PatTexSum is reported in Section 3.2. PatTexSum performs better than the other considered summarizers on all tested datasets. To validate the statistical significance of the PatTexSum performance improvement over OTS and TexLexAn, we used the paired t-test [7] at significance level 0.05 for all evaluated datasets and measures. For ROUGE-2, PatTexSum provides significantly better results than OTS, whose summarization approach is mainly based on the tf-idf measure, and TexLexAn in terms of precision and/or recall on 3 out of 5 datasets (i.e., Natural Disaster, Technology, and Sports).
Moreover, PatTexSum significantly outperforms TexLexAn and OTS in terms of F-measure (i.e., the harmonic average of precision and recall [16]) on, respectively, 2 and 3 of them (i.e., Natural Disaster and Technology for both, and Sports for TexLexAn). Similar results were obtained for ROUGE-4.

Table 1. Performance comparison in terms of ROUGE-2 score.

Dataset            p   PatTexSum             OTS                   TexLexAn
                       R      Pr     F       R      Pr     F       R      Pr     F
Natural Disaster   16  0.116  0.288  0.141   0.040  0.120  0.053   0.038  0.114  0.045
Royal Wedding      12  0.036  0.215  0.058   0.034  0.174  0.054   0.030  0.150  0.047
Technology         5   0.141  0.465  0.210   0.042  0.208  0.067   0.042  0.172  0.065
Sports             10  0.145  0.297  0.189   0.055  0.133  0.075   0.071  0.149  0.093
Education          8   0.039  0.241  0.064   0.036  0.170  0.054   0.034  0.150  0.051

Table 2. Performance comparison in terms of ROUGE-4 score.

Dataset            p   PatTexSum             OTS                   TexLexAn
                       R      Pr     F       R      Pr     F       R      Pr     F
Natural Disaster   16  0.060  0.125  0.068   0.005  0.012  0.006   0.005  0.011  0.006
Royal Wedding      12  0.009  0.082  0.015   0.003  0.018  0.005   0.003  0.018  0.005
Technology         5   0.113  0.356  0.167   0.009  0.065  0.016   0.003  0.011  0.005
Sports             10  0.059  0.112  0.077   0.004  0.010  0.006   0.022  0.036  0.027
Education          8   0.017  0.141  0.030   0.003  0.012  0.005   0.003  0.009  0.004

3.2 PatTexSum parameter analysis

We analyzed the impact of the minimum support threshold and the pattern-based model size, i.e., the number of generated itemsets, on the performance of the PatTexSum summarizer. To also test the impact of the tf-idf statistic on the performance of the pattern-based summarizer, we evaluated (i) neglecting the relevance score evaluation (i.e., simply selecting the top-ranked maximal-coverage sentence provided by the itemset miner [11]), and (ii) considering other statistical measures in place of the tf-idf score. Among all the evaluated scores, the tf-idf statistic turns out to be the most effective measure in discriminating among sentences.
In Figures 2(a) and 2(b) we report the F-measure achieved by PatTexSum, with and without the relevance score in the sentence evaluation, when varying, respectively, the support threshold on Technology and the model size on the Natural Disaster document collection. For the sake of brevity, we report only the results obtained with the ROUGE-4 score. Analogous results have been obtained for the other ROUGE scores, for the precision and recall measures, and for all other configurations. The usage of the relevance score based on the tf-idf statistic always improves the performance of PatTexSum in the range of those values of p and min_sup yielding the highest F-measure. This improvement is due to its ability to well discriminate sentence term occurrence among documents. When higher support thresholds (e.g., 5%) are enforced, many informative patterns are discarded, thus the model becomes too general to yield high summarization performance. Conversely, when very low support thresholds (e.g., 0.1%) are enforced, data overfitting occurs, i.e., the model is too specialized to effectively and concisely summarize the whole document collection content. At medium support thresholds (e.g., 1.5%) the best balance between model specialization and generalization is achieved, thus PatTexSum produces very concise yet informative summaries.

Fig. 2. PatTexSum performance analysis with and without the relevance score (SR). ROUGE-4 score. F-measure. (a) Technology, p=5: impact of the support threshold. (b) Natural Disaster, min_sup=1.5%: impact of the pattern-based model size.

The model size may also significantly affect the summarization performance.
When a limited number of itemsets (e.g., p = 6) is selected, the relevant knowledge hidden in the news category Natural Disaster is not yet fully covered by the extracted patterns (see Figure 2(b)), thus the generated summaries are not highly informative. When p = 16 the pattern-based model provides the most informative and non-redundant knowledge. Consequently, the multi-document pattern-based summarization becomes very effective. When a higher number of itemsets is included in the model, the quality of the generated summaries worsens, as the model is still informative but redundant. The best values of model size and support threshold for each news category depend on the analyzed document term distribution.

4 Conclusions and future works

This paper presents a multi-document summarizer that combines the knowledge provided by a pattern-based model, composed of frequent itemsets, with a statistical evaluation, based on the well-founded tf-idf measure, to select the most representative and non-redundant sentences. Although the application of frequent itemsets to represent the most valuable correlations among transactional data is well-established, their usage in text summarization had never been investigated before. The proposed summarizer exploits a greedy approach to combine knowledge discovered from two different data representations, i.e., the transactional and bag-of-word representations, and select the minimal set of most relevant sentences. Experiments conducted on real-life news articles show both the effectiveness and the efficiency of the proposed text summarization method. Future works will address: (i) the extension of the proposed approach to the problem of incremental summary updating, and (ii) the exploitation of new techniques to address the set covering problem.

References

1. R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216, 1993.
2. E.
Baralis, G. Bruno, and A. Fiori. Minimum number of genes for microarray feature selection. In 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC-08), pages 5692–5695, 2008.
3. S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O'Reilly Media, 2009.
4. S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD Conference, pages 265–276, 1997.
5. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web, pages 107–117, 1998.
6. J. M. Conroy, J. Goldstein, J. D. Schlesinger, and D. P. O'Leary. Left-brain/right-brain multi-document summarization. In Proceedings of the Document Understanding Conference, 2004.
7. T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1998.
8. S. Jaroszewicz and D. A. Simovici. Interestingness of frequent itemsets using Bayesian networks as background knowledge. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 178–186, 2004.
9. K.-N. Kontonasios and T. D. Bie. An information-theoretic approach to finding informative noisy tiles in binary databases. In SIAM International Conference on Data Mining, pages 153–164, 2010.
10. C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, pages 71–78, 2003.
11. M. Mampaey, N. Tatti, and J. Vreeken. Tell me what I need to know: Succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2011.
12. G. Erkan and D. R. Radev.
Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:2004, 2004. 13. D. R. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40(6):919 – 938, 2004. 14. N. Rotem. Open text summarizer (ots). Retrieved July, 3(2006):2006, 2003. 15. H. Takamura and M. Okumura. Text summarization model based on the budgeted median problem. In Proceeding of the 18th ACM conference on Information and knowledge management, pages 1589–1592, 2009. 16. P. Tan, M. Steinbach, V. Kumar, et al. Introduction to data mining. Pearson Addison Wesley Boston, 2006. 24 17. N. Tatti. Probably the best itemsets. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 293–302, 2010. 18. N. Tatti and H. Heikinheimo. Decomposable families of itemsets. In Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II, pages 472–487, 2008. 19. TexLexAn. Texlexan: An open-source text summarizer, 2011. 20. D. Wang and T. Li. Document update summarization using incremental hierarchical clustering. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 279–288, 2010. 25 An Expectation Maximization Algorithm for Probabilistic Logic Programs Elena Bellodi and Fabrizio Riguzzi ENDIF – Università di Ferrara – Via Saragat, 1 – 44122 Ferrara, Italy. {elena.bellodi,fabrizio.riguzzi}@unife.it Abstract. Recently much work in Machine Learning has concentrated on representation languages able to combine logic and probability, leading to the birth of a whole field called Statistical Relational Learning. 
In this paper we present a technique for parameter learning targeted at a family of formalisms where uncertainty is represented using Logic Programming tools, the so-called Probabilistic Logic Programs, such as ICL, PRISM, ProbLog and LPADs. Since their equivalent Bayesian networks contain hidden variables, an EM algorithm is adopted. To speed up the computation, expectations are computed directly on the Binary Decision Diagrams that are built for inference. The resulting system, called EMBLEM for "EM over BDDs for probabilistic Logic programs Efficient Mining", has been applied to various datasets and showed good performance in terms of both speed and memory.

1 Introduction

In the field of Statistical Relational Learning (SRL), logical-statistical languages are used to learn effectively in complex domains involving relations and uncertainty. They have been successfully applied to social network analysis, entity recognition, information extraction, etc. Similarly, a large number of works in Logic Programming have attempted to combine logic and probability, among which the distribution semantics [11] is a prominent approach. It underlies, for example, PRISM [11], the Independent Choice Logic, Logic Programs with Annotated Disjunctions (LPADs) [15], ProbLog [3] and CP-logic. The approach is appealing because efficient inference algorithms have appeared [3,9], which adopt Binary Decision Diagrams (BDDs). In this paper we present the EMBLEM system ("EM over BDDs for probabilistic Logic programs Efficient Mining"), which learns the parameters of probabilistic logic programs under the distribution semantics by using an Expectation Maximization (EM) algorithm: an iterative method that estimates the unknown parameters Θ of a model, given a dataset where some of the data is missing, by finding maximum likelihood estimates of Θ.
The translation of these programs into graphical models requires the use of hidden variables and therefore of EM: the main characteristic of our system is the computation of expectations using BDDs. Since there are transformations with linear complexity that can convert a program in one of these languages into the others [2], we will use LPADs because of their general syntax. EMBLEM has been tested on the IMDB, Cora and UW-CSE datasets and compared with RIB [10], LeProbLog [3], Alchemy [8] and CEM, an implementation of EM based on the cplint interpreter [9]. The paper is organized as follows. Section 2 presents LPADs and Section 3 describes EMBLEM. Section 4 presents experimental results. Section 5 discusses related works and Section 6 concludes the paper.

2 Logic Programs with Annotated Disjunctions

Formally, a Logic Program with Annotated Disjunctions [15] consists of a finite set of annotated disjunctive clauses. An annotated disjunctive clause $C_i$ is of the form

$h_{i1} : \Pi_{i1}; \ldots; h_{in_i} : \Pi_{in_i} \; \text{:-} \; b_{i1}, \ldots, b_{im_i}.$

In such a clause $h_{i1}, \ldots, h_{in_i}$ are logical atoms, $b_{i1}, \ldots, b_{im_i}$ are logical literals, and $\Pi_{i1}, \ldots, \Pi_{in_i}$ are real numbers in the interval $[0, 1]$ such that $\sum_{k=1}^{n_i} \Pi_{ik} \le 1$. $b_{i1}, \ldots, b_{im_i}$ is called the body and is indicated with $body(C_i)$. If $\sum_{k=1}^{n_i} \Pi_{ik} < 1$, the head of the annotated disjunctive clause implicitly contains an extra atom $null$ that does not appear in the body of any clause and whose annotation is $1 - \sum_{k=1}^{n_i} \Pi_{ik}$. We denote by $ground(T)$ the grounding of an LPAD $T$. An atomic choice is a triple $(C_i, \theta_j, k)$ where $C_i \in T$, $\theta_j$ is a substitution that grounds $C_i$ and $k \in \{1, \ldots, n_i\}$. $(C_i, \theta_j, k)$ means that, for the ground clause $C_i\theta_j$, the head $h_{ik}$ was chosen. In practice $C_i\theta_j$ corresponds to a random variable $X_{ij}$ and an atomic choice $(C_i, \theta_j, k)$ to an assignment $X_{ij} = k$. A set of atomic choices $\kappa$ is consistent if $(C, \theta, i) \in \kappa, (C, \theta, j) \in \kappa \Rightarrow i = j$, i.e., only one head is selected for the same ground clause.
A composite choice $\kappa$ is a consistent set of atomic choices. The probability $P(\kappa)$ of a composite choice $\kappa$ is the product of the probabilities of the individual atomic choices, i.e., $P(\kappa) = \prod_{(C_i,\theta_j,k)\in\kappa} \Pi_{ik}$. A selection $\sigma$ is a composite choice that, for each clause $C_i\theta_j$ in $ground(T)$, contains an atomic choice $(C_i, \theta_j, k)$. We denote the set of all selections $\sigma$ of a program $T$ by $S_T$. A selection $\sigma$ identifies a normal logic program $w_\sigma$ defined as $w_\sigma = \{(h_{ik} \leftarrow body(C_i))\theta_j \mid (C_i, \theta_j, k) \in \sigma\}$. $w_\sigma$ is called a world of $T$. Since selections are composite choices, we can assign a probability to possible worlds: $P(w_\sigma) = P(\sigma) = \prod_{(C_i,\theta_j,k)\in\sigma} \Pi_{ik}$. We consider only sound LPADs, in which every possible world has a total well-founded model. Subsequently we will write $w_\sigma \models Q$ to mean that the query $Q$ is true in the well-founded model of the program $w_\sigma$. The probability of a query $Q$ according to an LPAD $T$ is given by $P(Q) = \sum_{\sigma\in E(Q)} P(\sigma)$, where $E(Q) = \{\sigma \in S_T \mid w_\sigma \models Q\}$ is the set of selections corresponding to worlds where the query is true. To reduce the computational cost of answering queries in our experiments, random variables can be directly associated with clauses rather than with their ground instantiations: atomic choices then take the form $(C_i, k)$, meaning that head $h_{ik}$ is selected from program clause $C_i$, i.e., that $X_i = k$.

Example 1. The following LPAD T encodes a very simple model of the development of an epidemic or pandemic:

C1 = epidemic : 0.6 ; pandemic : 0.3 :- flu(X), cold.
C2 = cold : 0.7.
C3 = flu(david).
C4 = flu(robert).

Clause C1 has two groundings, $C_1\theta_1$ with $\theta_1 = \{X/david\}$ and $C_1\theta_2$ with $\theta_2 = \{X/robert\}$, so there are two random variables $X_{11}$ and $X_{12}$. The possible worlds in which a query is true can be represented using a Multivalued Decision Diagram (MDD). An MDD represents a function $f(\mathbf{X})$ taking Boolean values on a set of multivalued variables $\mathbf{X}$ by means of a rooted graph that has one level for each variable.
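As a sanity check (an illustrative snippet, not code from the paper), the distribution semantics of Example 1 can be verified by brute-force enumeration of the selections: each ground probabilistic clause contributes one atomic choice, the probability of a world is the product of the chosen annotations, and P(Q) sums the probabilities of the worlds entailing the query.

```python
from itertools import product

# Atomic choices for the ground probabilistic clauses of Example 1:
# C1{X/david} and C1{X/robert} each select epidemic, pandemic or null;
# C2 selects cold or null; flu(david) and flu(robert) are certain facts.
C1_HEADS = [("epidemic", 0.6), ("pandemic", 0.3), ("null", 0.1)]
C2_HEADS = [("cold", 0.7), ("null", 0.3)]

def prob_epidemic():
    """Sum P(sigma) over all selections whose world entails 'epidemic'."""
    p = 0.0
    for (h1, p1), (h2, p2), (hc, pc) in product(C1_HEADS, C1_HEADS, C2_HEADS):
        # The body flu(X), cold succeeds only if cold was selected, and
        # epidemic holds if at least one grounding of C1 selected it.
        if hc == "cold" and "epidemic" in (h1, h2):
            p += p1 * p2 * pc
    return p

print(prob_epidemic())  # ≈ 0.588
```

Enumerating the $3 \times 3 \times 2 = 18$ selections confirms $P(epidemic) = 0.7 \cdot (1 - 0.4^2) = 0.588$; the decision diagrams introduced next compute the same quantity without explicit enumeration.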
Each node is associated with the variable of its level and has one child for each possible value of the variable. The leaves store either 0 or 1. Given values for all the variables $\mathbf{X}$, we can compute the value of $f(\mathbf{X})$ by traversing the graph starting from the root and returning the value stored in the leaf that is reached. An MDD can be used to represent the set $E(Q)$ by considering the multivalued variable $X_{ij}$ associated with $C_i\theta_j$ of $ground(T)$. $X_{ij}$ has values $\{1, \ldots, n_i\}$ and the atomic choice $(C_i, \theta_j, k)$ corresponds to the propositional equation $X_{ij} = k$. If we represent with an MDD the function $f(\mathbf{X}) = \bigvee_{\sigma\in E(Q)} \bigwedge_{(C_i,\theta_j,k)\in\sigma} (X_{ij} = k)$, then the MDD will have a path to a 1-leaf for each possible world where $Q$ is true. While building MDDs, simplification operations can be applied that delete or merge nodes. In this way a reduced MDD is obtained, as opposed to a Multivalued Decision Tree (MDT), i.e., an MDD in which every node has a single parent, all the children belong to the level immediately below and all the variables have at least one node. For example, the reduced MDD corresponding to the query epidemic from Example 1 is shown in Figure 1(a). The labels on the edges represent the values of the variable associated with the node: nodes at the first and second level have three outgoing edges, corresponding to the values of $X_{11}$ and $X_{12}$, since C1 has three head atoms (epidemic, pandemic, null); similarly $X_{21}$ has two values since C2 has two head atoms (cold, null), hence the associated node has two outgoing edges.

[Fig. 1. Decision diagrams for Example 1: (a) MDD; (b) BDD.]

It is often unfeasible to find all the worlds where the query is true, so inference algorithms instead find explanations for it, i.e., composite choices such that the query is true in all the worlds whose selections are a superset of them.
Explanations, however, differently from possible worlds, are not necessarily mutually exclusive. Nevertheless, since MDDs split paths on the basis of the values of a variable and the resulting branches are mutually disjoint, the probability of the query can still be computed. Most packages for the manipulation of decision diagrams are however restricted to work on Binary Decision Diagrams (BDDs), i.e., decision diagrams where all the variables are Boolean. A node $n$ in a BDD has two children: the 1-child, indicated with $child_1(n)$, and the 0-child, indicated with $child_0(n)$. The 0-branch, the one going to the 0-child, is drawn with a dashed line. To work on MDDs with a BDD package we must represent multivalued variables by means of binary variables. For a multivalued variable $X_{ij}$, corresponding to ground clause $C_i\theta_j$ and having $n_i$ values, we use $n_i - 1$ Boolean variables $X_{ij1}, \ldots, X_{ijn_i-1}$ and we represent the equation $X_{ij} = k$, for $k = 1, \ldots, n_i - 1$, by means of the conjunction $\overline{X_{ij1}} \wedge \overline{X_{ij2}} \wedge \ldots \wedge \overline{X_{ijk-1}} \wedge X_{ijk}$, and the equation $X_{ij} = n_i$ by means of the conjunction $\overline{X_{ij1}} \wedge \overline{X_{ij2}} \wedge \ldots \wedge \overline{X_{ijn_i-1}}$. BDDs obtained in this way can be used as well for computing the probability of queries, by associating with each Boolean variable $X_{ijk}$ a parameter $\pi_{ik}$ that represents $P(X_{ijk} = 1)$. If we define $g(i) = \{j \mid \theta_j \text{ is a substitution grounding } C_i\}$, then $P(X_{ijk} = 1) = \pi_{ik}$ for all $j \in g(i)$. The parameters are obtained from those of the multivalued variables as $\pi_{i1} = \Pi_{i1}, \ldots, \pi_{ik} = \frac{\Pi_{ik}}{\prod_{j=1}^{k-1}(1 - \pi_{ij})}$, up to $k = n_i - 1$. Figure 1(b) shows the reduced BDD corresponding to the MDD on the left, with binary variables for each level.

3 EMBLEM

EMBLEM applies the algorithm for performing EM over BDDs proposed in [14,6] to the problem of learning the parameters of an LPAD. EMBLEM takes as input a number of goals that represent the examples and, for each one, generates the BDD encoding its explanations.
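The mapping from the annotations $\Pi_{ik}$ of a multivalued variable to the parameters $\pi_{ik}$ of its Boolean encoding can be sketched as follows (an illustrative snippet implementing the formula above, not EMBLEM's actual code):

```python
def binary_params(annotations):
    """Convert the annotations [Pi_1, ..., Pi_{n-1}] of an n-valued variable
    into the parameters [pi_1, ..., pi_{n-1}] of its Boolean encoding,
    using pi_k = Pi_k / prod_{j<k} (1 - pi_j)."""
    pis, remaining = [], 1.0
    for Pi_k in annotations:
        pi_k = Pi_k / remaining
        pis.append(pi_k)
        remaining *= 1.0 - pi_k  # probability mass left for the values after k
    return pis

# Clause C1 of Example 1: Pi = [0.6, 0.3], the implicit null atom gets 0.1.
# pi_1 = 0.6 and pi_2 = 0.3 / (1 - 0.6) = 0.75, so the encoding reproduces
# P(X=1) = 0.6, P(X=2) = 0.4 * 0.75 = 0.3 and P(X=3) = 0.4 * 0.25 = 0.1.
pi = binary_params([0.6, 0.3])
```

The check in the comments shows that walking the chain of Boolean variables recovers exactly the original multivalued distribution.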
The examples are organized in a set of interpretations (sets of ground facts), each describing a portion of the domain of interest. The queries correspond to ground atoms whose predicate has been indicated as "target" by the user. The predicates can be treated as closed-world or open-world. In the first case, the body of clauses with a target predicate in the head is resolved only with facts in the interpretation; in the second case, it is resolved both with facts in the interpretation and with clauses in the theory. If the latter option is set and the theory is cyclic, we use a depth bound on SLD-derivations to avoid going into infinite loops. Given a program containing the clauses C1 and C2 from Example 1 and the interpretation {epidemic, flu(david), flu(robert)}, we obtain the BDD in Figure 1(b), which represents the query epidemic. Then EMBLEM enters the EM cycle, in which the steps of expectation and maximization are repeated until the log-likelihood of the examples reaches a local maximum. For a single example Q:

– Expectation: computes $E[c_{ik0} \mid Q]$ and $E[c_{ik1} \mid Q]$ for all rules $C_i$ and $k = 1, \ldots, n_i - 1$, where $c_{ikx}$ is the number of times a variable $X_{ijk}$ takes value $x$, for $x \in \{0, 1\}$ and $j$ in $g(i)$. $E[c_{ikx} \mid Q]$ is given by $\sum_{j\in g(i)} P(X_{ijk} = x \mid Q)$.
– Maximization: computes $\pi_{ik}$ for all rules $C_i$ and $k = 1, \ldots, n_i - 1$: $\pi_{ik} = \frac{E[c_{ik1}|Q]}{E[c_{ik0}|Q] + E[c_{ik1}|Q]}$.

If we have more than one example, the contributions of the examples simply sum up when computing $E[c_{ikx}]$. $P(X_{ijk} = x \mid Q)$ is given by $P(X_{ijk} = x \mid Q) = \frac{P(X_{ijk} = x, Q)}{P(Q)}$, with

$P(X_{ijk} = x, Q) = \sum_{\sigma\in E(Q)} P(Q, X_{ijk} = x, \sigma) = \sum_{\sigma\in E(Q)} P(Q \mid \sigma) P(X_{ijk} = x \mid \sigma) P(\sigma) = \sum_{\sigma\in E(Q)} P(X_{ijk} = x \mid \sigma) P(\sigma)$

where $P(X_{ijk} = 1 \mid \sigma) = 1$ if $(C_i, \theta_j, k) \in \sigma$ for $k = 1, \ldots, n_i - 1$, and 0 otherwise.
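The last equality can be checked on Example 1 by naive enumeration over the selections (illustrative code only; EMBLEM computes these quantities on the BDD instead). The snippet computes $P(X_{11} = 1, Q)$ and $P(Q)$ for Q = epidemic:

```python
from itertools import product

# Head indices for the ground clauses of Example 1:
# C1 groundings: 0 = epidemic, 1 = pandemic, 2 = null; C2: 0 = cold, 1 = null.
C1_PROBS = [0.6, 0.3, 0.1]
C2_PROBS = [0.7, 0.3]

def joint_and_marginal():
    """Return (P(X11 = 1, Q), P(Q)) for Q = epidemic, summing P(sigma)
    over the selections sigma whose world entails the query."""
    p_joint = p_q = 0.0
    for k1, k2, kc in product(range(3), range(3), range(2)):
        p_sigma = C1_PROBS[k1] * C1_PROBS[k2] * C2_PROBS[kc]
        if kc == 0 and (k1 == 0 or k2 == 0):   # world entails epidemic
            p_q += p_sigma
            if k1 == 0:                        # X11 = 1 in this selection
                p_joint += p_sigma
    return p_joint, p_q

p_joint, p_q = joint_and_marginal()
print(p_joint / p_q)  # P(X11 = 1 | epidemic) = 0.42 / 0.588 ≈ 0.714
```

Such enumeration is exponential in the number of ground clauses, which is exactly why the expectations are computed on the BDD.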
Since there is a one-to-one correspondence between the worlds where $Q$ is true and the paths to a 1 leaf in a Binary Decision Tree (an MDT with binary variables),

$P(X_{ijk} = x, Q) = \sum_{\rho\in R(Q)} P(X_{ijk} = x \mid \rho) \prod_{d\in\rho} \pi(d)$

where $\rho$ is a path and, if $\sigma$ corresponds to $\rho$, then $P(X_{ijk} = x \mid \sigma) = P(X_{ijk} = x \mid \rho)$. $R(Q)$ is the set of paths in the BDD for query $Q$ that lead to a 1 leaf, $d$ is an edge of $\rho$ and $\pi(d)$ is the probability associated with the edge: if $d$ is the 1-branch from a node associated with a variable $X_{ijk}$, then $\pi(d) = \pi_{ik}$; if $d$ is the 0-branch from a node associated with a variable $X_{ijk}$, then $\pi(d) = 1 - \pi_{ik}$. Now consider a BDT in which only the merge rule is applied, fusing together identical sub-diagrams. The resulting diagram, which we call Complete Binary Decision Diagram (CBDD), is such that every path contains a node for every level. For a CBDD, $P(X_{ijk} = x, Q)$ can be further expanded as

$P(X_{ijk} = x, Q) = \sum_{\rho\in R(Q) \wedge (X_{ijk}=x)\in\rho} \; \prod_{d\in\rho} \pi(d)$

where $(X_{ijk} = x) \in \rho$ means that $\rho$ contains an $x$-edge from a node associated with $X_{ijk}$. We can then write

$P(X_{ijk} = x, Q) = \sum_{n\in N(Q) \wedge v(n)=X_{ijk}} \; \sum_{\rho_n\in R_n(Q)} \; \sum_{\rho^n\in R^n(Q,x)} \; \prod_{d\in\rho_n} \pi(d) \prod_{d\in\rho^n} \pi(d)$

where $N(Q)$ is the set of nodes of the BDD, $v(n)$ is the variable associated with node $n$, $R_n(Q)$ is the set containing the paths from the root to $n$ and $R^n(Q, x)$ is the set of paths from $n$ to the 1 leaf through its $x$-child. Hence

$P(X_{ijk} = x, Q) = \sum_{n\in N(Q) \wedge v(n)=X_{ijk}} \left( \sum_{\rho_n\in R_n(Q)} \prod_{d\in\rho_n} \pi(d) \right) \left( \sum_{\rho^n\in R^n(Q,x)} \prod_{d\in\rho^n} \pi(d) \right) = \sum_{n\in N(Q) \wedge v(n)=X_{ijk}} F(n)\, B(child_x(n))\, \pi_{ikx}$

where $\pi_{ikx}$ is $\pi_{ik}$ if $x = 1$ and $1 - \pi_{ik}$ if $x = 0$. $F(n)$ is the forward probability [6], the probability mass of the paths from the root to $n$, while $B(n)$ is the backward probability [6], the probability mass of the paths from $n$ to the 1 leaf. If $root$ is the root of the tree for a query $Q$, then $B(root) = P(Q)$.
The expression $F(n)\,B(child_x(n))\,\pi_{ikx}$ represents the sum of the probabilities of all the paths passing through the $x$-edge of node $n$ and is indicated with $e^x(n)$. Thus

$P(X_{ijk} = x, Q) = \sum_{n\in N(Q), v(n)=X_{ijk}} e^x(n) \quad (1)$

For the case of a BDD, i.e., a diagram obtained by also applying the deletion rule, Formula (1) is no longer valid, since paths where there is no node associated with $X_{ijk}$ can also contribute to $P(X_{ijk} = x, Q)$. These paths might have been obtained from a BDD having a node $m$, associated with variable $X_{ijk}$, that is a descendant of $n$ along the 0-branch and whose outgoing edges both point to $child_0(n)$. The correction of Formula (1) that takes this aspect into account is applied in the Expectation step. We now describe EMBLEM in detail. EMBLEM's main procedure consists of a cycle in which the procedures Expectation and Maximization are repeatedly called. The first one returns the log-likelihood $LL$ of the data, which is used in the stopping criterion: EMBLEM stops when the difference between the $LL$ of the current iteration and that of the previous iteration drops below a threshold $\epsilon$, or when this difference is below a fraction $\delta$ of the current $LL$. Procedure Expectation takes as input a list of BDDs, one for each example, and computes the expectation for each one, i.e., $P(Q, X_{ijk} = x)$ for all variables $X_{ijk}$ in the BDD. In the procedure we use $\eta^x(i, k)$ to indicate $\sum_{j\in g(i)} P(Q, X_{ijk} = x)$. Expectation first calls GetForward and GetBackward, which compute the forward and backward probabilities of the nodes and $\eta^x(i, k)$ for non-deleted paths only. Then it updates $\eta^x(i, k)$ to take deleted paths into account. The expectations are updated as follows: for all rules $i$ and for $k = 1$ to $n_i - 1$, $E[c_{ikx}] = E[c_{ikx}] + \eta^x(i, k)/P(Q)$, where $P(Q)$ is the backward probability of the root. Procedure Maximization computes the parameter values for the next EM iteration.
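A minimal sketch of the forward-backward computation on a hand-coded copy of the BDD of Figure 1(b) (the tuple-based node representation, the node names and the explicit processing order are assumptions for illustration, not EMBLEM's data structures):

```python
# Internal nodes are (name, pi, 1-child, 0-child); the leaves are 1 and 0.
# BDD of Figure 1(b) for the query epidemic: pi_111 = pi_121 = 0.6, pi_211 = 0.7.
X211 = ("X211", 0.7, 1, 0)
X121 = ("X121", 0.6, X211, 0)
ROOT = ("X111", 0.6, X211, X121)

def backward(node):
    """B(n): probability mass of the paths from n down to the 1 leaf."""
    if isinstance(node, int):
        return float(node)
    _, pi, c1, c0 = node
    return pi * backward(c1) + (1.0 - pi) * backward(c0)

def forward(order):
    """F(n): probability mass of the paths from the root to n; 'order'
    lists the internal nodes level by level, starting from the root."""
    F = {order[0][0]: 1.0}
    for name, pi, c1, c0 in order:
        for child, p_edge in ((c1, pi), (c0, 1.0 - pi)):
            if not isinstance(child, int):  # leaves accumulate no F value
                F[child[0]] = F.get(child[0], 0.0) + F[name] * p_edge
    return F

F = forward([ROOT, X121, X211])
p_q = backward(ROOT)                # B(root) = P(Q)
e1 = F["X211"] * backward(1) * 0.7  # e^1(n) = F(n) * B(child_1(n)) * pi
print(p_q, e1 / p_q)  # P(Q) ≈ 0.588; P(X211 = 1 | Q) ≈ 1 (cold must hold)
```

For large diagrams GetBackward would memoize shared sub-diagrams instead of recomputing them; the plain recursion above is enough for this three-node example.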
Procedure GetForward traverses the diagram one level at a time starting from the root level, where $F(root) = 1$, and for each node $n$ computes its contribution to the forward probabilities of its children. Function GetBackward computes the backward probability of nodes by traversing the tree recursively from the root to the leaves. More details can be found in [1].

4 Experiments

EMBLEM has been tested over three real-world datasets: IMDB, UW-CSE and Cora. We implemented EMBLEM in Yap Prolog and compared it with RIB [10]; CEM, an implementation of EM based on the cplint inference library [9]; LeProbLog [4]; and Alchemy [8]. All experiments were performed on Linux machines with an Intel Core 2 Duo E6550 (2333 MHz) and 4 GB of RAM. To compare our results with LeProbLog and Alchemy we exploited the translations of LPADs into ProbLog [2] and MLNs [10] respectively. For the probabilistic logic programming systems (EMBLEM, RIB, CEM and LeProbLog) we consider various options: associating a distinct random variable with each grounding of a probabilistic clause, or a single random variable with a non-ground clause to express whether the clause is used or not (the latter case makes the problem easier); putting a limit on the depth of derivations, thus eliminating explanations associated with derivations exceeding the limit (necessary for problems that contain cyclic clauses, such as transitive closure clauses); and setting the number of restarts for EM-based algorithms. All experiments for the probabilistic logic programming systems have been performed using open-world predicates. IMDB regards movies, actors, directors and movie genres and is divided into five mega-examples. We performed training on four mega-examples and testing on the remaining one. We then drew Precision-Recall and ROC curves and computed the areas under them (AUCPR and AUCROC). We defined four different LPADs: two for predicting the target predicate sameperson/2 and two for predicting samemovie/2.
We had one positive example for each fact that is true in the data, while negative examples were generated by sampling from the complete set of false facts three times the number of true instances. For predicting sameperson/2 we used the same LPAD as [10]. We ran EMBLEM on it with the following settings: no depth bound (the theory is acyclic), random variables associated with instantiations of the clauses (the learning time is very low), and a number of restarts chosen to match the execution time of EMBLEM with that of the fastest other algorithm. The queries that LeProbLog takes as input are obtained by annotating with 1.0 each positive example for sameperson/2 and with 0.0 each negative example for sameperson/2 obtained by random sampling. We ran LeProbLog for a maximum of 100 iterations or until the difference in Mean Squared Error (MSE) between two iterations became smaller than $10^{-5}$; the same was done in all the subsequent experiments. For Alchemy we used the preconditioned rescaled conjugate gradient discriminative algorithm for every dataset, and in this case we specified sameperson/2 as the only non-evidence predicate. A second LPAD, also taken from [10], has been created to evaluate the performance of the algorithms when some atoms are unseen. The settings are the same as for the previous LPAD. In this experiment Alchemy was run with the -withEM option that turns on EM learning.

Datasets and tools: IMDB: http://alchemy.cs.washington.edu/data/imdb; UW-CSE: http://alchemy.cs.washington.edu/data/uw-cse; Cora: http://alchemy.cs.washington.edu/data/cora; Yap Prolog: http://www.dcc.fc.up.pt/~vsc/Yap

Table 1 shows the AUCPR and AUCROC averaged over the five folds for EMBLEM, RIB, LeProbLog, CEM and Alchemy. Results for the two LPADs are shown in the IMDB-SP and IMDBu-SP rows respectively. Table 2 shows the learning times in hours. For predicting samemovie/2 we used the LPAD:

samemovie(X,Y):p :- movie(X,M),movie(Y,M),actor(M).
samemovie(X,Y):p :- movie(X,M),movie(Y,M),director(M).
samemovie(X,Y):p :- movie(X,A),movie(Y,B),actor(A),director(B),workedunder(A,B).
samemovie(X,Y):p :- movie(X,A),movie(Y,B),director(A),director(B),genre(A,G),genre(B,G).

To test the behaviour when unseen predicates are present, we transformed the program for samemovie/2 as we did for sameperson/2 [10]. We ran EMBLEM on these programs with no depth bound, one variable for each instantiation of a rule, and one random restart. With regard to LeProbLog and Alchemy, we ran them with the same settings as for IMDB-SP and IMDBu-SP, replacing sameperson with samemovie. Table 1 shows, in the IMDB-SM and IMDBu-SM rows, the average AUCPR and AUCROC for EMBLEM, LeProbLog and Alchemy. For RIB and CEM we obtained a lack-of-memory error (indicated with "me"). The Cora database contains citations to computer science research papers. For each citation we know the title, the authors, the venue and the words that appear in them. The task is to determine which citations refer to the same paper, by predicting the predicate samebib(cit1,cit2). From the MLN proposed in [13] we obtained two LPADs. The first contains 559 rules and differs from the direct translation of the MLN in that rules involving words are instantiated with the different constants, only positive literals for the hasword predicates are used, and transitive rules are not included. The Cora dataset comprises five mega-examples, each containing facts for the four predicates samebib/2, samevenue/2, sametitle/2 and sameauthor/2, which have been set as target predicates. We ran EMBLEM on this LPAD with no depth bound (the theory is acyclic), a single variable for each instantiation of a rule (the learning time is reasonable), and a number of restarts chosen to match the execution time of EMBLEM with that of the fastest other algorithm. The second LPAD adds to the previous one the transitive rules for the predicates samebib/2, samevenue/2 and sametitle/2, for a total of 563 rules.
In this case we had to run EMBLEM with a depth bound equal to two (the theory becomes cyclic, and with higher depth values the learning time was excessive) and a single variable for each non-ground rule (the LPAD is too complex to be treated with a variable for each instantiation); the number of restarts was one. (The MLN is available at http://alchemy.cs.washington.edu/mlns/er.) As for LeProbLog, we learned the four predicates separately, because learning the whole theory at once would give a lack-of-memory error. We annotated with 1.0 each positive example for samebib/2, sameauthor/2, sametitle/2 and samevenue/2, and with 0.0 the negative examples for the same predicates, which were contained in the dataset provided with the MLN. As for Alchemy, we learned weights with the four predicates as the non-evidence predicates. Table 1 shows in the Cora and CoraT (Cora transitive) rows the average AUCPR and AUCROC obtained by training on four mega-examples and testing on the remaining one. CEM and Alchemy on CoraT gave a memory error, while RIB was not applicable because it was not possible to split the input examples into smaller independent interpretations as required by RIB. The UW-CSE dataset contains information about the Computer Science department of the University of Washington through 22 different predicates, such as yearsInProgram/2, advisedBy/2 and taughtBy/3, and is split into five mega-examples. The goal here is to predict the advisedBy/2 predicate, namely the fact that a person is advised by another person: this was our target predicate. The negative examples have been generated by applying the closed-world assumption to advisedBy/2. The theory used was obtained from the MLN of [12] (available at http://alchemy.cs.washington.edu/mlns/uw-cse) and contains 86 rules. We ran EMBLEM on it with a single variable for each instantiation of a rule, a depth bound of two (cyclic theory) and one random restart (to limit time, in comparison with the other, faster algorithms).
The annotated queries that LeProbLog takes as input have been created by annotating with 1.0 each positive example and with 0.0 each negative example for advisedBy/2. As for Alchemy, we learned weights with advisedBy/2 as the only non-evidence predicate. Table 1 shows the AUCPR and AUCROC averaged over the five mega-examples for all the algorithms. Table 3 shows the p-values of a paired two-tailed t-test at the 5% significance level of the difference in AUCPR and AUCROC between EMBLEM and RIB/LeProbLog/CEM/Alchemy (significant differences in bold).

From the results we can observe that over IMDB EMBLEM has performance comparable with CEM on IMDB-SP, with similar execution time. On IMDBu-SP it performs better than all the other systems (see AUCPR), with a learning time equal to that of the fastest other algorithm. On IMDB-SM it reaches the highest area value in less time (only one restart is needed). On IMDBu-SM it still reaches the highest area with one restart, but with a longer execution time. Over Cora it has performance comparable with the best other system, CEM, but in a significantly lower time, and over CoraT it is one of the few systems able to complete learning, with better performance in terms of area (especially AUCPR) and time. Over UW-CSE it performs significantly better than all the other algorithms. Longer learning times are needed for EMBLEM on the IMDBu-SM and UW-CSE datasets, but in both cases AUCPR achieves significantly higher values. LeProbLog reveals itself to be the system closest to EMBLEM in terms of performance, and is in addition always able to complete learning, but with longer times (except for IMDBu-SM and UW-CSE).

Table 1. Results of the experiments on all datasets. IMDBu refers to the IMDB dataset with the theory containing unseen predicates. CoraT refers to the theory containing transitive rules. Numbers in parentheses followed by "r" indicate the number of random restarts (when different from one) needed to reach the area specified. "me" means memory error during learning, "no" means that the algorithm was not applicable. AUCPR is the area under the Precision-Recall curve, AUCROC is the area under the ROC curve, both averaged over the five folds. E is EMBLEM, R is RIB, L is LeProbLog, C is CEM, A is Alchemy.

           |               AUCPR                 |               AUCROC
Dataset    | E           R     L     C     A     | E           R     L     C     A
IMDB-SP    | 0.202(500r) 0.199 0.096 0.202 0.107 | 0.931(500r) 0.929 0.870 0.930 0.907
IMDBu-SP   | 0.175(40r)  0.166 0.134 0.120 0.020 | 0.900(40r)  0.897 0.921 0.885 0.494
IMDB-SM    | 1.000       me    0.933 0.537 0.820 | 1.000       me    0.983 0.709 0.925
IMDBu-SM   | 1.000       me    0.933 0.515 0.338 | 1.000       me    0.983 0.442 0.544
Cora       | 0.995(120r) 0.939 0.905 0.995 0.469 | 1.000(120r) 0.992 0.994 0.999 0.704
CoraT      | 0.991       no    0.970 me    me    | 0.999       no    0.998 me    me
UW-CSE     | 0.883       me    0.270 0.644 0.294 | 0.993       me    0.932 0.873 0.961

Table 2. Execution time in hours of the experiments on all datasets. R is RIB, L is LeProbLog, C is CEM and A is Alchemy.

Dataset    | EMBLEM  R      L      C      A
IMDB-SP    | 0.01    0.016  0.35   0.01   1.54
IMDBu-SP   | 0.01    0.0098 0.23   0.012  1.54
IMDB-SM    | 0.00036 me     0.005  0.0051 0.0026
IMDBu-SM   | 3.22    me     0.0121 0.0467 0.0108
Cora       | 2.48    2.49   13.25  11.95  1.30
CoraT      | 0.38    no     4.61   me     me
UW-CSE     | 2.81    me     1.49   0.53   1.95

Table 3. Results of the t-test on all datasets, relative to AUCPR and AUCROC. p is the p-value of a paired two-tailed t-test (significant differences at the 5% level in bold) between EMBLEM and all the others. R is RIB, L is LeProbLog, C is CEM, A is Alchemy.

           |            p - AUCPR                |            p - AUCROC
Dataset    | E-R    E-L       E-C    E-A         | E-R    E-L    E-C    E-A
IMDB-SP    | 0.2167 0.0126    0.3739 0.0134      | 0.3436 0.0012 0.3507 0.015
IMDBu-SP   | 0.1276 0.1995    0.001  4.5234e-5   | 0.2176 0.1402 0.0019 1.01e-5
IMDB-SM    | me     0.3739    0.0241 0.1790      | me     0.3739 0.018  0.2556
IMDBu-SM   | me     0.3739    0.2780 2.2270e-4   | me     0.3739 0.055  6.54e-4
Cora       | 0.011  0.0729    1      0.0068      | 0.0493 0.0686 0.4569 0.0327
CoraT      | no     0.0464    me     me          | no     0.053  me     me
UW-CSE     | me     1.5017e-4 0.0088 4.9921e-4   | me     0.0048 0.2911 0.0048
Looking at the overall results, AUCPR and AUCROC are higher for EMBLEM than for the other systems, or equal, except on IMDBu-SP, where LeProbLog achieves a higher AUCROC that is not statistically significant. The differences between EMBLEM and the other systems are statistically significant in 22 out of 43 cases.

5 Related Work

Our work has close connections with various other works. [6] proposed an EM algorithm for learning the parameters of Boolean random variables given observations of the values of a Boolean function over them, represented by a BDD. EMBLEM is an application of that algorithm to probabilistic logic programs. Independently, [14] also proposed an EM algorithm over BDDs to learn parameters for the CPT-L language. [5] presented the CoPrEM algorithm, which performs EM for the ProbLog language. We differ from this work in the construction of BDDs: they build a BDD for a whole interpretation, while we build one for each single ground atom of the specified target predicate(s), the one(s) for which we are interested in good predictions. Moreover, CoPrEM treats missing nodes as if they were present and updates the counts accordingly. Other approaches for learning probabilistic logic programs employ constraint techniques, use EM, or adopt gradient descent. Among the approaches that use EM, [7] first proposed to use it to induce parameters, and the Structural EM algorithm to induce ground LPAD structures. Their EM algorithm, however, works on the underlying Bayesian network. RIB [10] performs parameter learning using the information bottleneck approach, an extension of EM targeted especially towards hidden variables. Among the works that use a gradient descent technique we mention LeProbLog [4], which tries to find the parameters of a ProbLog program that minimize the MSE of the query probabilities and uses BDDs to compute the gradient.
Alchemy [8] is a state-of-the-art SRL system that offers various tools for inference, weight learning and structure learning of Markov Logic Networks (MLNs). MLNs differ significantly from the languages under the distribution semantics, since they extend first-order logic by attaching weights to logical formulas but do not allow the exploitation of logic programming techniques.

6 Conclusions

We have proposed a technique that applies an EM algorithm to BDDs for learning the parameters of Logic Programs with Annotated Disjunctions. It can be applied to all languages that are based on the distribution semantics, and it exploits the BDDs that are built during inference to efficiently compute the expectations for the hidden variables. We ran the algorithm over the real datasets IMDB, UW-CSE and Cora, and evaluated its performance, together with that of four other systems, through the AUCPR. The results show that EMBLEM uses less memory than RIB, CEM and Alchemy, allowing it to solve larger problems. Moreover, its speed allows it to perform a high number of restarts, helping it escape local maxima. In the future we plan to extend EMBLEM to learning the structure of LPADs.

References

1. Bellodi, E., Riguzzi, F.: EM over binary decision diagrams for probabilistic logic programs. Tech. Rep. CS-2011-01, ENDIF, Università di Ferrara (2011)
2. De Raedt, L., Demoen, B., Fierens, D., Gutmann, B., Janssens, G., Kimmig, A., Landwehr, N., Mantadelis, T., Meert, W., Rocha, R., Santos Costa, V., Thon, I., Vennekens, J.: Towards digesting the alphabet-soup of statistical relational learning. In: NIPS Workshop on Probabilistic Programming (2008)
3. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its application in link discovery. In: International Joint Conference on Artificial Intelligence, pp. 2462–2467 (2007)
4. Gutmann, B., Kimmig, A., Kersting, K., De Raedt, L.: Parameter learning in probabilistic databases: A least squares approach.
In: European Conference on Machine Learning. LNCS, vol. 5211, pp. 473–488. Springer (2008)
5. Gutmann, B., Thon, I., De Raedt, L.: Learning the parameters of probabilistic logic programs from interpretations. Tech. Rep. CW 584, Department of Computer Science, Katholieke Universiteit Leuven, Belgium (June 2010)
6. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algorithm by BDDs. Tech. Rep. TR08-0004, CS Dept., Tokyo Institute of Technology (2008)
7. Meert, W., Struyf, J., Blockeel, H.: Learning ground CP-Logic theories by leveraging Bayesian network learning techniques. Fund. Inf. 89(1), 131–160 (2008)
8. Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62(1-2), 107–136 (2006)
9. Riguzzi, F.: Extended semantics and inference for the Independent Choice Logic. Log. J. IGPL 17(6), 589–629 (2009)
10. Riguzzi, F., Di Mauro, N.: Applying the information bottleneck to statistical relational learning. Mach. Learn. (2011), to appear
11. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: International Conference on Logic Programming. pp. 715–729. MIT Press (1995)
12. Singla, P., Domingos, P.: Discriminative training of Markov logic networks. In: National Conference on Artificial Intelligence. pp. 868–873. AAAI Press/The MIT Press (2005)
13. Singla, P., Domingos, P.: Entity resolution with Markov logic. In: International Conference on Data Mining. pp. 572–582. IEEE Computer Society (2006)
14. Thon, I., Landwehr, N., De Raedt, L.: A simple model for sequences of relational state descriptions. In: European Conference on Machine Learning. LNCS, vol. 5212, pp. 506–521. Springer (2008)
15. Vennekens, J., Verbaeten, S., Bruynooghe, M.: Logic programs with annotated disjunctions. In: International Conference on Logic Programming. LNCS, vol. 3131, pp. 195–209. Springer (2004)

Clustering XML Documents by Structure: a Hierarchical Approach (Extended Abstract)

G. Costa, G. Manco, R. Ortale, and E.
Ritacco
ICAR-CNR, Via Bucci 41c, 87036 Rende (CS), Italy

Abstract. A new parameter-free approach to clustering XML documents by structure is proposed. The idea is to consider various forms of structural patterns occurring in the XML documents to form a hierarchy of nested clusters. At any level in the hierarchy, clusters explain how the XML documents can be grouped on the basis of common structural patterns of the form considered at that level. The resulting explanation is progressively refined at the subsequent level, where another type of structural pattern is used to divide the individual clusters from the level above into subgroups, revealing meaningful and previously uncaught structural differences. Each cluster in the hierarchy is summarized, through a novel technique, into a corresponding representative that provides a clear and differentiated understanding of the structural information within the cluster.

1 Introduction

The problem of clustering XML documents by structure has been extensively investigated, with the consequent development of several approaches, such as [5, 7–10]. XML trees can share various forms of common structural components, ranging from simple nodes/edges and pairwise tags [11] to more complex substructures such as groups of siblings, paths (either root-to-node [11] or root-to-leaf [7]), subtrees, or even summaries [9, 10]. Therefore, if the addressed form of structural patterns does not accord with the underlying properties of the XML data, valuable relationships of structural resemblance between the XML documents can be missed, with a consequent degradation of clustering effectiveness. Moreover, judging differences only in terms of one type of structural component may not suffice to effectively separate the available XML documents. This paper proposes a new hierarchical approach to clustering that considers various forms of structural patterns in the XML documents to progressively derive a hierarchy of nested clusters.
In addition, the characterization of each cluster is accomplished by means of a new summarization method, aimed at subsuming the structural properties within each cluster in terms of strongly representative substructures.

2 Partitioning XML Trees

We introduce the notation used throughout the paper, as well as some basic concepts. The structure of XML documents without references can be modeled in terms of rooted ordered labeled trees, which represent the hierarchical relationships among the document elements (i.e., nodes).

Definition 1 (XML Tree). An XML tree is a rooted, labeled, ordered tree, represented as a tuple t = (rt, Vt, Et, λt), whose individual components have the following meaning. Vt is a set of nodes and rt ∈ Vt is the root node of t, i.e. the only node with no entering edges. Et ⊆ Vt × Vt is a set of edges, capturing the parent-child relationships between nodes of t. Finally, λt : Vt → Σ is a node labeling function, where Σ is an alphabet of node tags (i.e., labels).

The parent-child relationship in t is denoted by ni ≺ nj, where ni, nj ∈ Vt are such that (ni, nj) ∈ Et; ni is the parent, while nj is the child. The ancestor-descendant relationship is indicated as ni ≺p nj, where p is the distance (in nodes) between the ancestor and the descendant (ni ≺1 nj is equivalent to ni ≺ nj). Tree-like structures are also used to represent generic structural patterns occurring across a collection of XML trees.

Definition 2 (Substructure). Let t and s be two XML trees. s is said to be a substructure of t if there exists a total function ϕ : Vs → Vt that satisfies the following conditions for each n, ni, nj ∈ Vs. Firstly, (ni, nj) ∈ Es iff ϕ(ni) ≺p ϕ(nj) in t with p ≥ 1. Secondly, λs(n) = λt[ϕ(n)].

The mapping ϕ preserves node labels and hierarchical relationships. In this latter regard, depending on the value of p, two definitions of substructure can be distinguished.
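As an illustration of Definition 2, the following sketch checks the contiguous (p = 1) case of the substructure relation. The tuple encoding of trees and the function names are our own assumptions, not the paper's notation.

```python
# Illustrative sketch (not from the paper): an XML tree as a
# (label, children) tuple, with a check for the p = 1 case of
# Definition 2: an induced pattern matching a contiguous portion of t.

def matches_at(s, t):
    """True if pattern s matches t at t's root: equal labels, and the
    children of s map, in order, onto matching children of t."""
    (s_label, s_children), (t_label, t_children) = s, t
    if s_label != t_label:
        return False
    i = 0
    for sc in s_children:
        # advance over t's children until one matches this child of s
        while i < len(t_children) and not matches_at(sc, t_children[i]):
            i += 1
        if i == len(t_children):
            return False
        i += 1
    return True

def is_substructure(s, t):
    """s matches at some node of t (the relation written s ⊑ t)."""
    return matches_at(s, t) or any(is_substructure(s, c) for c in t[1])

book = ("book", [("title", []), ("author", [("name", []), ("email", [])])])
assert is_substructure(("author", [("name", [])]), book)
assert not is_substructure(("author", [("isbn", [])]), book)
```

The embedded (p ≥ 1) variant would instead allow each child of the pattern to match any proper descendant, not only a direct child.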
In the simplest case, p = 1, and a substructure s is simply an induced tree pattern that matches a contiguous portion of t; this is indicated as s ⊑ t. When p ≥ 1 [6, 12], s matches not necessarily contiguous portions of t; this is denoted as s ⊆ t, and s is also said to be an embedded tree pattern of t.

Our clustering is based on structural similarity. Two documents are similar if they share some elements, which can be nodes, edges (parent-child relationships), paths (ancestor-descendant relationships), etc. For this reason we choose to cluster the documents in a multi-stage way, considering one by one a set of elements belonging to a specific feature space (nodes, edges, paths, etc.). At a given stage i, finding clusters in the high-dimensional feature space S(i) is a challenging issue for various reasons [3]. The XML trees are partitioned by the AT-DC algorithm [3], which is an effective hierarchical and parameter-free method for transactional clustering. The main clustering procedure is reported in Fig. 1. It consists of m stages of clustering (line 1). The end user incorporates (at line 1) valuable domain knowledge and application semantics into the clustering process, by establishing the most appropriate set of structural features S(i) for each stage as well as the overall number m of stages.

Generate-Hierarchy(D)
Input: a set D = {t1, ..., tN} of XML trees;
Output: a set ∪i P(i) of multiple cluster partitions;
 1: let S(i) be the set of features at stage i, with i = 1, ..., m;
 2: let i ← 1;
 3: let P ← {D};
 4: while i ≤ m do
 5:   while P ≠ ∅ do
 6:     let C be a cluster in P;
 7:     P ← P − C;
 8:     R ← Generate-Clusters(C, S(i));
 9:     for each C′ ∈ R do
10:       let C̄′ ← R − {C′} be the set of siblings of C′;
11:     end for
12:     P(i) ← P(i) ∪ R;
13:   end while
14:   for each C ∈ P(i) do
15:     Rep(C) ← MineRep(C, C̄, α);
16:   end for
17:   P ← P(i);
18:   i ← i + 1;
19: end while
20: RETURN ∪i P(i);

Fig. 1.
The hierarchical clustering process.

The generic stage i (lines 4-19) consists of two phases: cluster separation and summarization. Cluster separation exploits AT-DC to divide the individual clusters belonging to the current partition P with respect to the feature space S(i) (lines 5-13). At the beginning, i.e. when i = 1, the current partition P includes a single cluster, which coincides with the whole dataset D of XML trees (line 3). The partition P(i) resulting at the end of stage i (line 13) is itself a collection of partitions. More precisely, at the current stage i, each parent cluster C from P(i−1) is divided into an appropriate number of child clusters (line 8), which together form the partition R of the aforesaid C. At this point, each child cluster C′ in R is associated (lines 9-11) with its siblings C̄′ = R − {C′} (for the purpose of cluster summarization), and R is then added to the ongoing P(i). Cluster summarization (lines 14-16) is applied to each cluster C from the obtained P(i). It consists of a procedure, discussed in Section 3, which associates C with a set Rep(C) of representative substructures that subsume the structural information within C. P(i) becomes (at line 17) the current partition P for the subsequent stage i + 1. At this stage, AT-DC is re-applied to further divide every cluster C ∈ P(i) with respect to another set of structural features, i.e., S(i+1). The choice of a distinct feature space at each stage guarantees a progressively increasing degree of structural homogeneity. Moreover, at each distinct stage, representatives provide a summarization of the tree structures within the corresponding clusters in terms of (a combination of) the structural features considered at that particular stage. Hence, the representative of a subcluster highlights local patterns of structural homogeneity that are not caught by the representative of the parent cluster.
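The multi-stage control flow of Generate-Hierarchy can be condensed into a few lines. In this sketch `split` is merely a stand-in for the AT-DC call of line 8 and the summarization phase is omitted; it illustrates the looping structure only, not the authors' implementation.

```python
# Hedged sketch of the multi-stage loop of Fig. 1. `split` stands in for
# AT-DC (Generate-Clusters) and `feature_spaces` for S(1), ..., S(m).

def generate_hierarchy(docs, feature_spaces, split):
    """Each stage refines every cluster of the previous partition
    with respect to its own feature space S(i)."""
    partitions = []            # collects P(1), ..., P(m)
    current = [docs]           # P initially holds the single cluster D
    for features in feature_spaces:
        stage = []
        for cluster in current:
            stage.extend(split(cluster, features))   # AT-DC stand-in
        partitions.append(stage)
        current = stage        # P(i) feeds stage i + 1
    return partitions

# Toy run: group strings first by initial character, then by length.
def split(cluster, features):
    groups = {}
    for d in cluster:
        key = d[0] if features == "head" else len(d)
        groups.setdefault(key, []).append(d)
    return list(groups.values())

hierarchy = generate_hierarchy(["aa", "ab", "ba", "bbb"],
                               ["head", "len"], split)
# Stage 1 yields 2 clusters; stage 2 refines them into 3.
```

Each stage's partition is a refinement of the previous one, mirroring the progressively increasing homogeneity discussed above.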
3 Cluster Summarization

The representative of a cluster of XML trees is modeled as a set of highly representative tree patterns, which provide an intelligible summarization of the most relevant structural properties in the cluster. Notice that, as mentioned before, a cluster is already characterized by a set of relevant features. However, features can be combined further, and they do not necessarily allow different clusters to be distinguished. A set of tree patterns is actually viewed as the representative of a cluster of XML trees if its frequency in the cluster is much higher than elsewhere and it exhibits a strong degree of correlation with the documents of the cluster. A representative can be computed by merging patterns. To avoid combinatorial explosion, we consider only two types of tree pattern composition, namely parent-child and sibling tree pattern composition.

Definition 3 (Parent-child tree pattern). A parent-child tree pattern is an arrangement of two basic tree patterns, in which one of the two tree patterns is rooted at some leaf node of the other. Let si and sj be two generic tree patterns, and assume that l is some leaf node of si. The operator si /l sj defines a new parent-child tree pattern s, such that |Vs| = |Vsi| + |Vsj| and |Es| = |Esi| + |Esj| + 1, wherein the root rsj of sj is a child of l. Given any two tree patterns si and sj, the set of all possible parent-child tree patterns in which the root of sj is a child of one of the individual leaves of si is denoted as

si / sj = ∪l∈Lsi {si /l sj}

where Lsi represents the set of leaves of si.

A parent-child tree pattern is a vertical arrangement of two component tree patterns. Instead, a sibling tree pattern follows from a horizontal arrangement of its components.

Definition 4 (Sibling tree pattern).
Given two tree patterns with the same label at their roots, a sibling tree pattern is a composite structure whose root-to-leaf paths are the union of the root-to-leaf paths in the two component patterns. Let si and sj be two tree patterns such that λsi(rsi) = λsj(rsj).

The Representative Discovery Procedure is an Apriori-based technique whose candidate generation phase is performed through these compositions.

4 Evaluation

The behavior of the devised clustering approach is now investigated through an empirical evaluation with three objectives: the assessment of clustering quality, the evaluation of cluster summarization, and a performance comparison. All experiments were conducted on a Windows machine with an Intel Itanium processor, 2 GB of memory and a 2 GHz clock speed. Standard benchmark data sets were employed for a direct comparison against the competitors. The real-world data, named Real, encompasses the following collections: Astronomy (217 documents), Forum (264 messages), News (64 documents), Sigmod (51 documents), Wrapper (53 documents). The distribution of tags within the above documents is quite heterogeneous, due to the complexity of the DTDs associated with the classes and to the documents' semantics. Three further synthetic data sets were generated from as many collections of DTDs reported in [5]. The first synthesized data set, referred to as Synth1, comprises 1000 XML documents produced from a collection of 10 heterogeneous DTDs (illustrated in fig. 6 of [5]), each of which was used to generate 100 XML documents. These DTDs exhibit strong structural differences and, hence, most clustering algorithms can produce high-quality results. A finer evaluation can be obtained by investigating the behavior of the compared algorithms on a collection of XML documents that are very similar to one another from a structural point of view.
To perform such a test, a second synthesized data set, referred to as Synth2 and consisting of 3000 XML documents, was assembled from 3 homogeneous DTDs (illustrated in fig. 7 of [5]), each of which was used to generate 1000 XML documents. Experiments over Synth2 clearly highlight the ability of the competitors to operate in extremely challenging applicative settings, wherein the XML documents share multiple forms of structural patterns. Additionally, Synth3 is a collection consisting of the synthesized documents in [7], which exhibit a 30% degree of overlap. Synth3 allows us to compare the effectiveness of the devised approach to the approach proposed in [7]. Clustering effectiveness is evaluated over each partition Pi = {C1, ..., Ck} and is measured in terms of average precision and recall [1]. Table 1 shows the results of clustering on such collections. As we can see, precision and recall are optimal, even for the collection Synth2 of homogeneous documents.

Collection | N. of Docs | Classes | Clusters | Avg Precision | Avg Recall | Avg Γ  | Time
Real       | 649        | 5       | 5        | 1             | 1          | 0.9558 | 20.48s
Synth1     | 1000       | 10      | 10       | 1             | 1          | 0.9455 | 13.32s
Synth2     | 3000       | 3       | 3        | 1             | 1          | 0.3833 | 7.5s
Synth3     | 1400       | 7       | 7        | 1             | 1          | 0.7875 | 2.68s
Synth4     | 800        | 8       | 10       | 1             | 0.8        | 0.7127 | 3.68s

Table 1. Evaluation of separability and homogeneity

To investigate the effectiveness of the Generate-Hierarchy procedure more deeply, we produced a new data set, Synth4, which requires multi-layer clustering over all the features we consider. It is composed of 800 documents whose schemas are shown in fig. 2. The DTDs capture substantial similarities and differences. In particular, all DTDs exhibit different paths (though they can share some edges). The documents in DTD4 can be further split, since they can exhibit trees with paths ending in the node A6.

Fig. 2. DTDs for the Synth3 dataset

Also, node frequencies in DTD4 can substantially differ, thus differentiating this DTD from the others even at the node level.
This situation is fully captured by the clustering algorithm, as shown in fig. 3(a). DTD4 is separated from DTD3 at the node level and further split into two subclusters at the edge level, according to whether or not the trees contain the edge (A9, A6). Also, the trees containing such an edge can be further split according to whether or not they contain the path from A10 to A6. Notice that, on the contrary, DTD8 does not behave similarly, since there is no node like A6 that differentiates the trees in the class.

Fig. 3. (a) Cluster hierarchy for Synth3; (b) cluster hierarchy for Sigmod

The evaluation of the multi-stage clustering is further confirmed by experimenting on Sigmod. As already mentioned, this dataset consists of documents complying with three different DTDs. In particular, the distribution of the documents is unbalanced, since one of the DTDs, named IndexTermsPage, contains many more documents than the other ones. Figure 3(b) shows that Generate-Hierarchy separates all documents complying with different DTDs and further splits the documents in the class related to IndexTermsPage, according to whether or not these documents contain the optional elements described in the DTD (mainly, categoryAndSubjectDescriptorsTuple, category, content and term). In particular, the separation of such a class leads to two subclasses C1 and C2, which can be described by two DTDs, both subsumed by IndexTermsPage. The difference between C1 and C2 is the absence (in C1) and the presence (in C2) of the elements of IndexTermsPage. The evaluation of the accuracy of cluster summarization is inspired by an idea originally proposed in [5] for a different purpose, i.e., measuring the structural homogeneity of a set of intermediate clusters obtained while partitioning a collection of XML documents. Let t be an XML tree and R a set of substructures.
The representativeness γ(R, t) of R with respect to t is the fraction of nodes in t matched by the embedded substructures of R:

γ(R, t) = |∪s∈R, s⊆t {n ∈ V | Vs ↦ V ⊆ Vt}| / |Vt|

where Vt and Vs are the sets of nodes of, respectively, the XML tree t and the generic substructure s. V is instead the subset of the nodes in t matched by the nodes of s (which is the meaning of the notation Vs ↦ V). Representativeness can be easily generalized to clusters: the representativeness Γ[Rep(C)] of the representative Rep(C) with respect to a cluster C can be defined as the average representativeness over the documents in the cluster. A connection between cluster representativeness and structural homogeneity explains unexpectedly low Γ values. Representativeness is high when the representative frequently occurs in a cluster but not in the other ones. In the case of homogeneous documents, the erasure of candidate structures is very frequent: the only structures that survive are very specific, so they are infrequent in the cluster (as happens, for example, in Synth2). Table 1 shows the average Γ value exhibited in each experiment. In order to evaluate the scalability of the algorithm, we used the DTDs for Synth1 and produced 100, 1000, 10,000 and 100,000 documents, with 2, 4, and 8 clusters. The results are shown in fig. 4: the algorithm is linear both in the number of documents and in the number of clusters.

Fig. 4. Performance in milliseconds for data sets of different size

At the end of this intensive empirical evaluation, the devised approach can be compared against a state-of-the-art competitor, namely the XProj approach [5]. By looking at the performance of XProj reported in [5], it can be observed that our clustering approach attains the same quality.
However, two strong advantages of the proposed approach are: the development of a hierarchy of nested clusters, explaining multiple forms of structural relationships in the data; and the summarization of a cluster of XML documents, which provides an intelligible subsumption of its structural properties. Also, notice that the scalability of our approach is orders of magnitude higher than that of XProj. Finally, the devised approach is fully automatic (i.e., parameter-free), whereas the optimal performance of XProj on each data set is the consequence of a complex setting process.

5 Conclusions

A new approach to clustering XML documents was proposed, which produces a hierarchy of nested clusters. Along the paths from the root to the leaves of the hierarchy, the approach progressively separates the XML data by looking at the occurrence of different types of structural patterns in their structures. Also, each cluster in the hierarchy is subsumed, through a novel summarization method, by a set of representative substructures that provide an understanding of the structural properties considered in the cluster. A comparative evaluation proved that the devised approach is on a par with, and even better than, established competitors in terms of effectiveness, scalability and cluster summarization.

References

1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
2. R. Baumgartener, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In Procs. VLDB'01 Conf., pages 119–128, 2001.
3. E. Cesario, G. Manco, and R. Ortale. Top-down parameter-free clustering of high-dimensional categorical data. IEEE TKDE, 19(12):1607–1624, 2007.
4. V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In Procs. VLDB'01 Conf., pages 109–118, 2001.
5. C. C. Aggarwal et al. XProj: A framework for projected structural clustering of XML documents. In Procs. SIGKDD'07 Conf., pages 46–55, 2007.
6. C. Wang et al. Efficient pattern-growth methods for frequent tree pattern mining. In Procs. PAKDD'04 Conf., pages 441–451, 2004.
7. G. Costa et al. A tree-based approach to clustering XML documents by structure. In Procs. PKDD'04 Conf., pages 137–148, 2004.
8. M. L. Lee et al. XClust: Clustering XML schemas for effective integration. In Procs. CIKM'02 Conf., pages 292–299, 2002.
9. T. Dalamagas et al. A methodology for clustering XML documents by structure. Information Systems, 31(3):187–228, 2006.
10. W. Lian et al. An efficient and scalable algorithm for clustering XML documents by structure. IEEE TKDE, 16(1):82–96, 2004.
11. S. Helmer. Measuring the structural similarity of semistructured documents using entropy. In Procs. VLDB'07 Conf., pages 1022–1032, 2007.
12. M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE TKDE, 17(8):1021–1035, 2005.

Outlier Detection For XML Documents (Extended Abstract)

Giuseppe Manco and Elio Masciari
ICAR-CNR
{manco,masciari}@icar.cnr.it

Abstract. XML (eXtensible Markup Language) has become in recent years the standard for data representation and exchange on the WWW. This has resulted in a great need for data cleaning techniques that can identify outlying data. In this paper, we present a technique for outlier detection that singles out anomalies with respect to a relevant group of objects. We exploit a suitable encoding of XML documents as signals of fixed frequency, which can be transformed using Fourier Transforms. Outliers are identified by simply looking at the signal spectra. The results show the effectiveness of our approach.

1 Introduction

An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism [8].
There exist several approaches to the identification of outliers, namely statistical-based [5], deviation-based [4], distance-based [3], density-based [6], projection-based [1], MDEF-based [12], and others. Abstracting from the specific method being exploited, the general outlier detection task is the problem of identifying deviations from the general patterns characterizing a data set. Detecting outliers is important in many application scenarios; as an example, it can be used to improve data cleaning approaches, where outliers are often data noise or errors that diminish the accuracy of data mining. Outlier detection is also at the core of applications such as fraud detection, stock market analysis, intrusion detection, marketing, network sensors, and email spam detection, where irregular patterns call for special attention. Due to the increasing usage of semi-structured data models like XML (eXtensible Markup Language), which is the standard for data representation and exchange on the Web, there is a great need for outlier detection strategies tailored to such data. Although outlier detection methods are well established for relational data, adapting them directly to XML data is unfeasible, because the XML and relational data models differ in several aspects. First, XML data contain multiple levels of nested elements (or attributes) organized in a tree-based structure, whereas relational data models have a flat tabular structure. Indeed, the hierarchical structure of XML data induces an ordering that is lacking in relational data. Also, the modeling objectives for XML and relational data are different, and therefore different relationships are represented. In relational data models, the primary-foreign key relationships between entities form the basis for data normalization and referential integrity.
On the contrary, relationships between XML elements are encoded in hierarchies, often with a direct semantic correspondence to real-world relations such as containment and composition. Despite its importance, XML outlier detection has not been paid the attention it deserves. There exist few works addressing structural and attribute outlier detection for XML. The main distinction is between class outliers and attribute outliers, i.e., respectively, outliers based on the overall structure of the document and outliers based on univariate points that exhibit deviating correlation behavior with respect to other attributes [10]. In [10] an approach is presented for correlation-based attribute outlier detection, while the main approaches for class outlier detection try to adapt the techniques defined for the relational setting to the semistructured one. They have been mainly proposed for data cleaning purposes, as in [15, 14, 16].

Our approach. In this paper we tackle the class outlier detection problem. The basic intuition exploited in this paper is that an XML document has a "natural" interpretation as a time series (namely, a discrete-time signal), in which numeric values summarize some relevant features of the elements enclosed within the document. We can get an example of this observation by simply indenting all the tags in a given document according to their nesting level. Indeed, the sequence of indentation marks (as they appear within the document rotated by 90 degrees) can be looked at as a time series, whose shape roughly describes the document's structure. Hence, a key tool in the analysis of time-series data is the use of the Discrete Fourier Transform (DFT): some useful properties of the Fourier transform, such as energy concentration or invariance under shifts, enable signals to be analyzed and manipulated in a very powerful way. The choice of comparing the frequency spectra is driven by both effectiveness and efficiency considerations.
Indeed, the exploitation of the DFT makes it possible to abstract from structural details which, in most application contexts, should not affect the similarity estimation (such as, e.g., different numbers of occurrences of a shared element, or small shifts in the actual positions where it appears). This eventually makes the comparison less sensitive to minor mismatches. Moreover, a frequency-based approach allows the similarity to be estimated through simple measures (e.g., vector distances), which are computationally less expensive than techniques based on the direct comparison of the original document structures. To summarize, we propose to represent the structure of an XML document as a time series, in which each occurrence of a tag in a given context corresponds to an impulse. By analyzing the frequency spectra of the signals, we can hence state the degree of (structural) similarity between documents. It is worth noticing that the overall cost of the proposed approach is only O(N log N), where N is the maximum number of tags in the documents to be compared. Once an effective distance measure is defined, we exploit a distance-based outlier detection algorithm in order to single out the outlying documents.

2 Problem Statement and Overview of the Proposal

We begin by presenting the basic notation for XML documents that will be used hereafter. An XML document is characterized by tags, i.e., terms enclosed between angled brackets. Tags define the structure of an XML document and provide the semantics of the information enclosed. A tag is associated with a tag name (the term enclosed between angled brackets) and can appear in two forms: either as a start-tag (e.g., <author>) or as an end-tag (characterized by the / symbol, as in </author>). Finally, a tag instance denotes the occurrence of a tag within a certain document. It is required that, in a well-formed XML document, tags are properly nested, i.e.
each start-tag has a corresponding end-tag at the same level. Therefore, an XML document can be considered as an ordered tree, where each node (an element) represents a portion of the document, delimited by a pair of start-tag and end-tag instances, and denoted by the tag name associated with the instances. The structure of an XML document corresponds to the shape of the associated tree. In a tree, several types of structural information can be detected, which correspond to different refinement levels: for example, attribute/element labels, edges, paths, subtrees, etc. Defining the similarity between two documents essentially means choosing an appropriate refinement level and comparing the documents according to the features they exhibit at the chosen level. Different choices may result in rather dissimilar behaviors: in particular, comparing simple structural components (such as, e.g., labels or edges) allows for an efficient computation of the similarity, but typically produces coarse-grained similarity values. On the other hand, complex structural components would make the computation of similarity inefficient, and hence impractical. Consider, for example, the documents represented in Fig. 1. If a comparison of nodes or edges is exploited, documents book1 and book2 appear to be similar, even though the subtrees rooted at the book element appear with different frequencies. Accounting for frequencies does not always help: for example, if the order of appearance of the subtrees of the xml element in book3 were changed, the resulting tree would still have the same number of nodes, edges and even paths. In principle, approaches based on tree-edit distance [11] can better quantify the difference between XML trees; however, they turn out to be too expensive in many application contexts, as they are generally quadratic w.r.t. document sizes.
Finally, notice that solutions based on detecting local substructures [9] to be used as features may be even harder to handle, as they exhibit two main disadvantages: first, they may imply ineffective representations of the trees in high-dimensional spaces, and second, costly feature extraction algorithms are required.

Fig. 1. book1 and book2 have the same elements, but with different cardinality. By contrast, book3 induces a different structure for the author element.

In our opinion, an effective notion of structural similarity should take into account a number of issues. First of all, it is important to notice that each document may induce a definition of the elements involved. Thus, an appropriate comparison between two documents should rely on the comparison of such definitions: the more different they are, the more dissimilar the documents are, and this information has to be exploited for signaling candidate outliers. Our main objective is the development of an efficient method which is able to approximate the above features at best. Thus, we can state the problem of finding the structural similarity in a set of XML documents as follows. Given a set D = {d1, ..., dn} of XML documents, we aim at building a similarity matrix S, i.e., a matrix representing, for each pair (di, dj) of documents in D, an optimal measure of similarity sij. Here, optimality refers to the capability of reflecting the differences described above. Observe that we do not address here the problem of finding which parts of two documents are similar or not, as, e.g., tree-edit based techniques do. We propose a technique which is essentially based on the idea of associating each document with a time series representing, in a suitable way, both its basic elements and their relationships within the document.
More precisely, we can assume a preorder visit of the tree structure of an XML document. As soon as we visit a node of the tree, we emit an impulse encoding the information corresponding to the tag. The resulting signal represents the original document as a time series, from which relevant features characterizing the document can be efficiently extracted. As a consequence, the comparison of two documents can be accomplished by looking at their corresponding signals. The main features of the approach can be summarized as follows: 1) Each element is encoded as a real value; hence, the differences in the values of the sequence provide an evaluation of the differences in the elements contained in the documents. 2) The substructures in the documents are encoded using different signal shapes; as a consequence, the analysis of the shapes in the sequences realizes the comparison of the definitions of the elements. 3) Context information can be used to encode both basic elements and substructures, so that the analysis can be tuned to handle differently mismatches which occur at different hierarchical levels. In a sense, the analysis of the way the signal shapes differ can be interpreted as the detection of different definitions for the elements involved in the documents. Moreover, the analysis of the frequencies of common signal shapes can be seen as the detection of the differences between the occurrences associated with a repetition marker. In this context, the proposed approach can be seen as an efficient technique, which can satisfactorily evaluate how similar two documents are w.r.t. the structural features previously discussed. Notably, the use of time series for representing complex XML structures, combined with an efficient frequency-based distance function, is the key for quickly evaluating structural similarities: if N is the maximum number of tags in two documents, they can be compared in only O(N log N) time.
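A minimal sketch of this encoding idea follows. The actual tag-encoding function is the one defined in [7]; here we simply assume a hypothetical encoding that maps each tag name to a fixed real value and weights it by nesting depth to carry context information:

```python
import xml.etree.ElementTree as ET

def encode_document(xml_text, tag_values):
    """Preorder visit of the XML tree: emit one impulse per tag.
    tag_values is an assumed map from tag names to real values; the
    depth weighting is a hypothetical stand-in for the multilevel
    context encoding of the actual method."""
    root = ET.fromstring(xml_text)
    signal = []

    def visit(node, depth):
        signal.append(tag_values[node.tag] / (2 ** depth))
        for child in node:
            visit(child, depth + 1)

    visit(root, 0)
    return signal

enc = {"xml": 1.0, "book": 2.0, "title": 3.0, "author": 4.0}
sig = encode_document("<xml><book><title/><author/></book></xml>", enc)
```

The preorder visit guarantees that a subtree always maps onto a contiguous subsequence of the signal, which is what makes shape-based comparison meaningful.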
In particular, the use of the DFT supports the above-described notion of similarity: if two documents share many elements having a similar definition, they will be recognized as similar, even when there are repeated and/or optional sub-elements. Indeed, working on frequency spectra makes the comparison less sensitive to differences in the number of occurrences of a given element and to small shifts in the actual positions where it occurs in the documents. The details on representing an XML document as a signal are omitted here due to space limitations; a complete explanation can be found in [7].

3 Comparing Documents using DFT

Having defined a proper document encoding, we can now detail the similarity measures for XML documents sketched in Section 1. As already mentioned, we assume that we visit the tree structure of an XML document d (using a preorder visit) starting from an initial time t0. We also assume that each tag instance occurs after a fixed time interval Δ. The total time spent to visit the document is NΔ, where N is the size of tags(d). During the visit, as we find a tag, we produce an impulse which depends on a particular tag encoding function e and on the overall structure of the document (i.e., the document encoding function enc). As a result of the above physical simulation, the visit of the document produces a signal hd(t), which usually changes its intensity in the time interval [t0, t0 + NΔ). The intensity variations are directly related to the opening/closing of tags:

    hd(t) = [enc(d)](k)   if t0 + kΔ ≤ t < t0 + (k+1)Δ
    hd(t) = 0             if t < t0 or t ≥ t0 + NΔ

Comparing such signals, however, might be as difficult as comparing the original documents. Indeed, comparing documents having different lengths requires the combination of both resizing and alignment operations.
Moreover, the intensity of a signal strongly depends on the encoding scheme adopted, which can in turn depend on the context (as in the case, e.g., of the multilevel encoding scheme). In order to compare two documents di and dj, hence, we can exploit the properties of the corresponding transforms. In particular [2, 13], one possibility is to exploit the fact that, by Parseval's theorem, energy (total power) is an invariant of the transformation (and hence the information provided by the encoding remains unchanged in the transform). However, a more effective discrimination can exploit the difference in the magnitude of the frequency components: in a sense, we are interested (i) in abstracting from the length of the document, and (ii) in knowing whether a given subsequence (representing a subtree in the XML document) exhibits a certain regularity, no matter where the subsequence is located within the signal. In particular, we aim at considering as (almost) similar documents exhibiting the same subtrees, even if they appear at different positions. Now, as the encoding guarantees that each relevant subsequence is associated with a group of frequency components, the comparison of their magnitudes allows the detection of similarities and differences between documents. Observe that measuring the energy of the difference signal would result in a low similarity value. On the other hand, if the phases of the documents' transforms are disregarded, documents are more likely to be considered similar. A viable approximation can be the interpolation of the missing coefficients starting from the available ones. It is worth noticing that the approximation error due to interpolation is inversely proportional to min(Ndi, Ndj): the more elements are available in a document d, the better the DFT approximates the (continuous) Fourier transform of the signal hd(t), and consequently the higher the degree of reliability of the interpolation.
As a practical consequence, the approach is expected to exhibit good results with large documents, providing poorer performance with small documents.

Definition 1. Let d1, d2 be two XML documents, and enc a document encoding function, such that h1 = enc(d1) and h2 = enc(d2). Let DFT be the Discrete Fourier Transform of the (normalized) signals. We define the Discrete Fourier Transform distance of the documents as the approximation of the difference of the magnitudes of the DFT of the two encoded documents:

    dist(d1, d2) = ( Σ_{k=1}^{M/2} ( |[~DFT(h1)](k)| − |[~DFT(h2)](k)| )² )^(1/2)

where ~DFT is an interpolation of DFT to the frequencies appearing in both d1 and d2, and M is the total number of points appearing in the interpolation, i.e., M = Nd1 + Nd2 − 1 points.

3.1 Outlier Identification

Having defined a technique to state the similarity between two XML documents, we need to define a strategy that, exploiting such a technique, identifies anomalies in the data set.

Definition 2 (Fourier-Based XML Outlier). Given a set of XML documents S, a positive integer k, and a positive real number R, a document d ∈ S is a DB(k, R)-outlier, or a distance-based outlier with respect to parameters k and R, if fewer than k objects in S lie within distance R from d w.r.t. our distance metric.

The threshold values R and k have to be chosen depending on the scenario being monitored. Having defined our notion of outlier, we can design an effective method for outlier detection.

Algorithm 1 (Function Compute_Outlier)
INPUT: a set of XML documents S = {d1, ..., dn}, a pair of threshold values R and k, an XML document dnew;
OUTPUT: Yes if dnew is an outlier, No otherwise.
begin
  temp = 0
  for each di ∈ S do
    dist = computeDFTDistance(di, dnew)
    if dist > R then
      temp = temp + 1
      if temp > k then return Yes
  return No
end

Function computeDFTDistance evaluates the DFT distance between the XML document being analyzed and the XML documents previously collected.
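The distance and the DB(k, R) check of Algorithm 1 can be sketched as follows. This is a simplified reading, not the authors' implementation: NumPy's FFT is used, and zero-padding both signals to a common length stands in for the interpolation step of Definition 1:

```python
import numpy as np

def dft_distance(h1, h2):
    """Approximate DFT distance: compare the magnitude spectra of the
    two encoded signals. Zero-padding to M = N1 + N2 - 1 points is a
    simplifying assumption in place of the interpolation of Definition 1."""
    m = len(h1) + len(h2) - 1
    mag1 = np.abs(np.fft.fft(h1, n=m))[1 : m // 2 + 1]  # skip the DC term
    mag2 = np.abs(np.fft.fft(h2, n=m))[1 : m // 2 + 1]
    return float(np.sqrt(np.sum((mag1 - mag2) ** 2)))

def is_outlier(signals, h_new, R, k):
    """DB(k, R) check as in Algorithm 1: count stored documents whose
    distance from the new one exceeds R."""
    temp = 0
    for h in signals:
        if dft_distance(h, h_new) > R:
            temp += 1
            if temp > k:
                return True
    return False
```

Because only magnitudes are compared, phase (and hence the exact position of a subtree within the signal) is disregarded, as the discussion above requires.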
If the computed distance is greater than the threshold distance set by the user, we increase an auxiliary variable temp. If temp becomes greater than the threshold value k, the document is marked as outlying.

Proposition 1. Algorithm 1 works in time O(|S| · N log N).

The running time can be trivially computed by observing that for each document being analyzed we have to compute the Fourier transform, and this operation, performed in O(N log N) time, is the dominant operation of the algorithm.

4 Experimental Results

In this section, we present the experiments we performed to assess the effectiveness of the proposed approach in detecting outliers. To this purpose, a collection of tests is performed, and for each test some relevant groups of homogeneous documents (document classes) are considered. The direct result of each test is a similarity matrix representing the degree of similarity for each pair of documents in the data set, together with the number of detected outliers. The evaluation of the results relies on some a priori knowledge about the document classes being used, obtained from domain experts or available from the dataset providers. We performed several experiments on a wide variety of real datasets; due to space limitations, we report here the results on the following data. The documents used belong to two main classes: 1) Astronomy, a data set containing 217 documents extracted from an XML-based metadata repository, which describes an archive of publications owned by the Astronomical Data Center at NASA/GSFC (http://adc.gsfc.nasa.gov/); 2) Sigmod, a data set composed of 51 XML documents containing issues of SIGMOD Record. Such documents were obtained from the XML version of the ACM SIGMOD Web site. For each class we added some outlying documents by perturbing the original DTDs. We compared our approach with the one proposed in [16], which we refer to as Noise.
In order to perform a simple quantitative analysis, we produce for each test a similarity matrix, aimed at evaluating the resulting neighbor similarities (i.e., the average of the values computed for documents belonging to the same class), and at comparing them with the outer similarities (i.e., the similarity computed by considering only documents belonging to different classes). To this purpose, the values inside the matrix can be aggregated according to the class of membership of the related elements: given a set of documents belonging to n prior classes, a similarity matrix S about these documents can be summarized by an n × n matrix CS, where the generic element CS(i, j) represents the average similarity between class i and class j:

    CS(i, j) = ( Σ_{x,y∈Ci, x≠y} DIST(x, y) ) / ( |Ci| × (|Ci| − 1) )   if i = j
    CS(i, j) = ( Σ_{x∈Ci, y∈Cj} DIST(x, y) ) / ( |Ci| × |Cj| )          otherwise

where DIST(x, y) is the chosen distance metric (the Noise metric or our Fourier metric). The above definition is significant because it normalizes the metric values, so that different approaches can be compared, each in its ideal setting. The higher the values on the diagonal of the corresponding CS matrix are w.r.t. those outside the diagonal, the higher the ability of the similarity measure to separate different classes. In the following we report a similarity matrix for each dataset being considered; as will become clear, the reported results show that our technique is quite effective for outlier detection. In particular, the similarity matrix gives an intuition about the ability of the approach to catch the neighboring documents for each class, while the number of outliers detected for each dataset is reported in a separate table. We used the following parameter values for the experiments: as k, the maximum number of documents supposed to belong to each class; as R, the average distance between documents belonging to the same class.
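The class-level aggregation above can be sketched directly from its definition (the pairwise matrix and class labels in the example are illustrative, not experimental data):

```python
import numpy as np

def class_similarity_matrix(S, classes):
    """Aggregate a pairwise matrix S into an n x n class matrix CS.
    classes[i] is the class index (0..n-1) of document i; each class is
    assumed to contain at least two documents, as in the definition."""
    n = max(classes) + 1
    CS = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xs = [a for a, c in enumerate(classes) if c == i]
            ys = [b for b, c in enumerate(classes) if c == j]
            if i == j:
                pairs = [(a, b) for a in xs for b in ys if a != b]
            else:
                pairs = [(a, b) for a in xs for b in ys]
            CS[i, j] = sum(S[a, b] for a, b in pairs) / len(pairs)
    return CS

# Illustrative 4-document example: documents 0,1 in class 0; 2,3 in class 1.
S = np.array([[1.0, 0.9, 0.2, 0.3],
              [0.9, 1.0, 0.25, 0.35],
              [0.2, 0.25, 1.0, 0.8],
              [0.3, 0.35, 0.8, 1.0]])
CS = class_similarity_matrix(S, [0, 0, 1, 1])
```

A well-separated dataset shows up exactly as described in the text: diagonal entries of CS dominate the off-diagonal ones.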
Measuring Effectiveness for Astronomy. For this dataset, our prior knowledge is the partition of the documents into two classes. As is easy to see in Figure 2(a) and (b), Fourier outperforms Noise by allowing a perfect assignment of the proper neighboring class to each document.

    Noise     Class 1  Class 2        Fourier   Class 1  Class 2
    Class 1   0.6250   1              Class 1   0.9790   0.8528
    Class 2   1        0.6250         Class 2   0.8528   0.9915
             (a)                               (b)

Fig. 2. Noise and Fourier similarity matrices for the Astronomy dataset

In Figure 3 the number of detected outliers is reported. The actual number of outliers for each class is 7, so Fourier detected exactly all the outliers in the dataset; such a result is quite understandable considering that the similarity matrix for Fourier exactly recognized the neighboring documents.

    Method    Class 1  Class 2
    Noise     5        4
    Fourier   7        7

Fig. 3. Noise and Fourier number of detected outliers for the Astronomy dataset

Measuring Effectiveness for Sigmod. In this case there were 3 main classes, as shown in Figure 4(a) and (b). Also in this case Fourier outperforms Noise. As we can see, the differences among the various classes are marked with higher precision by Fourier. This is mainly due to the fact that our approach is quite discriminative, since it takes into account all the document features. For this dataset the number of actual outliers was 8 for each class; as is easy to see in Figure 5, Fourier still outperforms Noise.

    Noise     Class 1  Class 2  Class 3        Fourier   Class 1  Class 2  Class 3
    Class 1   0.9986   0.7759   0.7055         Class 1   0.9885   0.7439   0.7108
    Class 2   0.7759   0.9889   0.7566         Class 2   0.7439   0.9899   0.7223
    Class 3   0.7055   0.7566   0.9920         Class 3   0.7108   0.7223   0.9874
             (a)                                        (b)

Fig. 4. Noise and Fourier similarity matrices for the Sigmod dataset

    Method    Class 1  Class 2  Class 3
    Noise     6        4        5
    Fourier   7        8        8

Fig. 5.
Noise and Fourier number of detected outliers for the Sigmod dataset

5 Conclusion

In this paper we addressed the problem of detecting outliers in XML data. The technique we have proposed is mainly based on the idea of representing a document as a signal. Thereby, the similarity between two documents can be computed by analyzing their Fourier transforms, thus defining a distance measure that can be exploited to define distance-based outliers. Experimental results showed the effectiveness of the approach in detecting outlying XML documents.

References
1. C.C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In SIGMOD'01, 2001.
2. R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In FODO'93, pages 69–84, 1993.
3. F. Angiulli and F. Fassetti. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. TKDD, 3(1), 2009.
4. A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In KDD'96, pages 164–169, 1996.
5. V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
6. M.M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In SIGMOD'00, 2000.
7. S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese. Fast detection of XML structural similarity. IEEE TKDE, 17(2):160–175, 2005.
8. D. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall, 1980.
9. H. Kashima and T. Koyanagi. Kernels for semi-structured data. In Procs. Int. Conf. on Machine Learning (ICML'02), pages 291–298, 2002.
10. J.L.Y. Koh, M.L. Lee, W. Hsu, and W.T. Ang. Correlation-based attribute outlier detection in XML. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 1522–1524, Washington, DC, USA, 2008. IEEE Computer Society.
11. A. Nierman and H.V. Jagadish. Evaluating structural similarity in XML documents.
In Procs. 5th Int. Workshop on the Web and Databases (WebDB 2002), 2002.
12. S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In ICDE'03, pages 315–326, 2003.
13. D. Rafiei and A. Mendelzon. Efficient retrieval of similar time series. In FODO'98, 1998.
14. C.M. Teng. Polishing blemishes: Issues in data correction. IEEE Intelligent Systems, 19:34–39, 2004.
15. M. Weis and F. Naumann. DogmatiX tracks down duplicates in XML. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD '05, pages 431–442, New York, NY, USA, 2005. ACM.
16. X. Zhu and X. Wu. Class noise vs. attribute noise: a quantitative study of their impacts. Artif. Intell. Rev., 22:177–210, November 2004.

P2P support for OWL-S discovery

Domenico Redavid (Artificial Brain S.r.l., Bari, Italy), Stefano Ferilli and Floriana Esposito (Computer Science Department, University of Bari "Aldo Moro", Italy)
[email protected], {ferilli, esposito}@di.uniba.it

Abstract. The discovery of Web services is often influenced by the rigid structure of registries containing their XML description. In recent years, some methods that replace the traditional UDDI registry with Peer-to-Peer networks for the creation of catalogs of Web services have been proposed, in order to make this structure flexible and usable. This paper proposes a different view by placing the semantic description of services as content of P2P networks and showing that all the information needed for an efficient Web service discovery is already contained in its OWL-S description.

1 Introduction

The discovery of Web services (WS) is achieved through Universal Description, Discovery and Integration (UDDI), which provides a standard mechanism to register and search WS descriptions.
A UDDI registry is an indexed database that contains instances of Web Services Description Language (WSDL) descriptions, in turn based on the eXtensible Markup Language (XML) and independent from hardware platforms. A requester needing to use a service queries the UDDI registry to find the one that best meets its needs. The registry returns an access point and a WSDL description, which are then used by the requester to build the SOAP messages needed to communicate with the actual service. The UDDI registry is supported by a worldwide network of nodes connected in a federation. When a client sends information to a registry, it is propagated to the other nodes. In this way data redundancy is implemented, providing a certain degree of reliability. However, data replication implies lower consistency and is not a scalable approach. Another limitation of UDDI is the search mechanism: it can focus only on a single search criterion, such as name, location, category of business, etc. Within the Service-Oriented Architecture (SOA) [3], the registry has a role similar to yellow pages where a list of services can be found. To fully exploit the potential of this type of architecture, the registry should be consultable not only by humans but also by software systems that need to find, select and compose services in an automatic way. In recent years research has been focusing on Peer-to-Peer (P2P) technologies [2] that offer Distributed Hash Table (DHT) [11] functionalities.

(Specifications referenced above: W3C Web of Services - http://www.w3.org/standards/webofservices/; Universal Description, Discovery and Integration v3.0.2 (UDDI), OASIS Specification - http://uddi.org/pubs/uddi_v3.htm; Web Services Description Language (WSDL) Version 2.0 Part 0: Primer, W3C Recommendation 26 June 2007 - http://www.w3.org/TR/wsdl20-primer/; W3C XML Technology - http://www.w3.org/standards/xml/; SOAP Version 1.2 Part 0: Primer (Second Edition), W3C Recommendation 27 April 2007 - http://www.w3.org/TR/soap12-part0/)
A P2P network provides a typical distributed, decentralized approach where multiple computers are interconnected and communicate by exchanging messages. A DHT partitions the items of a key set among the participating nodes, and can send messages to the owner of a given key in an efficient manner. A P2P network with DHT support is scalable and solves the problem of data redundancy, but supports only exact match on keywords. The inclusion in the P2P network of references to service semantics could be a turning point, because such information could be exploited for the automatic discovery of services with the help of semantic matchmaking techniques. In this paper we discuss how this vision can be realized. Section 2 introduces the basic concepts related to WS registries and catalogs based on P2P protocols, and the OWL-S language for the representation of the semantics associated with services. Section 3 describes an implementation of the P2P network created by means of the Open Chord API and OWL-S. Finally, Section 4 presents an analysis of the potential of the proposed approach.

2 Background

2.1 Web service registries

Web services are software systems identified by means of a Web address and designed to support interoperability between computers on a network. They have public interfaces defined and described as XML documents in a format, such as WSDL, that can be processed by a machine in an automatic way. Their definitions can be retrieved by other software systems, which can directly interact with the Web service operations described in the interface by activating the appropriate messages enclosed in a SOAP envelope. These messages are usually transported via the Hypertext Transfer Protocol (HTTP) and formatted according to the XML standard. For the purposes of this discussion, it is important to point out the current approaches to the organization of Web services. These approaches can be broadly classified as centralized or decentralized.
The traditional centralized approach includes UDDI, where a central registry is used to store descriptions of Web services. The current UDDI approach attempts to mitigate the disadvantages of centralization by replicating the entire information on different sites. Replication, however, may improve performance only as long as the number of UDDI users is limited: as the number of replicated sites grows, the consistency of the duplicated data decreases. The replication of UDDI data is not a scalable approach. For this reason, different approaches based on decentralized registries have been proposed in order to connect individual customers through a P2P network. Since this technology organizes peers into a hypercube, management becomes inefficient in the presence of a large amount of data. A solution to this problem is given by [14], where a method is presented for reducing the size of the indexing scheme that maps the multidimensional information space onto physical peers. However, this method does not use semantic descriptions. Web service discovery based on P2P is also discussed in [18] and [5], where ad hoc model frameworks are proposed for this purpose. As a starting point, in the next section we will analyze a proposed approach that combines ontologies and P2P based on DHT as a sophisticated solution to these issues.

2.2 Web service catalog system based on DHT

Without a central registry, the easiest way to find out the location of a service in a distributed system is to send the query to each participant (service provider). While this approach might work for a small number of service providers, it is certainly not scalable in a large distributed system. When a system includes thousands of nodes, facilities are needed that allow the selection of a subset of nodes that will be fitted with the functionality exposed in the catalogs. The new generation of P2P systems includes complete DHT features [11] for decentralized applications.
Some groups have proposed innovative approaches such as CAN [11], Pastry [13] and Chord [17], which eliminate the defects of the first P2P systems like Gnutella and Napster. Although they are implemented in different ways, all these systems have interfaces to support access to the DHT. These interfaces allow requesting shared information. In contrast to UDDI, the P2P network content is usually described and indexed locally within each peer, while search queries are propagated through the network. A central index that spans the whole network is not required. Given a key, the corresponding data items can be efficiently located using up to O(log n) network messages, where n is the total number of nodes in the system [17]. In addition, distributed systems evolve while remaining scalable to a large number of nodes. Current efforts are directed towards this functionality, in order to provide a catalog of services that is fully distributed and scalable. The approach chosen for the purposes of this paper is Chord, because it proposes an original approach to the problem of efficient location and is able to keep the bandwidth close to optimal while managing the arrival and departure of concurrent nodes [8]. Chord uses routed queries to locate a key, minimizing the number of nodes visited even in large networks. What distinguishes Chord from other P2P applications is its ease of use and provable performance and correctness. In essence, Chord supports one operation: given a key, it is mapped onto a node. Data location can be implemented by associating each key with a datum. In detail, Chord routes a key through a sequence of O(log n) other nodes toward the destination. This requires that a Chord node have information about O(log n) other nodes for efficient routing. When this information is out of date, performance is proven to degrade gracefully.
This is important in practice because nodes will join and leave arbitrarily, and consistency about O(log n) nodes may be hard to maintain. Only one piece of information per node needs to be correct in order to guarantee correct routing of queries. (Gnutella Web site - http://rfc-gnutella.sourceforge.net/; Napster Web site - http://free.napster.com/) The Chord protocol uses the SHA-1 hash function to assign an m-bit identifier to each node and key. Furthermore, it uses consistent hash functions that allow nodes to leave and enter the network with minimal disruption [6]. The integer m is chosen large enough to make the probability that two nodes (or keys) receive the same identifier negligible. The hash function computes a node's identifier by hashing the node's IP address. The identifiers of nodes and keys are arranged in a circular ring of size 2^m, called the Chord ring; the identifiers on the Chord ring are numbered from 0 to 2^m − 1. A key is assigned to the node whose identifier is equal to or greater than the key identifier. This node is called the successor of key k, and it is the first node at or after k on the circle in a clockwise direction. When a node n wants to find a certain key k, it uses a lookup function that returns the successor of n if k is between n and its successor, and otherwise forwards the query along the circle. Furthermore, in order to provide more efficient lookup, parts of the routing information are stored in the nodes. In particular, each node n maintains a routing table with at most m entries (where m is the number of bits in the identifiers), called the finger table. Stabilization primitives are used to keep the finger tables, as well as the Chord ring itself, up to date. This structure can be used to create Web service catalogs. For example, in [19] each node in the system is a service provider or requester, so that both these actors are connected together in the Chord ring.
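The identifier assignment and successor rule described above can be sketched as follows. The identifier space is deliberately tiny (m = 8) and the node names are illustrative; also note that this linear scan over all nodes is an O(n) simplification, whereas Chord's finger tables achieve O(log n) lookups:

```python
import hashlib

M = 8  # assumed m = 8 bits, i.e. a ring of size 2^8, for illustration only

def chord_id(name):
    """Assign an m-bit identifier by hashing a node address or key with SHA-1."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

def successor(node_ids, key_id):
    """The node responsible for key_id: the first node identifier equal to
    or greater than key_id, wrapping clockwise past 2^M - 1."""
    for n in sorted(node_ids):
        if n >= key_id:
            return n
    return min(node_ids)  # wrap around the ring

nodes = [chord_id(f"peer{i}.example.org") for i in range(4)]
owner = successor(nodes, chord_id("BOOK"))  # node that stores key "BOOK"
```

The wrap-around branch is what makes the identifier space a ring rather than a line: a key larger than every node identifier belongs to the smallest node identifier.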
When a service provider Ni wants to publish a service, it creates the service catalog item, i.e., the tuple C = (Key, Summary). The Chord protocol routes the catalog information to the corresponding node of the system, in accordance with the key in the catalog. Thus, each node in the system contains part of the catalog information, and all the nodes together constitute the global catalog system, implementing the functionality of a traditional UDDI registry. With WSDL, a Web service can be expressed as a set of operations, each of which implements a certain amount of functionality. An operation is specified by its name and the types of its input and output messages. The service name is used as the key of the catalog information for the DHT hashing algorithm. In line with this, the operations included in the service, and the messages associated with these operations, are used as the summary. The catalog entry for a Web service WSi has the structure CWSi = (Key, Summary, N), where:
– Key is the name of WSi,
– Summary contains the operations included in WSi and their messages,
– N is the node that publishes WSi.
The same paper [19] proposes a mapping between the information contained in the nodes (related to the WSDL) and the ontology classes that represent them in OWL-S services. In contrast, our approach foresees that the information contained in the catalogs is taken directly from the OWL-S descriptions already available online.

2.3 Web Ontology Language for Services (OWL-S)

Semantic Web Services [9] provide an ontological framework for describing services, messages, and concepts in a machine-readable format, enabling logical reasoning on service descriptions. The Web Ontology Language for Services (OWL-S) provides a Semantic Web Services framework on which an abstract description of a service can be formalised. It is an upper ontology described with OWL whose root class is Service; therefore, every described service maps onto an instance of this concept.
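The catalog tuple CWSi = (Key, Summary, N) and its placement on the ring can be sketched as follows. The service name, operation, and node address are hypothetical examples, not taken from [19]:

```python
import hashlib

def make_catalog_entry(service_name, operations, publisher_node):
    """Build the catalog tuple C_WSi = (Key, Summary, N) described above."""
    return {
        "Key": service_name,      # the service name, hashed by the DHT
        "Summary": operations,    # operations and their message types
        "N": publisher_node,      # the node that publishes the service
    }

entry = make_catalog_entry(
    "BookPriceService",  # hypothetical service
    {"getPrice": {"input": "BookTitle", "output": "Price"}},
    "peer3.example.org",
)
# The DHT routes the entry to the node responsible for hash(Key)
# (8-bit identifier space, for illustration):
ring_position = int(hashlib.sha1(entry["Key"].encode()).hexdigest(), 16) % (2 ** 8)
```

Because every node holds only the entries whose key hashes fall in its arc of the ring, the union of all nodes' entries forms the global catalog, as stated above.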
The upper-level Service class is associated with three other classes:

Service Profile. The service profile specifies the functionality of a service. This concept is the top-level starting point for the customizations of the OWL-S model that support the retrieval of suitable services based on their semantic description. It describes the service by providing several types of information:
• Human-readable information: such as the service description, service name, contact information, etc.;
• Functionalities: i.e., parameter type identifiers, identifiers for the input and output parameters of service methods, preconditions, results, etc.;
• Service parameters: which include parameter identifiers (e.g., name, value) used by the service;
• Service categories: these include identifiers for defining the category of the service, i.e., category name, taxonomy, value, code.

Service Model. The service model exposes to clients how to use the service, by detailing the semantic content of requests, the conditions under which particular outcomes will occur, and, where necessary, the step-by-step processes leading to those outcomes. In other words, it describes how to ask for (invoke) the service and what happens when the service is carried out. From the point of view of the processes, the service model defines the concept Process, which describes the composition of one or more services in terms of their constituent processes. A Process can be atomic, composite or simple: an atomic process is a description of a non-decomposable service that expects one message and returns one message in response; a composite process consists of a set of processes within some control structures defining a workflow; a simple process provides a service abstraction that allows viewing a composite service as an atomic one.
Each process can have any number of inputs, a set of preconditions, all of which must hold for the process to be successfully invoked, and any number of results (outputs and/or effects) that come from a successful execution of the service.

Service Grounding. A grounding is a mapping from an abstract to a concrete specification of those service description elements that are required for interacting with the service. In general, a grounding indicates a communication protocol, a message format and other service-specific details (e.g., port numbers, the serialization techniques of inputs and outputs, etc.). From the point of view of processes, a service grounding enables the transformation of the inputs and outputs of an atomic process into concrete atomic process grounding constructs. (OWL Web Ontology Language, W3C Recommendation 10 February 2004 - http://www.w3.org/TR/owl-features/)

Fig. 1. Schema mapping WSDL-OWL-S

As we can see from Figure 1, the OWL-S grounding maps the semantic description of the service onto the corresponding WSDL. This means that the information contained in the Summary of the catalog shown in the previous section can be directly obtained from OWL-S. Since each OWL-S instance, as well as all its constituent parts, has its own URI, such information is always available online.

3 OWL-S discovery with P2P

3.1 Open Chord

Open Chord (http://open-chord.sourceforge.net/) is an open source implementation of Chord. Its architecture consists of three levels (Figure 2). On the lower level is located the implementation of the communication protocol used (Communication Layer), based on a network protocol (such as Java sockets). Currently, two implementations are provided: a local communication protocol, developed for testing purposes, and a socket-based protocol, which provides reliable communication between Open Chord peers based on TCP/IP sockets.

Fig. 2.
Open CHORD architecture The abstraction level (Communication abstraction layer ) provides two abstract classes that must be implemented to realize the communication protocol: – Proxy, that represents a reference to remote peers in the Open Chord overlay network. – Endpoint, that provides a connection point for remote peers conform to a specific communication protocol. Concrete implementations for a communication protocol are determined with help of the URL of a peer. The Chord logic level, which implements the functionality of Chord, offers two interfaces for Java applications that abstract from the implementation of the Chord DHT routing. Both interfaces (i.e., Chord and AsynChord ), which can be used by an application built on-top of Open Chord to retrieve, remove, and store data in synchronous and asynchronous way from/to the underlying DHT, provide some common methods that are important to create, join, and leave an Open Chord DHT. The Chord logic level is also responsible for data replication and maintenance of the necessary properties to keep running the DHT, as described in [17]. 3.2 A prototype implementation To simulate an OWL-S P2P network using the Open Chord API a simple graphical application that displays a drop down menu consisting of the items File, Edit, and View has been developed. By selecting ’Create Network’ from ’File’ menu a 60 Fig. 3. Screenshot of the prototype single peer will be created, consisting simply of a new URL. Only the first node has the ability to create a new network, to add other nodes the join method of the interface Chord will be invoked. This method works similarly to the method used to create the network, but in addition to the node that is to be added, an existing URL, that is already part of the network, is required. This is called bootstrap peer. 
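Chord, on which Open Chord builds [17], places peers and keys on the same identifier ring via consistent hashing [6]: a key is stored at its successor, the first peer whose identifier follows it clockwise. A minimal sketch of this mapping (the peer URLs and the 16-bit ring below are illustrative; this is not the Open Chord API):

```python
import hashlib

def chord_id(key: str, m: int = 16) -> int:
    """Map a string to an m-bit identifier on the ring (m = 16 kept small for readability)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids: list[int], kid: int) -> int:
    """Return the first node identifier that follows kid clockwise on the ring."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= kid:
            return nid
    return ring[0]  # wrap around past the largest identifier

# Three peers identified by their URLs, as in the prototype (URLs are made up).
peers = ["ocsocket://host1:8080/", "ocsocket://host2:8080/", "ocsocket://host3:8080/"]
ids = {chord_id(p): p for p in peers}

# A key (e.g. an OWL-S output concept) is stored at its successor peer.
key = chord_id("BOOK")
responsible = ids[successor(list(ids), key)]
```

When a peer joins or leaves, only the keys between that peer and its successor need to be reassigned, which is what lets the DHT scale.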
To test the operation of the P2P network, we took a set of services from the OWLS-TC dataset (OWLS-TC service retrieval test collection: http://projects.semwebcentral.org/projects/owls-tc), placed them in a local folder and inserted them in the network using the function 'Create nodes with' from the 'File' menu, which automatically creates a number of nodes equal to the number of files in the local folder. Subsequently, by selecting 'Insert' from the 'Edit' menu, a dialog appears that allows new nodes to be inserted in the network individually, specifying the URL of the bootstrap node and the URL of the new node, which must be different from those used by the other nodes in the network. The next step is the population of the network. When working with a DHT, the choice of the key is a fundamental step. Our key is the output value of the OWL-S profile of the selected service, extracted by parsing the service profile available online. The value we associate with the key is the URI of the OWL-S service itself. Selecting 'Populate the Network' from the 'File' menu executes this procedure automatically for all services contained in the local folder. The synchronous retrieval of the value associated with a key is carried out by invoking the Open Chord method retrieve(Key). The result is an array of strings containing the URLs of zero or more OWL-S services, depending on whether the searched key is the output of one or more services inserted in the network.

Fig. 4. An example of results with BOOK as key

Figure 4 shows the results obtained for the key BOOK using the 'Search...' pop-up opened from the 'Modify' menu. By applying methods that use lexical ontologies (e.g., WordNet), the synonyms of the key can be obtained. Invoking the method retrieve(Key) on the synonyms finds those services that do not have the initial key as output, providing a solution to the problem of exact matching between the searched key and the output of the service.
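The synonym-based relaxation of exact key matching can be prototyped with an in-memory stand-in for the DHT; the synonym table is hand-written here, where the prototype would derive it from WordNet (all names and URIs below are illustrative, not the real API):

```python
# In-memory stand-in for the DHT: key -> URIs of OWL-S services whose
# profile declares that key as output. retrieve() mimics the synchronous
# retrieve(Key) call described in the text.
dht = {
    "BOOK": ["http://example.org/services/BookFinder.owls"],
    "NOVEL": ["http://example.org/services/NovelShop.owls"],
}

# Hand-written synonym table; in the prototype WordNet would supply it.
synonyms = {"BOOK": ["NOVEL", "VOLUME"]}

def retrieve(key: str) -> list[str]:
    return dht.get(key, [])

def retrieve_expanded(key: str) -> list[str]:
    """Query the DHT for the key and for each of its synonyms,
    relaxing the exact-match constraint between query and service output."""
    results = list(retrieve(key))
    for syn in synonyms.get(key, []):
        results.extend(retrieve(syn))
    return results
```

A query for BOOK then also returns the service whose declared output is NOVEL, while a synonym absent from the network (VOLUME) simply contributes nothing.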
4 Discussion

The use of P2P networks for SWS discovery opens the way to the application of intelligent methods to satisfy the requests of Web users. The availability of semantic descriptions of services enables new scenarios in which the weight of the reasoning needed to attain a goal moves increasingly towards software systems. In the scenario shown in Figure 5, a user queries a software agent capable of interpreting natural language, asking it to find a service that returns a certain result (a goal). The agent uses the P2P network to discover services that may be suitable to meet the demand. Since the P2P network returns the semantic descriptions of services, the agent can apply automated reasoning methods to select and compose the most appropriate services with respect to the available user inputs. If the services are described using different ontologies, it will use semantic alignment tools (e.g., the Alignment API and Alignment Server: http://alignapi.gforge.inria.fr/) and approaches from the literature [4, 1] during the execution of these operations.

(WordNet, a lexical database for English: http://wordnet.princeton.edu/)

Fig. 5. Scenario

This scenario describes the automatic orchestration of SWS and is particularly suitable for use with OWL-S [12]. Looking in particular at the discovery use case, there are various matchmaking techniques that exploit the OWL-S description to determine which services are best suited to fulfil the request. Srinivasan et al. [16] propose an enhancement of the CODE OWL-S IDE [15] whose matching procedure is based on the algorithm described in [10], which defines a flexible matching mechanism based on subsumption in Description Logics. More sophisticated solutions are provided by OWLS-MX [7], a hybrid Semantic Web Service matchmaker that retrieves services for a query itself written in OWL-S.
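For reference, the subsumption-based matching of [10] assigns a requested/advertised output pair one of four degrees: exact, plug-in (the advertisement is more general), subsumes (the advertisement is more specific) or fail. A toy sketch over a hand-coded concept hierarchy (the hierarchy and function names are our illustration, not the OWLS-MX implementation):

```python
# Toy subclass hierarchy: child -> parent (single inheritance for brevity).
HIERARCHY = {"Novel": "Book", "Book": "PrintedMaterial", "PrintedMaterial": "Thing"}

def ancestors(concept: str) -> set[str]:
    """All superclasses of a concept under the toy hierarchy."""
    out = set()
    while concept in HIERARCHY:
        concept = HIERARCHY[concept]
        out.add(concept)
    return out

def degree_of_match(requested: str, advertised: str) -> str:
    """Discrete degrees in the spirit of [10]: exact > plug-in > subsumes > fail."""
    if requested == advertised:
        return "exact"
    if advertised in ancestors(requested):   # advertisement is more general
        return "plug-in"
    if requested in ancestors(advertised):   # advertisement is more specific
        return "subsumes"
    return "fail"
```

OWLS-MX complements such logic-based degrees with syntactic similarity scores, which is what makes it a hybrid matchmaker.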
In other words, for every OWL-S service representing the description of the desired service (the query), it returns an ordered set of relevant services ranked according to their degree of (syntactic and/or semantic) match with the query. This approach complements logic-based reasoning with approximate matching relying on Information Retrieval metrics. Figure 6 illustrates the graphical user interface (GUI) of a software component that we developed to support testing matchmaking on the retrieved services. At the top there are two text fields designed for inserting the user inputs and the searched output of the service, respectively. Pressing the Ok button processes the request. The results vary depending on the chosen sort order (syntactic, semantic or hybrid), reflecting those available in the OWLS-MX API (http://www.semwebcentral.org/projects/owls-mx/). Finally, clicking on a listed service displays the following information: name, URI, inputs and outputs.

Fig. 6. Service discovery GUI

The work presented in this paper represents only a starting point towards a SWS discovery system based solely on the semantic descriptions of services. Future work includes extensive use of the annotations included in the OWL-S profile for the selection of the services that best meet the user needs. For this purpose, domain ontology matchmaking methods will be combined with lexical ontology based approaches in order to analyze the textual descriptions of the service during the discovery process.

References

[1] David, J., Euzenat, J., Scharffe, F., dos Santos, C.T.: The Alignment API 4.0. Semantic Web 2(1), 3–10 (2011)
[2] Doyle, J.F.: Peer-to-peer: harnessing the power of disruptive technologies. Ubiquity 2001 (May 2001)
[3] Erl, T.: Service-Oriented Architecture: Concepts, Technology, and Design. Prentice Hall PTR, Upper Saddle River, NJ, USA (2005)
[4] Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag, Heidelberg (DE) (2007)
[5] Gharzouli, M., Boufaida, M.: PM4SWS: A P2P model for semantic web services discovery and composition. Journal of Advances in Information Technology 2(1) (2011)
[6] Karger, D.R., Lehman, E., Leighton, F.T., Panigrahy, R., Levine, M.S., Lewin, D.: Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In: STOC. pp. 654–663 (1997)
[7] Klusch, M., Fries, B., Sycara, K.: Automated semantic web service discovery with OWLS-MX. In: AAMAS '06: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems. pp. 915–922. ACM Press, New York, NY, USA (2006)
[8] Liben-Nowell, D., Balakrishnan, H., Karger, D.R.: Observations on the dynamic evolution of peer-to-peer networks. In: Druschel, P., Kaashoek, M.F., Rowstron, A.I.T. (eds.) IPTPS. Lecture Notes in Computer Science, vol. 2429, pp. 22–33. Springer (2002)
[9] McIlraith, S.A., Son, T.C., Zeng, H.: Semantic Web Services. IEEE Intelligent Systems 16(2), 46–53 (2001)
[10] Paolucci, M., Kawamura, T., Payne, T.R., Sycara, K.P.: Semantic Matching of Web Services Capabilities. In: ISWC '02: Proceedings of the First International Semantic Web Conference. pp. 333–347. Springer-Verlag, London, UK (2002)
[11] Ratnasamy, S., Francis, P., Handley, M., Karp, R.M., Shenker, S.: A scalable content-addressable network. In: SIGCOMM. pp. 161–172 (2001)
[12] Redavid, D., Esposito, F., Iannone, L.: A comparative study on semantic web services frameworks from the dynamic orchestration perspective. In: Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD). pp. 355–359 (October 2010)
[13] Rowstron, A.I.T., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware. Lecture Notes in Computer Science, vol. 2218, pp. 329–350. Springer (2001)
[14] Schmidt, C., Parashar, M.: Flexible information discovery in decentralized distributed systems. In: HPDC. pp. 226–235. IEEE Computer Society (2003)
[15] Srinivasan, N., Paolucci, M., Sycara, K.: CODE: A Development Environment for OWL-S Web Services. Tech. Rep. CMU-RI-TR-05-48, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (October 2005)
[16] Srinivasan, N., Paolucci, M., Sycara, K.: Semantic Web Service Discovery in the OWL-S IDE. In: HICSS '06: Proceedings of the 39th Annual Hawaii International Conference on System Sciences. p. 109.2. IEEE Computer Society, Washington, DC, USA (2006)
[17] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Trans. Netw. 11(1), 17–32 (2003)
[18] Xu, B., Chen, D.: Semantic web services discovery in P2P environment. In: ICPP Workshops. p. 60. IEEE Computer Society (2007)
[19] Yu, S., Zhu, Q., Xia, X., Le, J.: A novel web service catalog system supporting distributed service publication and discovery. In: Ni, J., Dongarra, J. (eds.) IMSCCS (1). pp. 595–602. IEEE Computer Society (2006)

Marine Traffic Engineering through Relational Data Mining

Antonio Bruno (1) and Annalisa Appice (1, 2)
1 Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, via Orabona 4, 70126 Bari, Italy
2 CILA (Centro Interdipartimentale per la ricerca in Logica e Applicazioni)
[email protected], [email protected]

Abstract. The automatic discovery of maritime traffic models can provide useful information for the identification, tracking and monitoring of vessels. Frequent patterns are a means to build human-understandable representations of maritime traffic models. This paper describes the application of a multi-relational method of frequent pattern discovery to marine traffic investigation.
Multi-relational data mining is required here because of the variety of the data and the multiplicity of the vessel positions (latitude-longitude) continuously transmitted by the AIS (Automatic Identification System) installed on shipboard. This variety of information leads to a relational (or complex) representation of the vessels which, moreover, naturally models the total temporal order over consecutive AIS transmissions of a vessel. The viability of relational frequent patterns as a model of maritime traffic is assessed on real navigation data collected in the Gulf of Taranto.

1 Introduction

Marine traffic engineering is a research field originally defined in the 1970s [10] with the aim of investigating marine traffic data and building a human-interpretable model of the maritime traffic. Through the understanding of this model, the Vessel Traffic Service (VTS) can improve port and fairway facilities as well as traffic regulation. Intuitively, the complexity of building a significant maritime traffic model resides in the requirement of a model able to reflect the spatial distribution and the temporal characteristics of the traffic flow. Although marine traffic engineering was a popular research field between the 1970s and the 1980s, after the 1990s the relevant literature and research projects in this field appeared less frequently. This lack of interest was due to the actual difficulty of collecting traffic data: the required observation time was long, and several technological limitations arose during the observation period. Today, the data collection problem is definitely overcome. The widespread use of the Automatic Identification System (AIS) has had a significant impact on maritime technology, and any VTS is now able to obtain a large volume of traffic information comprising the timestamped latitude and longitude of the monitored vessels.
On the other hand, the galloping developments in data mining research have paved the way to facing the problem of automatically analyzing this large volume of traffic data with the techniques now available, in order to extract the knowledge required to feed the marine traffic management service and the VTS decision-making systems. Both these factors, traffic data availability and data mining techniques, have boosted the recent renewed scientific interest in marine traffic engineering. Clustering [9], classification [5] and association rule discovery [11] techniques have been employed to analyze AIS data and discover characteristics and/or rules for marine traffic flow forecasting and for the development and programming of marine traffic engineering. Although these studies have proven that data mining techniques are able to provide extra aid for situational awareness in maritime traffic control, it is a fact that no marine traffic model described in these works is able to capture the truly temporal characteristics of each AIS transmission. In fact, AIS transmissions are timestamped, but a traditional data mining technique loses the time label of the AIS data and represents a navigation trajectory as a set, rather than a sequence, of consecutive latitude-longitude vessel positions. In this paper, we resort to multi-relational data mining to address the task of learning a human-interpretable model of the maritime traffic in sea ports, where several vessels are entering and leaving the port. The innovative contribution of this work is that, to the best of our knowledge, this is the first study in maritime traffic engineering which correctly spans the traffic data over several data tables (or relations) of a relational database and discovers relational patterns (i.e., patterns which may involve several relations at once) to describe the maritime traffic model.
In this multi-relational representation, we are able to model vessel data and AIS data as distinct relational data tables (one for each data type). This leads to distinguishing between the reference objects of analysis (vessel data) and the task-relevant objects (AIS data), and to representing their interactions. The modeled interactions also include the total temporal order over the AIS transmissions for the same vessel. SPADA [6] is a multi-relational data mining method that discovers relational patterns and association rules. Relational patterns extracted by SPADA have proved to be effective for capturing the behavioral model underlying census data [1] and workflow data [12]. In the case of traffic data, we use SPADA to discover interesting associations between a vessel (the reference object) and a navigation trajectory. Each navigation trajectory represents a spatio-temporal pattern obtained by tracing subsequent AIS transmissions (the task-relevant objects) of a vessel. This kind of spatio-temporal rule automatically identifies the well-travelled navigation courses. This information can be employed in several ways: to opportunely arrange the navigation traffic entering a gulf in order to avoid collisions or traffic jams, or to discover vessels which suspiciously deviate from the planned navigation course. The main limitation of SPADA in this application is its high computational complexity, which makes the analysis of large databases practically unfeasible. To overcome this limitation, we run the distributed version of SPADA described in [2].
In order to prove the viability of the multi-relational approach in marine traffic engineering, we describe a relational representation of the traffic data derived from monitoring vessels that entered and left the Gulf of Taranto (Southern Italy) between September 1, 2010 (00:04:23) and October 9, 2010 (23:58:52) (Section 2), and we briefly illustrate the multi-relational method for relational pattern discovery (Section 3). We then comment on the significance of the navigation traffic model we have extracted and on its viability in marine traffic engineering (Section 4). Finally, some conclusions are drawn.

2 Marine Traffic Data

For this study, we consider the navigation traffic data collected for 106 vessels entering and/or leaving the Gulf of Taranto between September 1, 2010 (00:04:23) and October 9, 2010 (23:58:52). The traffic data are obtained from [13]. As in [11], the area of the gulf is converted into a geographic grid of 0.005° × 0.005° squared cells. Each cell of the grid is then enumerated by a progressive number. For each vessel, the following data are collected:
– the name of the vessel,
– the MMSI, that is, a numeric code that unambiguously identifies the vessel,
– the vessel category, that is, wing, pleasure craft, tug, law enforcement, cargo, tanker or other, and
– the sequence of AIS messages sent by the transceiver installed on shipboard.
The AIS transceiver sends dynamic messages every two to thirty seconds depending on the vessel speed, and every three minutes while the vessel is at anchor. As we are interested in describing the observable change of the vessel position within the geographic grid, we decided to consider only those AIS transmissions which reflect a change of the cell occupied by the vessel.
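The conversion of a latitude-longitude pair into a progressive cell number can be sketched as follows; the paper fixes only the 0.005° × 0.005° resolution, so the grid origin, width and row-major numbering below are our assumptions:

```python
CELL = 0.005  # cell side in degrees, as in [11]

def cell_id(lat: float, lon: float,
            lat0: float = 40.0, lon0: float = 16.7, n_cols: int = 200) -> int:
    """Map (lat, lon) to a progressive cell number of a grid whose
    south-west corner is (lat0, lon0); numbering is row-major.
    The origin and the grid width n_cols are illustrative values."""
    row = int((lat - lat0) / CELL)
    col = int((lon - lon0) / CELL)
    return row * n_cols + col
```

Enumerating cells this way makes two positions comparable by a single integer, which is what lets the database keep only the transmissions that change the occupied cell.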
Each AIS message includes the following data:
– the vessel MMSI;
– the received time (day-month-year hour-minutes-seconds);
– the latitude and longitude of the vessel;
– the course over ground;
– the vessel speed.
The latitude and longitude coordinates of each AIS transmission are transformed into the identifier of the cell containing the coordinates. Following the suggestion reported in [11], the course over ground is discretized every 45°, thus obtaining N, E, W, S, NE, NW, SE, SW, while the speed is discretized into low, medium and high. After this transformation, the properties of the vessels (name and category), the data of the AIS transmissions (cell, speed, direction) and the interactions between a vessel and the transmitted AIS data are stored as ground atoms in the extensional part of a deductive database. An example of the data stored in the database for the vessel named ALIDA S is reported below.

mmsi(247205900).
name(247205900, alida s).
category(247205900, cargo).
ais(247205900, 2010-10-07 19:51:30).
ais(247205900, 2010-10-07 20:45:26).
ais(247205900, 2010-10-07 21:50:19).
ais(247205900, 2010-10-07 21:55:23).
cell(247205900, 2010-10-07 19:51:30, 312).
cell(247205900, 2010-10-07 20:45:26, 313).
cell(247205900, 2010-10-07 21:50:19, 312).
cell(247205900, 2010-10-07 21:55:23, 311).
direction(247205900, 2010-10-07 19:51:30, northwest).
direction(247205900, 2010-10-07 20:45:26, northwest).
direction(247205900, 2010-10-07 21:50:19, northwest).
direction(247205900, 2010-10-07 21:55:23, northwest).
speed(247205900, 2010-10-07 19:51:30, medium).
speed(247205900, 2010-10-07 20:45:26, medium).
speed(247205900, 2010-10-07 21:50:19, low).
speed(247205900, 2010-10-07 21:55:23, low).

The key predicate mmsi() identifies the reference object (vessel) of the unit of analysis.
The property predicates name(), category(), cell(), direction() and speed() define the value taken by an attribute of an object (the reference object for name() and category(), a task-relevant object for cell(), direction() and speed()). Finally, the structural predicate ais() relates reference objects (vessels) with task-relevant objects (AIS transmissions). This way, the extensional part of the deductive database for SPADA is fed with 19137 atoms partitioned among 106 units of analysis.

3 Maritime Traffic Model Discovery

Studies on association rule discovery in Multi-Relational Data Mining [6] are rooted in the field of Inductive Logic Programming (ILP) [8]. In ILP both relational data and relational patterns are expressed in first-order logic, and the logical notions of a generality order and of downward/upward refinement operators on the space of patterns are used to define both the search space and the search strategy. In the specific case of SPADA, the properties of both the reference and the task-relevant objects are represented in the extensional part DE of a deductive database D [4], while the domain knowledge is represented as a normal logic program which defines the intensional part DI of the deductive database D. In the application of SPADA to marine traffic engineering, the extensional database stores information on the traffic data (e.g., vessel and AIS data) as reported in Section 2, while the intensional database includes the definition of relations which are implicit in the data but useful for capturing the model underlying them. In this study, the intensional part of the database includes a definition of the relation next, which makes explicit the temporal order over the AIS transmissions that is implicit in the timestamp of each transmission.
A possible definition of the relation next is the following:

next(V, A1, A2) ← ais(V, T1), ais(V, T2), cell(V, T1, A1), cell(V, T2, A2), A1 ≠ A2, T1 < T2, not(ais(V, T), T1 < T, T < T2)

which defines the direct sequence relation between two consecutive AIS transmissions of the same vessel. In SPADA, the set of ground atoms in DE is partitioned into a number of non-intersecting subsets D[e] (units of analysis), each of which includes the facts concerning the AIS transmissions involved in a specific vessel trip e. The partitioning of DE is coherent with the individual-centered representation of training data [3], which has both theoretical (PAC-learnability) and computational advantages (a smaller hypothesis space and a more efficient search). The discovery process is performed by resorting to the classical levelwise method described in [7], with the variant that the syntactic ordering between patterns is based on θ-subsumption. With SPADA, fragments of the traffic models underlying the navigations of the various traced vessels can be expressed as relational navigation rules of the form:

mmsi(V) ⇒ µ(V) [s, c],

where mmsi(V) is the atom that identifies a vessel, while µ(V) is a conjunction of atoms which provides the description of a fragment of the navigation trajectory traced for V. Each atom in µ(V) describes either the next relation between AIS transmissions, or a property of the vessel (type or length), or a datum included in the AIS transmission (id of the crossed geographical cell, navigation direction, velocity). An example of a discovered association rule is the following:

vessel(V) ⇒ cell(V,T,123), next(V,123,124), next(V,124,125) [s=63%, c=100%]

The support s estimates the probability p(vessel(V) ∧ µ(V)) on D. This means that s% of the units of analysis D[e] are covered by vessel(V) ∧ µ(V), that is, a substitution θ = {V ← e}·θ1 exists such that (vessel(V) ∧ µ(V))θ ⊆ D[e]. The confidence c estimates the probability p(µ(V) | vessel(V)).
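The next relation and the support of a rule can equivalently be computed outside the deductive database. A sketch over the ALIDA S facts of Section 2 (helper names and the contiguous-subsequence check are our simplification of θ-subsumption-based coverage; SPADA itself works on the logical representation):

```python
# cell/3 facts for one vessel as (timestamp, cell) pairs, already restricted
# to transmissions that change the occupied cell (Section 2).
facts = {
    247205900: [("2010-10-07 19:51:30", 312), ("2010-10-07 20:45:26", 313),
                ("2010-10-07 21:50:19", 312), ("2010-10-07 21:55:23", 311)],
}

def next_pairs(transmissions):
    """Mirror of the intensional next/3 rule: pairs of cells visited by
    consecutive transmissions of the same vessel."""
    ordered = sorted(transmissions)  # these timestamp strings sort chronologically
    return [(a[1], b[1]) for a, b in zip(ordered, ordered[1:])]

def support(fragment, database):
    """Fraction of vessels (units of analysis) whose trajectory contains the
    given sequence of next pairs as a contiguous subsequence."""
    def covers(pairs):
        n = len(fragment)
        return any(pairs[i:i + n] == fragment for i in range(len(pairs) - n + 1))
    covered = sum(1 for t in database.values() if covers(next_pairs(t)))
    return covered / len(database)
```

With the single ALIDA S trip above, the fragment next(V,313,312), next(V,312,311) is covered by the only unit of analysis and therefore has support 1.0.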
Our proposal is to employ SPADA to process a large volume of traffic data and to collect the navigation rules it discovers in order to obtain an interpretable description of the model underlying the maritime traffic. As the navigation rules describe fragments of the trajectories frequently crossed by the monitored vessels, they are then visualized in a GIS environment for human interpretation. For the purposes of this study, we have further extended SPADA by integrating a rule post-processing module which filters out uninteresting rules and ranks the output of the filtering phase on the basis of rule significance. The top-k rules then compose the maritime traffic model. Interesting rules correspond to non-redundant rules. Formally, let R be the navigation rule set output by SPADA. A rule r ∈ R is labeled as redundant in R iff there exist a rule r′ ∈ R and a substitution θ such that rθ ⊂ r′. For example, let us consider the set of navigation rules which comprises:

r1: vessel(V) ⇒ cell(V,T,123).
r2: vessel(V) ⇒ cell(V,T,123), next(V,123,124).
r3: vessel(V) ⇒ cell(V,T,123), next(V,123,124), next(V,124,125).

Both r1 and r2 are redundant in R due to the presence of r3. Redundant rules are implicit in non-redundant rules (although they may have different support, they are always frequent rules); hence we can filter out the redundant navigation rules without losing any knowledge in the maritime traffic model which is finally built. The filtered rules are ranked according to a significance criterion expressed by the pattern length (number of atoms in the rule) and the support value. By decreasing k, we prune less significant knowledge from the model.

4 Maritime Traffic Models

A relational model of the maritime traffic in the Gulf of Taranto (Southern Italy) was extracted by considering two experimental settings, denoted as S1 and S2. In the former setting (S1), the intensional part is populated with the definition of the ternary next predicate formulated as in Section 3.
In the latter setting (S2), the intensional part is populated with intensional definitions of both a new cell predicate and a next predicate, which incorporate the information on the speed and direction of navigation, as follows:

cell(V, T, C, S, D) ← cell(V, T, C), speed(V, T, S), direction(V, T, D).

next(V, A1, A2, S, D) ← ais(V, T1), ais(V, T2), cell(V, T1, A1), cell(V, T2, A2), speed(V, T2, S), direction(V, T2, D), A1 ≠ A2, T1 < T2, not(ais(V, T), T1 < T, T < T2).

In both settings, SPADA is run to discover relational rules with 0.1 as the minimum support and 3 as the minimum pattern length. In the first setting, SPADA outputs the geometric description of fragments of navigation trajectories entering and leaving the Gulf of Taranto. The number of discovered rules is 126. After filtering out redundant rules, 41 rules are ranked according to the significance criterion. The top-ranked navigation rule is reported below:

vessel(V) ⇒ category(V,cargo), cell(V, T, 903), next(V, 903, 904), next(V, 904, 944), next(V, 944, 945), next(V, 945, 946), next(V, 946, 986), next(V, 986, 987). [s=10.3%, c=100%]

This rule states that 10.3% of the vessels monitored in the Gulf of Taranto in the period under study are cargo vessels which follow a navigation trajectory crossing the cells identified by 903, 904, 944, 945, 946, 986 and 987, in this order. The maritime traffic model obtained by selecting the top-5 navigation rules is plotted in Figure 1. By visualizing this model we are able to see the geometric representation of the maritime trajectories which may be busy in the Gulf of Taranto. This information may be employed by the maritime traffic management service in order to opportunely program the maritime traffic in the Gulf of Taranto and avoid gridlocks or vessel accidents. In the second setting, SPADA discovers a more detailed description of the navigation trajectories frequently crossed in the gulf.
In fact, the description mined for each navigation trajectory now comprises both the direction and the velocity of the vessel at each crossed cell in the trajectory. With this setting, SPADA discovers 11 navigation rules. After filtering out redundant rules, 8 rules are ranked according to the significance criterion. The top-ranked navigation rule is reported below:

vessel(V) ⇒ category(V,cargo), cell(V, T, 945, low, northeast), next(V, 945, 946, low, northeast), next(V, 946, 986, low, northeast). [s=11.3%, c=100%]

This rule states that 11.3% of the vessels in this study move across the cells 945, 946 and 986 maintaining a low velocity and a north-east navigation direction. Although this navigation rule describes a shorter trajectory than the top-ranked rule of the first setting, it provides a deeper insight into the navigation behaviour (velocity and direction) of the vessels crossing these cells, which was ignored before.

Fig. 1. (a) Visualization; (b) Ranking. The top-5 relational models of the incoming and outgoing navigation trajectories frequently crossed in the Gulf of Taranto.

5 Conclusions

In this paper, we presented a preliminary study of the application of relational data mining to marine traffic engineering. Relational data mining is required here to represent the multiplicity and variety of the data continuously transmitted by a vessel during navigation. In particular, we prove the viability of a multi-relational approach to obtaining human-interpretable models of the maritime traffic by considering the AIS data transmitted by vessels in the Gulf of Taranto. The results are encouraging and open appealing, novel directions of research in the field of marine traffic engineering. As future work, we plan to explore the task of discovering relational rules which include a disjunction of atoms in the rule body, in order to describe those trajectories which include one or more ramifications in the path.
Additionally, we intend to use the discovered navigation trajectories to obtain a prediction model that makes it possible to predict the position of a vessel at any future time. This task requires the consideration of both geographical constraints, such as the presence of the mainland (or, in general, physical obstacles), and navigation constraints such as velocity, direction, timetable and so on.

Acknowledgment

This work is in partial fulfillment of the research objectives of the ATENEO-2010 project "Modelli e Metodi Computazionali per la Scoperta di Conoscenza in Dati Spazio-Temporali".

References

1. A. Appice, M. Ceci, A. Lanza, F. A. Lisi, and D. Malerba. Discovery of spatial association rules in geo-referenced census data: A relational mining approach. Intelligent Data Analysis, 7(6):541–566, 2003.
2. A. Appice, M. Ceci, A. Turi, and D. Malerba. A parallel, distributed algorithm for relational frequent pattern discovery from very large data sets. Intelligent Data Analysis, 15(1):69–88, 2011.
3. H. Blockeel and M. Sebag. Scalability and efficiency in multi-relational data mining. SIGKDD Explorations, 5(1):17–30, 2003.
4. S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases. Springer-Verlag New York, Inc., New York, NY, USA, 1990.
5. R. Lagerweij. Learning a Model of Ship Movements. Bachelor of Science thesis in Artificial Intelligence, University of Amsterdam, 2009.
6. F. A. Lisi and D. Malerba. Inducing multi-level association rules from multiple relations. Machine Learning, 55(2):175–210, 2004.
7. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
8. S. Muggleton. Inductive Logic Programming. Academic Press, London, 1992.
9. C. Tang and Z. Shao. Modelling urban land use change using geographically weighted regression and the implications for sustainable environmental planning. In Q. Peng, K. C. P. Wang, Y. Qiu, Y. Pu, X. Luo, and B. Shuai, editors, Proceedings of the 2nd International Conference on Transportation Engineering, pages 4465–4470. ASCE, American Society of Civil Engineers, 2009.
10. S. Toyoda and Y. Fujii. Marine traffic engineering. The Journal of Navigation, 24:24–34, 1971.
11. M.-C. Tsou. Discovering knowledge from AIS database for application in VTS. The Journal of Navigation, 63:449–469, 2010.
12. A. Turi, A. Appice, M. Ceci, and D. Malerba. A grid-based multi-relational approach to process mining. In S. S. Bhowmick, J. Küng, and R. Wagner, editors, Proceedings of the 19th International Conference on Database and Expert Systems Applications (DEXA 2008), volume 5181 of Lecture Notes in Computer Science, pages 701–709. Springer, 2008.
13. MarineTraffic. http://www.marinetraffic.com/ais

Author Index

Annalisa Appice, 66
Elena Baralis, 14
Elena Bellodi, 26
Antonio Bruno, 66
Luca Cagliero, 14
Gianni Costa, 38
Sašo Džeroski, 1
Floriana Esposito, 54
Stefano Ferilli, 2, 54
Alessandro Fiori, 14
Saima Jabeen, 14
Fabio Leuzzi, 2
Giuseppe Manco, 38, 46
Elio Masciari, 46
Riccardo Ortale, 38
Domenico Redavid, 54
Fabrizio Riguzzi, 26
Ettore Ritacco, 38
Fulvio Rotella, 2