Distributed mining first-order frequent patterns

Jan Blaťák and Luboš Popelínský
KD Group at Faculty of Informatics
Masaryk University in Brno, Czech Republic
{xblatak,popel}@fi.muni.cz

Abstract. The first version of distributed RAP, an ILP system for finding first-order maximal frequent patterns, is introduced. We then describe our current work on adapting parallel attribute-value techniques to the processing of multi-relational data.^1

1 Introduction

In our earlier work [3, 4] we presented RAP, a system for finding first-order maximal frequent patterns in multi-relational data. RAP was successfully applied to many tasks, such as propositional feature construction [3], propositionalization of the mutagenesis and carcinogenesis domains [3], and mining in the STULONG database [5]. Nevertheless, it may be very difficult to use any ILP system, including RAP, for mining in a large volume of data. An ILP system running a serial algorithm on a single processor is frequently unable to find all frequent patterns in high-dimensional data because of limited resources: too many candidates need to be processed. In [16], parallel execution techniques are mentioned as one of the principal research directions that need to be followed in the future. Frequent pattern mining is not only a good candidate for parallelization; for large data it is a necessity.

In this paper we describe a new version of RAP that exploits a distributed architecture.^2 In Section 2 we describe the current version of distributed RAP. Section 3 outlines attribute-value algorithms for mining frequent patterns that look promising for multi-relational data; we then discuss the coding of patterns and language bias. Section 4 gives an overview of several tasks that are appropriate for the distributed version of RAP. Concluding remarks can be found in Section 5.

^1 This work has been partially supported by the Czech Ministry of Education under Grant no. 143300003.
^2 In our view, a distributed system is a parallel system whose nodes (computational units, i.e. computers) do not share any resources (memory, peripherals, etc.).

2 Distributed RAP

Distributed data. RAP can now find frequent patterns over distributed data in the same way as Savasere's distributed algorithm [17]. Each node first finds all frequent patterns on its own fragment of the database. Then each node sends these patterns to the other nodes. Finally, the nodes compute the global pattern frequencies from the frequencies received from the other nodes; a sketch of this partition-and-merge scheme is given below.

Distributed hypothesis space. The hypothesis space can also be distributed in the current version of RAP. However, we have observed that the straightforward method (i.e. without intelligent partitioning of the hypothesis space) does not result in a significant speed-up, because the same patterns can be generated on different nodes.
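This scheme can be illustrated with a short sketch. The Python fragment below is an illustration only, not RAP's implementation: for readability it represents patterns as itemsets over a propositional database, whereas RAP mines first-order (Datalog) patterns, and it returns all frequent patterns rather than the maximal ones; the function names are ours. The exchange-and-recount structure, however, is the same in both settings.

from itertools import combinations

def mine_local_frequent(partition, min_freq):
    # Count every itemset occurring in this node's fragment of the database.
    counts = {}
    for transaction in partition:
        for size in range(1, len(transaction) + 1):
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] = counts.get(itemset, 0) + 1
    # Keep the itemsets that are frequent on this fragment.
    return {p for p, c in counts.items() if c >= min_freq * len(partition)}

def distributed_mine(partitions, min_freq):
    # Step 1: each node mines its own fragment independently.
    local = [mine_local_frequent(part, min_freq) for part in partitions]
    # Step 2: the nodes exchange their locally frequent patterns; every
    # globally frequent pattern is locally frequent on at least one node.
    candidates = set().union(*local)
    # Step 3: global frequencies are computed for all exchanged candidates.
    total = sum(len(part) for part in partitions)
    support = {p: sum(1 for part in partitions for t in part if set(p) <= t)
               for p in candidates}
    return {p: c for p, c in support.items() if c >= min_freq * total}

partitions = [[{"a", "b"}, {"a", "c"}], [{"a", "b"}, {"b", "c"}]]
print(distributed_mine(partitions, min_freq=0.5))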
[Figure 1. The relative time of the individual steps of the RAP algorithm on 16 nodes. The original plot shows relative time (in %) against node id, with curves for: serial, loading, generating refinements, discretizing, selecting refinements, checking infrequent, checking known, computing coverage, and all steps together.]

Results. Several experiments were performed with the mutagenesis data^3; the B4 background knowledge (atoms, bonds and 2-D structures) was used. The values of logP, LUMO and the partial charges were discretized by RAP into three intervals. The minimal frequency threshold was set to 25 % of all compounds. We used best-first search with one, four and sixteen nodes. Figure 1 displays the relative times of finding the LMFPs on sixteen nodes. The value 100 % is the time of the serial RAP algorithm on the whole data (the dashed serial line). We can observe that the coverage computation and discretization steps speed up linearly. Refinement selection (the most time-consuming step when best-first search is used) consumes about 30 % of the time of the serial algorithm.^4 It must be stressed that distributed RAP generated more maximal patterns than the non-distributed version (cf. 17 patterns for 4 nodes and 46 patterns for 16 nodes against 7 patterns for the non-distributed version).

^3 Srinivasan A., King R. D., and Sternberg M. Theories for mutagenicity: a study of first-order and feature based induction. Artificial Intelligence, 85(1-2):277–299, 1996.
^4 The computational nodes 9 and 11 generated the LMFPs for much longer than the serial algorithm because they generated very long patterns (about 14 literals, twice as long as those on the other nodes). However, these patterns were not globally frequent.

3 Future work

3.1 Distributed computation of coverage

Agrawal and Shafer [1] adapted the Apriori algorithm [2] for mining association rules over a distributed database and designed the Count Distribution (CD) algorithm. All nodes process their part of the input database with the Apriori algorithm. After computing the coverage of all candidates, the nodes exchange the support counts of these candidates, and only the globally frequent patterns are processed in the next iteration (a sketch of this count-exchange step closes this subsection). A similar approach was employed by Konstantopoulos [12] in the Aleph system; he showed that, when simple background knowledge is used, this algorithm performs better than the serial one. The main disadvantage of these methods is that each node stores all frequent patterns and all candidates in memory. The system can fail when the candidates exhaust the memory of a single node. Load balancing is also hard to apply.

The problems with insufficient memory can be solved by partitioning the set of candidate patterns over the computational units. Han et al. [11] proposed the Intelligent Data Distribution (IDD) algorithm, which distributes the candidates over the nodes, each node computing the coverage of its candidates on the whole database. Han showed that this algorithm performs much better than the CD algorithm.

Efficient distribution of the hypothesis space (the candidates) over the nodes can be achieved in two ways. First, a new refinement operator can be implemented that generates only the patterns relevant for a specific node. This approach can give very good results because it requires no additional communication time and the nodes can explore the hypothesis space independently. The second way is a master-worker design in which the master node distributes the candidate set over the workers. We explore both methods. The main advantage of the first one is that it does not require any additional resources (network connection and communication interface). The second one is much easier to implement, the master can balance the work between the nodes better, and new methods of candidate generation can be integrated easily.

A hybrid approach is described in [11]. The Hybrid Distribution (HD) algorithm combines the CD and IDD algorithms. The nodes are split into several groups and the database is distributed over these groups; the candidates are partitioned, and each part is processed by one node of the group. An ILP system which distributes both the database and the hypothesis space has been proposed by Dehaspe and De Raedt [6]. Both the HD and Dehaspe's algorithms outperform serial algorithms. Moreover, the HD algorithm achieves much better results than the IDD and CD algorithms used separately.
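As an illustration, the following Python sketch condenses one iteration of the CD scheme into a single function; it is our simplification, not code from [1] or [12]. Patterns are again represented as plain itemsets and coverage as set containment, whereas in the ILP setting the coverage test is a first-order subsumption test; the message exchange between the nodes is simulated by summing the local count vectors (in a real system this would be a collective operation such as an MPI allreduce).

def local_counts(candidates, partition):
    # Support of every candidate on one node's fragment of the database.
    return [sum(1 for t in partition if set(c) <= t) for c in candidates]

def count_distribution_round(candidates, partitions, min_freq):
    # Every node holds the full candidate set (the memory bottleneck noted
    # above) but scans only its own fragment of the data.
    per_node = [local_counts(candidates, part) for part in partitions]
    # The exchange step: summing the per-node count vectors plays the role
    # of the message exchange between the nodes.
    global_counts = [sum(col) for col in zip(*per_node)]
    total = sum(len(part) for part in partitions)
    # Only globally frequent candidates survive into the next iteration.
    return [c for c, n in zip(candidates, global_counts) if n >= min_freq * total]

candidates = [("a",), ("b",), ("a", "b")]
partitions = [[{"a", "b"}, {"a"}], [{"b"}, {"a", "b"}]]
print(count_distribution_round(candidates, partitions, min_freq=0.5))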
3.2 Coding patterns

Exchanging information about candidates between computational nodes can drastically decrease the performance of a parallel algorithm, namely when whole patterns are sent. It is better to assign a unique code to each pattern and to use only this code in the communication. The simplest way to obtain such a code is to apply a tree-encoding algorithm to the tree representation of the frequent pattern (a Prolog goal). To avoid obtaining different codes for logically equivalent patterns, the literals in the pattern are reordered first. The new ordering is found in two steps. First, the literals are sorted by their name and arity. If two or more literals in the pattern have the same name and arity, we determine their order from their arguments. For example, by reordering the pattern q(Y,a,X), p(X,Y), q(X,a,Z) we obtain the pattern p(X,Y), q(X,a,Z), q(Y,a,X), because the variable X occurs before the variable Y in the literal p/2. By encoding the reordered pattern we then obtain the same code for all equivalent patterns, as the sketch below illustrates.
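The two-step reordering can be made concrete with a small sketch. The following Python fragment, our illustration rather than RAP's actual code, reorders literals as described above and runs on the example from the text; the string it returns merely stands in for the unspecified tree encoding and could be hashed to obtain a fixed-length code.

def canonical_code(pattern):
    # Sort literals by predicate name, then arity, then argument tuple;
    # encode the reordered pattern as a string.
    ordered = sorted(pattern, key=lambda lit: (lit[0], len(lit[1]), lit[1]))
    return ";".join(name + "(" + ",".join(args) + ")" for name, args in ordered)

# The example from the text: q(Y,a,X), p(X,Y), q(X,a,Z).
pattern = [("q", ("Y", "a", "X")), ("p", ("X", "Y")), ("q", ("X", "a", "Z"))]
print(canonical_code(pattern))  # prints p(X,Y);q(X,a,Z);q(Y,a,X)

Note that this sketch canonicalizes only the order of the literals; making two patterns that differ in variable names receive the same code is left to the encoding step.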
3.3 Language bias

When mining first-order patterns, one frequently does not need to work with the full first-order language. Mining in a multi-relational database without any hand-written domain knowledge is an example: the domain knowledge needed can be constructed automatically or semi-automatically from the database schema. In NLP, namely in the morphological disambiguation task, it has been shown [14] that permitting only one input variable in a literal does not prevent a learning system from solving the disambiguation task and results in a significant speed-up of the coverage computation. A wide spectrum of techniques for declarative language bias has been introduced in [13].

4 Applications

Many natural language processing (NLP) tasks can be seen as finding frequent patterns in morphologically, syntactically or semantically tagged large corpora. The patterns found can be used e.g. for finding morphological rules, for word sense disambiguation, and for detecting multi-word expressions. In real NLP tasks, the data become very large; moreover, the search space is enormous because of the many possible combinations of tags.

Protein databases contain sequences of proteins, their structure, descriptions of their functionality, etc. For example, the International Protein Sequence Database (pir.georgetown.edu) contains 283 000 sequences. Although it is very important to analyze such data, the complexity of this task is beyond the capabilities of current ILP systems.

Mining in XML documents is also a challenging task. For instance, the Internet Movie Database (IMDb, http://www.imdb.com/) contains gigabytes of data about movies, actors, users' comments, etc. This kind of data is very difficult to analyze with attribute-value learning systems because of the complex structure of XML documents.

5 Concluding remarks

Methods for the evaluation of Datalog queries, based e.g. on magic sets, can be useful, too; parallel evaluation of Datalog queries has been explored in the past [7, 9]. Privacy-preserving techniques, introduced e.g. in [8], can easily be employed in mining first-order data as well. The distributed version of RAP will run on a cluster of PCs. We also intend to use a grid architecture.

References

1. Agrawal R. and Shafer J. C. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, 1996.
2. Agrawal R. and Srikant R. Fast algorithms for mining association rules in large databases. In Proc. of VLDB'94 Conf., pp. 487–499. Morgan Kaufmann, 1994.
3. Blaťák J. and Popelínský L. Feature construction with RAP. In Proc. of the WP at the 13th ILP Conf., pp. 1–11. Univ. of Szeged, 2003.
4. Blaťák J., Popelínský L., and Nepil M. RAP: Framework for mining frequent Datalog patterns. In Proc. of the 1st KDID ws. at ECML/PKDD 2002, pp. 85–86, 2002.
5. Blaťák J. Mining first-order frequent patterns in the STULONG database. In Proc. of the ECML/PKDD 2004 Challenge, 2004.
6. Dehaspe L. and De Raedt L. Parallel inductive logic programming. In Proc. of the MLnet Familiarization Workshop on Statistics, ML and KDD, 1995.
7. Dong G. and Su J. Increment boundedness and non-recursive incremental evaluation of Datalog queries. Lecture Notes in Computer Science, 893:397–406, 1995.
8. Frieser J. and Popelínský L. DIODA: Secure mining in horizontally partitioned data. In Proc. of the P&S Issues in DM ws. at ECML/PKDD 2004, p. 12, 2004.
9. Ganguly S., Silberschatz A., and Tsur S. A framework for the parallel processing of Datalog queries. In Proc. of the SIGMOD Int. Conf. on Management of Data. ACM Press, 1990.
10. Graham J., Page C. D., and Kamal A. Accelerating the drug design process through parallel ILP data mining. In Proc. of CSB'03, p. 3, 2003.
11. Han E.-H., Karypis G., and Kumar V. Scalable parallel data mining for association rules. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data. ACM Press, 1997.
12. Konstantopoulos S. A data-parallel version of Aleph. In Parallel and Distributed Computing for ML, in conjunction with the ECML/PKDD 2003 Conf. Springer-Verlag, 2003.
13. Nédellec C., Rouveirol C., Adé H., Bergadano F., and Tausend B. Declarative bias in ILP. In De Raedt L. (ed.), Advances in Inductive Logic Programming, pp. 82–103. IOS Press, 1996.
14. Nepil M. INDEED: A system for relational rule induction in domains with rich space of constant terms. In Proc. of the WP at the 13th ILP Conf. Univ. of Szeged, 2003.
15. Ohwada H., Nishiyama H., and Mizoguchi F. Concurrent execution of optimal hypothesis search for inverse entailment. In Proc. of the 10th ILP Conf., vol. 1866, pp. 165–173. Springer-Verlag, 2000.
16. Page D. and Srinivasan A. ILP: A short look back and a longer look forward. Journal of Machine Learning Research, 4:415–430, 2003.
17. Savasere A., Omiecinski E., and Navathe S. B. An efficient algorithm for mining association rules in large databases. In Proc. of VLDB'95 Conf., pp. 432–444, 1995.
18. Skillicorn D. B. and Wang Y. Parallel and sequential algorithms for data mining using inductive logic. Knowl. Inf. Syst., 3(4):405–421, 2001.