Distributed mining first-order frequent patterns
Jan Blaťák and Luboš Popelínský
KD Group at Faculty of Informatics
Masaryk University in Brno, Czech Republic
{xblatak,popel}@fi.muni.cz
Abstract. The first version of distributed RAP, an ILP system for finding first-order maximal frequent patterns, is introduced. We then describe our current work on adapting parallel attribute-value techniques for processing multi-relational data.¹
1 Introduction
In our earlier work [3, 4] we presented RAP, a system for finding first-order maximal frequent patterns in multi-relational data. RAP has been successfully applied to many tasks, such as propositional feature construction [3], propositionalization of the mutagenesis and carcinogenesis domains [3], and mining in the STULONG database [5].
Nevertheless, it may be very difficult to use any ILP system, including RAP, for mining a large volume of data. An ILP system running a serial algorithm on a single processor is frequently unable to find all frequent patterns in high-dimensional data because of limited resources: too many candidates need to be processed. In [16], parallel execution techniques are mentioned as one of the principal research directions that need to be followed in the future. Frequent pattern mining is not only a good candidate for parallelization; there is a real necessity to parallelize it.
In this paper we describe a new version of RAP which exploits a distributed architecture.² In Section 2 we describe the current version of distributed RAP. Section 3 outlines attribute-value algorithms for mining frequent patterns that look promising for multi-relational data; we then discuss the coding of patterns and language bias. Section 4 gives an overview of several tasks suitable for the distributed version of RAP. Concluding remarks can be found in Section 5.
¹ This work has been partially supported by the Czech Ministry of Education under Grant no. 143300003.
² In our view, a distributed system is a parallel system whose nodes (computational units, i.e. computers) do not share any resources (memory, peripherals, etc.).
2 Distributed RAP
Distributed data. RAP can now find frequent patterns over distributed data in the same way as Savasere's partition algorithm [17]. Each node first finds all frequent patterns on its fragment of the database. Then each node sends these patterns to the other nodes. Finally, every node computes the global pattern frequencies from the frequencies received from the other nodes.
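The two phases above can be sketched in Python. This is a toy illustration on item sets rather than first-order patterns; the function names and data layout are our own, not RAP's.

```python
from collections import Counter
from itertools import combinations

def local_frequent(fragment, min_freq):
    """Phase 1: one node mines the patterns that are frequent on its own
    fragment (here only item sets of size 1 and 2, for brevity)."""
    counts = Counter()
    for row in fragment:
        for size in (1, 2):
            counts.update(frozenset(c) for c in combinations(sorted(row), size))
    return {p for p, c in counts.items() if c >= min_freq * len(fragment)}

def global_frequencies(fragments, min_freq):
    """Phase 2: the locally frequent patterns are exchanged (their union is
    the global candidate set), every node recounts them on its fragment, and
    the per-fragment counts are summed into global frequencies."""
    candidates = set().union(*(local_frequent(f, min_freq) for f in fragments))
    total = Counter()
    for fragment in fragments:        # in RAP each iteration runs on a separate node
        for row in fragment:
            total.update(p for p in candidates if p <= row)
    n_rows = sum(len(f) for f in fragments)
    return {p: c for p, c in total.items() if c >= min_freq * n_rows}
```

The key property (from [17]) is that any globally frequent pattern must be locally frequent on at least one fragment, so the union of local results is a safe candidate set for the global recount.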
Distributed hypothesis space. The hypothesis space can also be distributed in the current RAP version. However, we have observed that the straightforward method (i.e. without intelligent partitioning of the hypothesis space) does not result in a significant speed-up, because the same patterns can be generated on different nodes.
Figure 1. The relative time (in %) of the RAP algorithm steps on 16 nodes, per node id: loading, generating refinements, discretizing, selecting refinements, checking infrequent patterns, checking known patterns, and coverage computation; the dashed line at 100 % marks the serial run.
Results. Several experiments were performed with the mutagenesis data³ and the B4 background knowledge (atoms, bonds and 2-D structures). The values of logP, LUMO and the partial charges were discretized by RAP into three intervals. The minimal frequency threshold was set to 25 % of all compounds. We used best-first search with one, four and sixteen nodes. Figure 1 displays the relative time of finding LMFP on sixteen nodes. The value 100 % is the time of the serial RAP algorithm on the whole data (the dashed line, serial).
We can observe that the coverage computation and discretization steps speed up linearly. Refinement selection (the most time-consuming step when best-first search is used) consumes about 30 % of the time of the serial algorithm.⁴ It must be stressed that distributed RAP generated more maximal patterns than the non-distributed version (cf. 17 patterns for 4 nodes and 46 patterns for 16 nodes vs. 7 patterns for the non-distributed version).
³ Srinivasan S., King R.D., and Sternberg M. Theories for mutagenicity: a study of first-order and feature based induction. Artificial Intelligence, 85(1-2):277–299, 1996.
⁴ Computational nodes 9 and 11 took much longer than the serial run to generate LMFP because they generated very long patterns (about 14 literals, twice as long as on the other nodes). However, these patterns were not globally frequent.
3 Future work
3.1 Distributed computation of coverage
Agrawal and Shafer [1] adapted the Apriori algorithm [2] for mining association rules over a distributed database and designed the Count Distribution (CD) algorithm. All nodes process their part of the input database with the Apriori algorithm. After computing the coverage of all candidates, the nodes exchange information about the support of these candidates and, in the next iteration, process only the patterns that turned out to be frequent. A similar approach was employed by Konstantopoulos [12] in the Aleph system; he showed that this algorithm performs better than the serial algorithm when a simple background knowledge is used.
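One CD round can be sketched as follows, again on item sets in place of first-order patterns. The outer loop stands in for the per-node computation and the summation for the network exchange; all names are illustrative.

```python
from collections import Counter

def count_distribution(fragments, candidates, min_freq):
    """One CD round: every node counts ALL candidates on its own fragment;
    the per-node counts are then summed (the exchange step) and the globally
    frequent candidates survive into the next Apriori iteration."""
    global_counts = Counter()
    for fragment in fragments:          # each loop body runs on one node
        local_counts = Counter()
        for row in fragment:
            local_counts.update(c for c in candidates if c <= row)
        global_counts += local_counts   # in practice an all-reduce over the network
    n_rows = sum(len(f) for f in fragments)
    return {c for c in candidates if global_counts[c] >= min_freq * n_rows}
```

Note that only counts cross the network, never the data, which is why CD scales with the database size but keeps the full candidate set on every node.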
The main disadvantage of these methods is that each node stores all frequent patterns and all candidates in memory. The system can fail when the candidates consume all the memory of a single node, and load balancing cannot be applied easily. The memory problem can be solved by partitioning the set of candidate patterns over the computational units. Han et al. [11] proposed the Intelligent Data Distribution (IDD) algorithm, which distributes the candidates over the nodes; each node then computes the coverage of its candidates on the whole database. Han showed that this algorithm performs much better than the CD algorithm.
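Contrast this with the CD scheme: in IDD it is the candidate set, not the database, that is partitioned. A minimal sketch, with an illustrative round-robin split (the real IDD uses a smarter partitioning):

```python
from collections import Counter

def intelligent_data_distribution(database, candidates, n_nodes, min_freq):
    """IDD sketch: the CANDIDATES are split over the nodes; each node counts
    only its own slice, but against the whole database."""
    ordered = sorted(candidates, key=sorted)     # deterministic round-robin split
    slices = [ordered[i::n_nodes] for i in range(n_nodes)]
    threshold = min_freq * len(database)
    frequent = set()
    for node_slice in slices:                    # each slice lives on one node
        counts = Counter()
        for row in database:                     # the whole database streams past every node
            counts.update(c for c in node_slice if c <= row)
        frequent |= {c for c in node_slice if counts[c] >= threshold}
    return frequent
```

Each node now needs memory only for its slice of the candidates, at the cost of seeing the entire database.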
Efficient distribution of the hypothesis space (the candidates) over the nodes can be done in two ways. The first is to implement a new refinement operator that generates only the patterns relevant for a specific node. This approach can yield very good results because it requires no additional communication time and the nodes can explore the hypothesis space independently. The other way is to use a master-worker design, where the master node distributes the candidate set over the workers. We explore both methods. The main advantage of the first one is that it does not require any additional resources (network connection and communication interface). The second one is much easier to implement, the master can balance the work between the nodes better, and new methods of candidate generation can be integrated easily.
A hybrid approach is described in [11]. The Hybrid Distribution (HD) algorithm combines the CD and IDD algorithms. The nodes are split into several groups and the database is distributed over these groups. The candidates are partitioned, and each part of the candidates is processed by one node in the group. An ILP system which distributes both the database and the hypothesis space has been proposed by Dehaspe and De Raedt [6]. Both the HD and Dehaspe's algorithms outperform serial algorithms. Moreover, the HD algorithm achieves much better results than the IDD and CD algorithms used separately.
3.2 Coding patterns
Exchanging information about candidates between computational nodes can drastically decrease the performance of a parallel algorithm, especially if whole patterns are sent. It is better to assign a unique code to each pattern and use only this code in the communication. The simplest way to obtain such a code is to apply a tree-encoding algorithm to the tree representation of the frequent pattern (a Prolog goal). To prevent logically equivalent patterns from obtaining different codes, the literals in a pattern are reordered. The new ordering is achieved in two steps. First, literals are sorted by their name and arity. If there are two or more literals with the same name and arity in the pattern, we determine their order by their arguments. For example, by reordering the pattern q(Y, a, X), p(X, Y), q(X, a, Z) we obtain the pattern p(X, Y), q(X, a, Z), q(Y, a, X), because the variable X occurs before the variable Y in the literal p/2. By encoding the reordered pattern we get the same code for all equivalent patterns.
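The reordering step can be sketched as follows. The representation of literals as (name, argument-tuple) pairs and the uppercase-string convention for variables are our own assumptions for illustration; RAP works on Prolog goals.

```python
def canonical_order(pattern):
    """Sort literals by (name, arity); break ties between literals of the same
    name and arity by their arguments, ranking variables by first occurrence."""
    # First pass: sort by name and arity only (Python's sort is stable).
    literals = sorted(pattern, key=lambda lit: (lit[0], len(lit[1])))
    # Rank variables by their first occurrence in this preliminary order.
    rank = {}
    for _name, args in literals:
        for a in args:
            if a[0].isupper() and a not in rank:   # uppercase string = variable
                rank[a] = len(rank)
    # Second pass: tie-break by argument ranks; constants sort after variables
    # (an arbitrary but fixed choice).
    def arg_key(a):
        return (0, rank[a]) if a in rank else (1, a)
    return sorted(literals,
                  key=lambda lit: (lit[0], len(lit[1]),
                                   [arg_key(a) for a in lit[1]]))
```

On the paper's example, q(Y, a, X), p(X, Y), q(X, a, Z) comes out as p(X, Y), q(X, a, Z), q(Y, a, X); encoding this reordered form (e.g. by a traversal of its term tree) then yields the same code for all equivalent patterns.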
3.3 Language bias
When mining first-order patterns, one frequently does not need to work with the full first-order language. Mining in a multi-relational database without any hand-written domain knowledge is an example: the domain knowledge needed can be constructed automatically or semi-automatically from the database schema. In NLP, namely in the morphological disambiguation task, it has been shown [14] that permitting only one input variable per literal does not prevent a learning system from solving the disambiguation task and results in a significant speed-up of coverage computation. In [13] a wide spectrum of techniques for declarative language bias is introduced.
4 Applications
Many natural language processing (NLP) tasks can be seen as finding frequent patterns in morphologically, syntactically or semantically tagged large corpora. The found patterns can be used e.g. for finding morphological rules, for word sense disambiguation, and for detecting multi-word expressions. For real NLP tasks the data become very large; moreover, the search space is enormous because of the many possible combinations of tags. Protein databases contain sequences of proteins, their structure, descriptions of their functionality, etc. For example, the International Protein Sequence Database (pir.georgetown.edu) contains 283 000 sequences. Although it is very important to analyze such data, the complexity of this task is beyond the capabilities of current ILP systems. Mining in XML documents is also a challenging task. For instance, the Internet Movie Database (IMDb, http://www.imdb.com/) contains gigabytes of data about movies, actors, user comments, etc. This kind of data is very difficult to analyze with attribute-value learning systems because of the complex structure of XML documents.
5 Concluding remarks
Methods for the evaluation of Datalog queries based e.g. on magic sets may also be useful; parallel evaluation of Datalog queries has been explored in the past [7, 9]. Privacy-preserving techniques, introduced e.g. in [8], can be easily employed in mining first-order data as well. The distributed version of RAP will run on a cluster of PCs. We also intend to use a grid architecture.
References
1. Agrawal R. and Shafer J. C. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969, 1996.
2. Agrawal R. and Srikant R. Fast algorithms for mining association rules in large
databases. In Proc. of VLDB’94 Conf., pp. 487–499. Morgan Kaufmann, 1994.
3. Blaťák J. and Popelínský L. Feature construction with RAP. In Proc. of the WP at the 13th ILP Conf., pp. 1–11. Univ. of Szeged, 2003.
4. Blaťák J., Popelínský L., and Nepil M. RAP: Framework for mining frequent Datalog patterns. In Proc. of the 1st KDID ws. at ECML/PKDD 2002, pp. 85–86.
5. Blaťák J. Mining first-order frequent patterns in the STULONG database. In Proc. of the ECML/PKDD 2004 Challenge, 2004.
6. Dehaspe L. and De Raedt L. Parallel inductive logic programming. In Proc. of the
MLnet Familiarization Workshop on Statistics, ML and KDD, 1995.
7. Dong G. and Su J. Increment boundedness and non-recursive incremental evaluation of Datalog queries. Lecture Notes in Computer Science, 893:397–406, 1995.
8. Frieser J. and Popelínský L. DIODA: Secure mining in horizontally partitioned data. In Proc. of the P&S Issues in DM ws. at ECML/PKDD 2004, pp. 12, 2004.
9. Ganguly S., Silberschatz A., and Tsur S. A framework for the parallel processing
of datalog queries. In Proc. of the SIGMOD int. conf. on MD. ACM Press, 1990.
10. Graham J., Page C.D., and Kamal A. Accelerating the Drug Design Process
through Parallel ILP DM. In Proc. of the CSB’03, pp. 3, 2003.
11. Han E.-H., Karypis G., and Kumar V. Scalable parallel data mining for association
rules. In SIGMOD, Proc. ACM SIGMOD Int. Conf. on MD. ACM Press, 1997.
12. Konstantopoulos S. T. A Data-Parallel Version of Aleph. In P&D computing for
ML. In conjunction with the ECML/PKDD 2003 Conf., 2003. Springer-Verlag.
13. Nédellec C., Rouveirol C., Adé H., Bergadano F., and Tausend B. Declarative bias in ILP. In De Raedt L. (ed.), Advances in Inductive Logic Programming, pp. 82–103. IOS Press, 1996.
14. Nepil M. INDEED: a system for relational rule induction in domains with rich space of constant terms. In Proc. of the WP at the 13th ILP Conf. Univ. of Szeged, 2003.
15. Ohwada H., Nishiyama H., and Mizoguchi F. Concurrent execution of optimal
hypothesis search for inverse entailment. In Proc. of the 10th ILP Conf. , vol.
1866, pp. 165–173. Springer-Verlag, 2000.
16. Page D. and Srinivasan A. ILP: A short Look Back and a Longer Look Forward.
Journal of Machine Learning Research, 4:415–430, 2003.
17. Savasere A., Omiecinski E., and Navathe S. B. An efficient algorithm for mining
association rules in large databases. In The VLDB Journal, pp. 432–444, 1995.
18. Skillicorn D. B. and Wang Y. Parallel and sequential algorithms for data mining
using inductive logic. Knowl. Inf. Syst., 3(4):405–421, 2001.