High Performance Distributed Systems for Data Mining

Paolo Palmerini
Università Ca' Foscari, Venice, Italy
[email protected]

November 19, 2002

Abstract

The set of algorithms and techniques used to extract interesting patterns and trends from huge data repositories is called Data Mining. Due to the typical complexity of the computations and the amount of data handled, the performance control of Data Mining algorithms is still an open problem. Despite the many important results obtained so far for specific cases, a more general framework is needed for the development of Data Mining applications, one in which performance can be controlled effectively. We propose a research activity aimed at the design of a hardware/software architecture for Data Mining. Such an architecture will be based on the generalization of the results found during the design of High Performance Data Mining algorithms, and will be aware of the current trends in High Performance Large Scale Distributed Computing Platforms, namely Clusters of Workstations and Computational Grids. Although the thesis will focus on the architecture design, we also plan to investigate the implementation issues of such a system, by means of simulations and with the deployment of a working small scale instance of the architecture. Part of this research is carried out in collaboration with prof. Zaki at the Rensselaer Polytechnic Institute, Troy, NY, USA.

1 Introduction

The ability to extract useful and non-trivial information from the huge amount of data that it is currently possible to collect and store in many and diverse fields of science and business is one of the challenges that computer science researchers are currently facing. The set of algorithms and techniques that were developed in the last decades to extract interesting patterns from huge data repositories is called Data Mining (DM). Such techniques are part of a bigger framework, referred to as Knowledge Discovery in Databases (KDD), that covers the whole process, from data preparation to knowledge modeling. Within this process, DM techniques and algorithms are the actual tools that analysts have at their disposal to find unknown patterns and correlations in the data. Typical DM tasks are classification (assign each record of a database to one of a predefined set of classes), clustering (find groups of records that are close according to some defined metric) and association rules (determine implication rules for a subset of record attributes). A considerable number of algorithms have been developed to perform these and other tasks, drawing on many fields of science, from machine learning to statistics through neural and fuzzy computing. What was a hand-tailored set of case-specific recipes about ten years ago is now recognized as a proper science [46]. It is sufficient to consider the remarkably wide spectrum of applications where DM techniques are currently being applied to understand the ever growing interest of the research community in this discipline. Among the traditional sciences we mention astronomy [35], high energy physics, biology and medicine, which have always provided a rich source of applications to data miners. An important field of application for data mining techniques is also the World Wide Web [38]. The Web provides access to one of the largest data repositories, which in most cases still remains to be analyzed and understood. Recently, Data Mining techniques have also been applied to social sciences, homeland security and counter terrorism [33].
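To make the association rule task mentioned above concrete, the toy Python sketch below computes the support and confidence of a single candidate rule; the transactions, the candidate rule and the thresholds are invented for the example and do not come from any cited work.

```python
# Toy illustration of the association rule task: for the candidate rule
# {bread, butter} -> {milk} we count support (how often the whole itemset
# occurs) and confidence (how often the consequent occurs when the
# antecedent does). Transactions and thresholds are made up for the example.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk", "beer"},
    {"bread", "butter", "milk", "beer"},
]

antecedent = {"bread", "butter"}
consequent = {"milk"}

n = len(transactions)
support_count = sum(1 for t in transactions if antecedent | consequent <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = support_count / n
confidence = support_count / antecedent_count

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# With this data: support = 0.40, confidence = 0.67 -> the rule is kept only
# if both values exceed the user-defined minimum thresholds.
```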
Due to its relatively recent development, Data Mining still poses many challenges to the research community. New methodologies are needed to mine more interesting and specific information from the data, new frameworks are needed to harmonize more effectively all the steps of the KDD process, and new solutions will have to manage the complex and heterogeneous sources of information available to analysts. One problem that has always been claimed as one of the most important to address, but has never been solved in general terms, is the performance of DM algorithms. As a matter of fact, the complexity of such algorithms depends not only on external properties of the input data, like size, number of attributes, number of records, and so on, but also on internal properties of the data, such as correlations and other statistical features that can only be known at run time. This makes the problem of controlling the performance of DM algorithms extremely difficult.

This thesis is focused on the design of a High Performance and Distributed System for Data Mining. We will distinguish between a DM algorithm, which is a single DM kernel, a DM application, which is a more complex element whose execution in general involves the execution of several DM algorithms, and a DM system, which is the framework within which DM applications are executed. A DM system is therefore composed of a software environment that provides all the functionalities to compose DM applications, and a hardware back-end onto which the DM applications are executed. In the rest of this document, a more precise definition of the characteristics of the proposed Data Mining System is given, together with a motivation for its realization and a survey of the main recent results obtained in this field.

2 State of the Art and Open Problems

2.1 Parallel and Distributed Data Mining

The performance of DM algorithms has always been a main concern for data miners. To mention one example, consider the remarkable number of algorithms that exist for solving one popular DM problem, the so-called Frequent Set Counting (FSC) problem. Since its first introduction in 1993 by Agrawal [2], who also proposed the popular Apriori algorithm, a number of other algorithms have come to populate the scientific literature [42], [12], [29], [54], [36], [44], [3], [39]. As far as performance is concerned, parallel High Performance Computing platforms have always constituted a natural target architecture for them. In [53] M. J. Zaki reviews most of the efforts in the field of parallel association mining, analyzing different approaches and strategies for parallelization. Decision tree construction is in general not a trivial task to parallelize; notable results are reported in [50] and [32]. Clustering algorithms generally present a structure that is easier to parallelize: several parallelizations of the popular k-means algorithm have been proposed on distributed memory architectures [19], on large PC clusters [51], and on clusters of SMPs [5]. Despite this considerable amount of work on the performance and efficiency of DM algorithms, very few general results can be outlined so far. As in the case of the cited FSC problem, there is no evident best solution; rather, different algorithms perform differently depending on the input data, the target architecture and the user-defined parameter values.
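To make the parallel structure mentioned for clustering concrete, the following sketch mimics the data-parallel scheme used by distributed k-means implementations such as [19]: each partition produces only small per-cluster sums and counts, which are then reduced to update the centroids. It is a minimal illustration, not code from the cited works; the data, the number of partitions and the sequential reduction all stand in for what would be message passing on a real COW.

```python
# Minimal sketch of the data-parallel structure behind distributed k-means:
# each data partition independently accumulates per-cluster sums and counts,
# and only these small partial results are reduced to update the centroids.
# On a real COW the partitions would live on different nodes and the
# reduction would be a message-passing step; here it is sequential.
import random

def local_step(points, centroids):
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # assign the point to its closest centroid
        j = min(range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2 for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def kmeans(partitions, centroids, iterations=10):
    k, dim = len(centroids), len(centroids[0])
    for _ in range(iterations):
        partials = [local_step(part, centroids) for part in partitions]  # parallel in principle
        for j in range(k):
            total = sum(counts[j] for _, counts in partials)
            if total:
                centroids[j] = [sum(sums[j][d] for sums, _ in partials) / total
                                for d in range(dim)]
    return centroids

random.seed(0)
data = [[random.random(), random.random()] for _ in range(1000)]
partitions = [data[i::4] for i in range(4)]          # 4 simulated "nodes"
print(kmeans(partitions, [[0.2, 0.2], [0.8, 0.8]]))
```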
There is a need for detailed analytical performance models that take into account all the factors that influence resource usage (CPU, memory and disks), in order to devise adaptive techniques that allow the best known solution for the specific case to be adopted. Some work has already been started along the lines of performance modeling. In [49], David Skillicorn argues that benchmarking and implementations are very expensive approaches to performance debugging of parallel DM algorithms. He proposes a number of cost-effective alternative measures (counting computations, data accesses, and communication). These measures can provide a reasonably accurate picture of an application's performance. In [31] and [11] there is an analysis of resource usage and workload characterization for DM algorithms.

The observation that many different Data Mining algorithms share common structure and properties has been pointed out in many works [30, 49, 17, 7]. Nevertheless, a unification of the partial results found on single algorithms is still an open problem. The main lines of research have been conducted at the language level, as in [43], where Parthasarathy and Subramonian present a language construct (a SIMD DOALL) for the design of parallel Data Mining programs. In [22] and [21], Saltz et al. introduce a set of language extensions and a prototype compiler for supporting high-level object-oriented programming of data intensive reduction operations over multidimensional data, using a run-time system called Active Data Repository.

Some further efforts have been devoted to the definition of complete DM systems that include all the aspects of distributed knowledge discovery, from handling distributed data to applying parallel processing for pattern identification. Papyrus [28] is the most complete such project. Clusters of data and compute hosts form the whole system, which is distributed over a wide area network. The performance characteristics of inter-cluster and intra-cluster communications are considered in determining whether data, models or results are to be transferred in order to achieve high efficiency. On top of the clustered hosts, a layered architecture is built, composed of a set of tools devised to facilitate the local Data Mining and wide area combining process. A different, more domain-specific approach is that of the SUBDUE system. SUBDUE [27] is a Knowledge Discovery System for structural databases; parallel and distributed implementations of the system are described and discussed in [27]. As far as the investigation of architectures for DM is concerned, we can mention [18]. It is worth mentioning that all these projects seem to have been abandoned in recent years, while a great deal of attention is still paid to performance and to the generalization of results. We think that one limitation of these previous projects is that they did not properly consider the current trends in distributed architectures. Therefore many solutions had to be found from scratch, resulting in a global lack of effectiveness of the proposed systems. This is probably the case for the Papyrus system, whose layered architecture presents more than one point of similarity with computational Grids (see below for an introduction to Grids).
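To fix ideas on what an analytical performance model along the lines advocated in [49] could look like, the following sketch combines abstract counts of computations, data accesses and communication into a predicted runtime. The formula and all constants are illustrative placeholders, not values taken from the cited work.

```python
# Illustrative skeleton of an analytical cost model in the spirit of [49]:
# the predicted runtime is a weighted combination of abstract counts of
# computation, data accesses and communication. The counts would come from
# an analysis of the algorithm, the per-unit costs from the target machine.
# All numbers below are placeholders, not values from the cited work.

def predicted_time(n_ops, n_data_accesses, n_comm_bytes,
                   t_op=1e-9, t_access=1e-7, t_byte=1e-8):
    cpu = n_ops * t_op                  # computation term
    io = n_data_accesses * t_access     # memory/disk access term
    comm = n_comm_bytes * t_byte        # communication term
    return cpu + io + comm

# Example: a hypothetical candidate-counting pass over a large dataset.
print(predicted_time(n_ops=5e8, n_data_accesses=1e7, n_comm_bytes=2e6))
```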
We argue that awareness of the features of modern HPC and large scale distributed architectures, together with an in-depth analysis of the performance costs of DM algorithms, will lead to the realization of a general DM system, able to actually scale to arbitrary data sizes, adaptive to different hardware characteristics, and effective in handling inherently distributed data.

2.2 The architectural framework

Clusters of workstations (COWs) are now a widespread platform for High Performance Computing [13] [52]. Due to the performance achieved by commodity hardware components and open source operating systems, it is not an exception to find Linux based COWs among the top ten most powerful machines on earth [34]. Our concern for performance will therefore lead us to the development of solutions specifically targeted at COW-based platforms.

Another crucial architectural constraint imposed by data mining applications is the inherently distributed nature of the data. Such data cannot in general, either for privacy or feasibility reasons, be gathered at a single site. Therefore the natural architecture for the development of DM applications is a distributed one. Large scale distributed computing platforms have recently been described within a unified paradigm called the Grid [24]. In the words of I. Foster, one of the fathers of the Grid concept [26]:

"The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (VO). The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering."

Among such strategies, Data Mining is one of the most challenging. A number of middleware systems for the deployment of actual Grids have been developed in the last five years. Among them, the most successful are the Globus Toolkit [23], Nexus [25] and Condor [20].

We can consider Parallel DM (PDM) and Distributed DM (DDM), and in particular Grid-aware DDM, as results of the natural evolution of DM technologies. Recently, a framework for such applications on Grid platforms has been proposed, the Knowledge Grid (K-Grid) [16]. The K-Grid is a middleware for distributed KDD. It is composed of two layers. At the bottom there is the layer of core services, implemented over standard grid middleware, like Globus. A set of higher level services provide specialized functions for the Knowledge Discovery process. These services can be used to construct complex Problem Solving Environments, which exploit Data Mining kernels as basic software components that can be applied one after the other, in a modular way [15]. One important issue concerning grid computing is resource management. Current grid technology [23] simply provides the tools to implement the management of resources. On top of these tools, some prototype resource brokers have been implemented [14] [4], but the efforts in this direction still need to be continued.

A general DM task on the K-Grid can therefore be described as a Directed Acyclic Graph (DAG) whose nodes are the DM algorithms being applied and whose links represent the data dependencies among the components. K-Grid users interact with the K-Grid by composing and submitting DAGs, i.e. the application of a set of DM kernels to a set of datasets. For example, we can perform an initial clustering on a given dataset in order to extract groups of homogeneous records, and then look for association rules within each cluster. In this scenario, one important service of the K-Grid is the one in charge of mapping task requests onto physical resources. The user will in fact have a transparent view of the system and possibly little or no knowledge of the physical resources where the computations will be executed, nor does he or she know where the data actually reside. The only thing the user must be concerned with is the semantics of the application, i.e. what kind of analysis he or she wants to perform and on which data.
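The example application just described (an initial clustering followed by association mining within each cluster) can be written down as such a DAG. The sketch below uses a generic Python representation; it does not reproduce the actual K-Grid submission interface, and the kernel names and dataset identifier are invented for the illustration.

```python
# A DM task as a DAG of kernels, for the example in the text: cluster a
# dataset first, then mine association rules inside each cluster. This is a
# generic representation, not the real K-Grid interface.

dag = {
    "nodes": {
        "clustering": {"kernel": "k-means", "input": "dataset://sales"},
        "rules_c0":   {"kernel": "apriori", "input": "output-of:clustering#0"},
        "rules_c1":   {"kernel": "apriori", "input": "output-of:clustering#1"},
    },
    # edges encode the data dependencies between kernels
    "edges": [("clustering", "rules_c0"), ("clustering", "rules_c1")],
}

def topological_order(dag):
    """Order in which a scheduler could dispatch the kernels."""
    pending = dict.fromkeys(dag["nodes"], 0)
    for _, dst in dag["edges"]:
        pending[dst] += 1
    ready = [n for n, deg in pending.items() if deg == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for src, dst in dag["edges"]:
            if src == n:
                pending[dst] -= 1
                if pending[dst] == 0:
                    ready.append(dst)
    return order

print(topological_order(dag))   # e.g. ['clustering', 'rules_c1', 'rules_c0']
```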
Many efforts have already been devoted to the problem of scheduling distributed jobs on Grid platforms. Some such schedulers are Nimrod/G [1], Condor [20] and AppLeS [8]. Recently, a general architecture for grid schedulers has been outlined by J. Schopf in [48]. She describes three main phases in the activity of a grid scheduler. The first phase is devoted to resource discovery: during this phase a set of candidate machines where the application can be executed is built, obtained by filtering the machines which the user has sufficient privileges to access and which at the same time satisfy some minimum requirements expressed by the user. In the second phase one specific resource is selected among the ones determined in the previous phase. This choice is made on the basis of information about system status - e.g. machine loads, network traffic - and again of possible user requirements in terms of execution deadline or limited budget. Finally, the third phase is devoted to the actual job execution, from reservation to completion, through submission and monitoring. The second phase is the most challenging, since it is strictly application dependent, and many of the schedulers mentioned above propose their own solution to it.

Nevertheless, there are some characteristics of scheduling DM tasks that make the previous approaches inadequate. First of all, we lack an accurate analytical cost model for DM tasks. In the case of the Nimrod/G system, the parametric, exactly known cost of each job allows the system to foresee with a high degree of accuracy what the execution time of each job is going to be. This does not hold for DM, where the execution time of an algorithm in general depends on the input parameters in a non linear way, and also on the internal correlations of the dataset, so that, given the same algorithm, the same set of parameters and two datasets of identical dimensions, the execution time can vary by orders of magnitude. The same can be said for other performance metrics, such as memory requirements and I/O activity. The other characteristic is that scheduling a DM task in general implies scheduling both computation and data transfer. Traditional schedulers typically only address the first problem, i.e. scheduling computations. In the case of DM, since the datasets are typically big, it is also necessary to properly take into account the time needed to transfer data, and to consider when and whether it is worth moving data to a different location in order to optimize resource usage or overall completion time.
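A resource selection step that weighs computation and data movement together might look like the following sketch. The host and task figures are invented, and the execution time estimate stands in for the empirical cost models discussed in Section 3.1.

```python
# Sketch of a resource-selection step that accounts for both computation and
# data movement, the two aspects that standard Grid schedulers tend to treat
# separately. Estimated operation counts and bandwidths are placeholders for
# the empirical cost models discussed later in the document.

def completion_time(host, task):
    # data transfer is needed only if the dataset is not already on the host
    transfer = 0.0
    if task["data_location"] != host["name"]:
        transfer = task["data_size_mb"] / host["bandwidth_mb_s"]
    compute = task["estimated_ops"] / host["ops_per_s"]
    return host["queue_wait_s"] + transfer + compute

hosts = [
    {"name": "cow1", "ops_per_s": 2e9, "bandwidth_mb_s": 10.0, "queue_wait_s": 5.0},
    {"name": "cow2", "ops_per_s": 1e9, "bandwidth_mb_s": 100.0, "queue_wait_s": 0.0},
]
task = {"data_location": "cow1", "data_size_mb": 4000.0, "estimated_ops": 1e12}

best = min(hosts, key=lambda h: completion_time(h, task))
print(best["name"], completion_time(best, task))
# Moving 4 GB to another host may or may not pay off: the choice depends on
# data size, bandwidth and the (data-dependent) compute estimate.
```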
To sum up, the design of a Grid scheduler specific to Data Mining applications hinges on the ability to model the cost of Data Mining algorithms and on the ability to take into proper consideration the communications needed to handle really huge datasets.

3 Design of a High Performance and Distributed Data Mining Server

In this thesis we will focus on the performance issues of Parallel and Distributed Data Mining applications. Our goal is to obtain the design specification of a hardware/software architecture for DM applications. Our work will lead to an innovative system with respect to the following features:

• Generality. The system will not be targeted at one specific DM algorithm or application.

• Scalability. The system will be able to handle data volumes of the order of Terabytes.

• Adaptability. DM algorithms will be able to adapt to variable resource availability (memory and disk space).

• Distribution. DM algorithms will handle datasets whose location is by default distributed across several sites. Grid technologies will provide the necessary framework for the management of distributed resources.

Our work will be articulated along the lines of investigation illustrated in the following sections.

3.1 Data Mining Algorithms and Cost Models

During the last years we have studied the performance of several DM algorithms ([5], [39], [40], [6], [47]). Most notably, an in-depth analysis of an efficient algorithm for association mining led to the realization of the DCI algorithm, which at the moment of writing is one of the fastest such algorithms [39]. The main features of the DCI algorithm are that it is adaptable to the actual resource availability (such as the amount of memory available to the application), it is scalable, so that it can handle datasets whose size far exceeds that of the physical memory, and it is rather general, in that it efficiently mines frequent patterns in datasets with different internal properties.

The experience gained during the development of the DCI algorithm led to more general results about resource usage and requirements of DM algorithms. We are currently working on the definition of empirical cost models for DM algorithms. We recently proposed a methodology [41] aimed at obtaining experimental cost models by means of sampling, a technique traditionally adopted to obtain knowledge models at reduced computational cost [55], [45]. The idea is to apply the DM algorithm to a small sample of the input dataset, in order to get a hint of the algorithm's performance, both in terms of the quality of the results found and in terms of resource usage. The main problem is that, as already pointed out, in DM the algorithm performance depends on the unknown internal properties of the data, and this also holds for the small sample. More specifically, we do not have any a priori reason to think that the performance will scale linearly with the sample size. Our intuition is to find a statistical characterization of the sampled dataset (based on entropy calculation) that should allow us to define when a sample is good, i.e. it maintains the properties of the actual dataset, and when it is not (the updated status of this research can be checked at http://miles.cnuce.cnr.it/ palmeri/datam/sampling/simul).
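The sampling methodology can be pictured with the following sketch: the mining kernel is run on a small random sample, its resource usage is measured, and a statistical check decides whether the sample can be trusted. The simple entropy comparison used here is only a stand-in for the characterization under study; it is not the method of [41].

```python
# Sketch of the sampling-based approach to empirical cost models: run the
# mining kernel on a small random sample, record its resource usage, and use
# a statistical characterization of the sample to judge whether it can be
# trusted. The entropy comparison below is only a placeholder for the
# characterization under study, not the method of the cited work.
import math, random, time

def item_entropy(transactions):
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def profile_on_sample(dataset, mining_kernel, fraction=0.05, tolerance=0.1):
    sample = random.sample(dataset, max(1, int(len(dataset) * fraction)))
    start = time.time()
    result = mining_kernel(sample)
    elapsed = time.time() - start
    # accept the measurement only if the sample "looks like" the full dataset
    representative = (
        abs(item_entropy(sample) - item_entropy(dataset))
        <= tolerance * item_entropy(dataset)
    )
    return {"sample_time_s": elapsed, "representative": representative,
            "result_size": len(result)}

# toy usage with a dummy kernel that just returns the distinct items
data = [set(random.sample(range(50), 5)) for _ in range(10000)]
print(profile_on_sample(data, lambda ts: {i for t in ts for i in t}))
```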
3.2 Scheduling Data Mining Jobs on HPDC Architectures

This line of research is devoted to the definition of effective scheduling policies for mapping DM tasks with no dependencies onto the K-Grid, using the cost models obtained previously. We will focus both on local optimization strategies, at the level of a single cluster of workstations, and on global optimizations, at the inter-cluster level. We expect hardware resource usage optimizations, like memory and disks, to have a bigger impact on the local policies, whereas communications will play a more important role in the global ones. The main tool of investigation along this line will be simulation [37].

3.3 A Data Mining Server

This part of the research is performed in collaboration with prof. M. Zaki at the Rensselaer Polytechnic Institute, Troy, NY, USA. The goal is the realization of a complete Data Mining Server that will provide efficient access to data and a language framework for the implementation of Data Mining algorithms. It comprises:

• Development of data models suited for data mining algorithms. We plan to adopt an approach similar to the one adopted in the Monet DBMS [9], where a highly fragmented data model based on Binary Association Tables [10] is used. This model allows the standard relational model to be integrated with other models that are more effective for DM, like the vertical model for association mining (in association mining the two common data representations are the so-called horizontal one, where rows indicate records and columns indicate items, and the vertical one, where for each item we store the list of rows in which the item appears).

• Language structures and constructs for generalized Data Mining algorithms. Within the context of OO languages, we provide a set of templates and classes for the implementation of generic DM algorithms. For example, for association mining we provide a Pattern class, which can be instantiated on specific databases (for example of retail data, or of sequences, trees and more complex structures). The classical operations of association mining (like support count, subpattern generation, intersections, etc.) can then be provided to the programmer in a standardized way; a minimal sketch of such a class is given after this list.

• A local scheduler for the actual execution of the algorithms developed within the DM Server on a COW architecture. The algorithms developed within this DM Server will actually be transformed into small job requests to a scheduler managing the local resources (typically a COW) onto which the Server is running.
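A possible shape for the Pattern class mentioned in the list above is sketched below. The names and methods are illustrative only: the actual templates being developed for the DM Server are not reproduced here.

```python
# Illustrative sketch of a generic Pattern abstraction for association mining,
# in the spirit of the templates described above. The concrete classes being
# developed for the DM Server are not reproduced; names and methods here only
# illustrate the intended standardized interface.

class Pattern:
    """An itemset pattern over a retail-style transaction database."""

    def __init__(self, items):
        self.items = frozenset(items)

    def support_count(self, transactions):
        """Number of transactions containing the pattern."""
        return sum(1 for t in transactions if self.items <= t)

    def subpatterns(self):
        """Immediate sub-patterns obtained by dropping one item."""
        return [Pattern(self.items - {i}) for i in self.items]

    def intersect(self, other):
        """Common part of two patterns (used e.g. when merging candidates)."""
        return Pattern(self.items & other.items)

# The same interface could be instantiated on sequences, trees or other
# structured patterns by changing the containment test used in support_count.
db = [frozenset({"a", "b", "c"}), frozenset({"a", "c"}), frozenset({"b"})]
p = Pattern({"a", "c"})
print(p.support_count(db))                       # 2
print([sorted(sp.items) for sp in p.subpatterns()])
```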
4 Conclusions

Data Mining applications still pose many problems to the scientific community. Among them, issues related to the performance of the algorithms still limit their ability to effectively handle really huge data volumes, possibly distributed across several sites and with variable resource availability. In this thesis we claim that a detailed study of the performance requirements of Data Mining algorithms can lead to a quite general framework within which High Performance and Distributed DM applications can be developed. Our research is focused on the definition of general cost models for DM algorithms, which will allow us to devise scheduling strategies for DM applications able to optimize resource usage. Our work will refer to the commonly recognized main trends in HPC and large scale distributed computing, namely Clusters of Workstations and Computational Grids. Although the main goal of the thesis will be an architectural design of a Data Mining System where DM algorithms can be executed, we also plan to obtain a small scale working example of such a system.

Appendix: Study Plan (Piano di Studi)

Summary of the courses attended and of the exams taken by the end of the second year of the PhD program:

• Towards an Infrastructure for Pervasive Computing, course on Large Scale Distributed Systems, prof. F. Panzieri (University of Bologna, Italy).

• An Architecture for Web Usage Mining, course on Knowledge Discovery and Datamining, by prof. D. Pedreschi (University of Pisa, Italy), F. Giannotti (CNUCE-CNR, Italy), prof. J. Han (Simon Fraser University, Canada).

• Resource Management of Distributed Resources on Grids, course on Parallel Computing, prof. S. Orlando (University of Venice, Italy).

• Fingerprinting Techniques, course on Probabilistic Algorithms, prof. A. Clementi (University Tor Vergata, Rome, Italy).

• Course on Simulation, prof. L. Donatiello (University of Bologna, Italy), prof. S. Balsamo (University of Venice, Italy).

• Lambda Calculus, prof. A. Salibra (University of Venice, Italy).

For all the courses the exams were taken according to the procedures set by the lecturer, except for the Lambda Calculus course: I could not give the final seminar of the Lambda Calculus exam because of my departure for the United States, and, in agreement with the course lecturer, I will give the seminar upon my return to Italy. The material presented at the end of each course is available at http://www.dsi.unive.it/ palmeri

References

[1] D. Abramson, J. Giddy, I. Foster, and L. Kotler. High performance parametric modeling with Nimrod/G: Killer application for the global grid? In International Parallel and Distributed Processing Symposium, Cancun, Mexico, 2000.

[2] R. Agrawal, T. Imielinski, and A. Swami. Mining Associations between Sets of Items in Massive Databases. In Proc. of the ACM-SIGMOD 1993 Int'l Conf. on Management of Data, pages 207-216, Washington D.C., USA, 1993.

[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI Press, 1996.

[4] G. Aloisio, M. Cafaro, P. Falabella, C. Kesselman, and R. Williams. Grid computing on the web using the Globus toolkit. In Proc. HPCN Europe 2000, Amsterdam, Netherlands, Lecture Notes in Computer Science, N. 1823, pages 32-40. Springer-Verlag, 2000.

[5] R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation issues in the design of I/O intensive data mining applications on clusters of workstations. In Proc. of the 3rd Workshop on High Performance Data Mining, Cancun, Mexico. Springer-Verlag, 2000.

[6] R. Baraglia and P. Palmerini. SUGGEST: A web usage mining system. In Proceedings of the IEEE International Conference on Information Technology: Coding and Computing, 2002.

[7] P. Becuzzi, M. Coppola, and M. Vanneschi. Mining of association rules in very large databases: a structured parallel approach. In Proc. of Euro-Par, 1999.

[8] Francine Berman, Richard Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application level scheduling on distributed heterogeneous networks. In Proceedings of Supercomputing 1996, 1996.

[9] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. PhD thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

[10] P. A. Boncz and M. L. Kersten. MIL Primitives for Querying a Fragmented World. The VLDB Journal, 8(2):101-119, October 1999.

[11] J. P. Bradford and J. Fortes. Performance and memory access characterization of data mining applications. In Proceedings of the Workshop on Workload Characterization: Methodology and Case Studies, 1998.

[12] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, volume 26(2) of SIGMOD Record, pages 255-264, New York, May 13-15, 1997. ACM Press.

[13] Rajkumar Buyya, editor. High Performance Cluster Computing. Prentice Hall PTR, 1999.

[14] Rajkumar Buyya, David Abramson, and Jonathan Giddy. Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid. In The 4th International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2000), Beijing, China. IEEE Computer Society Press, 2000.

[15] M. Cannataro, D. Talia, and P. Trunfio. Design and development of distributed data mining applications on the knowledge grid. In Proceedings of High Performance and Distributed Computing, 2002.

[16] M. Cannataro and D. Talia. Knowledge Grid: An architecture for distributed knowledge discovery. Communications of the ACM, 2002.

[17] J. Darlington, M. Ghanem, Y. Guo, and H. W. To. Performance models for co-ordinating parallel data classification. In Proc. of the Seventh International Parallel Computing Workshop, 1997.

[18] Umeshwar Dayal, Qiming Chen, and Meichun Hsu. Large-scale data mining applications: Requirements and architectures. In Proceedings of the Workshop on Large-Scale Parallel KDD Systems, San Diego, CA, USA, August 15, 1999.

[19] I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory machines. In Proc. of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

[20] D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne. A worldwide flock of Condors: Load sharing among workstation clusters. Journal on Future Generations of Computer Systems, 12, 1996.

[21] Renato Ferreira, Gagan Agrawal, and Joel H. Saltz. Compiling object-oriented data intensive applications. In International Conference on Supercomputing, pages 11-21, 2000.

[22] Renato Ferreira, Tahsin M. Kurc, Michael Beynon, Chialin Chang, Alan Sussman, and Joel H. Saltz. Object-relational queries into multidimensional databases with the Active Data Repository. Parallel Processing Letters, 9(2):173-195, 1999.

[23] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl. J. Supercomputer Applications, 11(2):115-128, 1997.

[24] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999.

[25] I. Foster, C. Kesselman, and S. Tuecke. The Nexus task-parallel runtime system. In Proc. 1st Intl Workshop on Parallel Processing, pages 457-462. Tata McGraw Hill, 1994.

[26] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. Intl. J. Supercomputer Applications, 15(3), 2001.

[27] G. Galal, D. J. Cook, and L. B. Holder. Exploiting parallelism in knowledge discovery systems to improve scalability. In Proc. of the 31st Hawaii International Conference on System Sciences, 1998.

[28] Robert Grossman, Stuart Bailey, Balinder Mali Ramau, and Andrei Turinsky. The preliminary design of Papyrus: A system for high performance, distributed data mining over clusters. In Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, 2000.

[29] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.

[30] R. Jin and G. Agrawal. A middleware for developing parallel data mining applications. In Proc. of the 1st SIAM Conference on Data Mining, 2000.

[31] Jin-Soo Kim, Xiaohan Qin, and Yarsun Hsu. Memory characterization of a parallel data mining workload. In Proceedings of the Workshop on Workload Characterization: Methodology and Case Studies, 1998.

[32] M. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proceedings of IPPS/SPDP'98, 1998.

[33] Proceedings of the First SIAM Workshop on Data Mining and Counter Terrorism, 2002.

[34] http://www.top500.org.

[35] Chandrika Kamath. Data mining for science and engineering applications. In Proceedings of the First SIAM Conference, 2001.

[36] Junqiang Liu, Yunhe Pan, Ke Wang, and Jiawei Han. Mining frequent item sets by opportunistic projection. In SIGKDD, Edmonton, July 2002.

[37] M. Marzolla and P. Palmerini. Simulation of a grid scheduler for data mining. Exam report for the PhD course in Computer Science, Università Ca' Foscari, Venezia, 2002.

[38] Jesus Mena. Data Mining Your Website. Digital Press, United States of America, 1999.

[39] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Enhancing the Apriori algorithm for frequent set counting. Submitted to the SIAM conference, 2001.

[40] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. An efficient parallel and distributed algorithm for counting frequent sets. In Proceedings of the 5th International Conference on Vector and Parallel Processing, 2002.

[41] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Scheduling high performance data mining tasks on a data grid environment. In Proceedings of Euro-Par, 2002.

[42] J. S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM SIGMOD International Conference on Management of Data, pages 175-186, San Jose, California, 1995.

[43] Srinivasan Parthasarathy and Ramesh Subramonian. Facilitating data mining on a network of workstations. In Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, 2000.

[44] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proc. 2001 Int. Conf. on Data Mining (ICDM'01), November 2001.

[45] Foster J. Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In Knowledge Discovery and Data Mining, pages 23-32, 1999.

[46] N. Ramakrishnan and A. Y. Grama. Data Mining: From Serendipity to Science. IEEE Computer, 32(8):34-37, 1999.

[47] F. Romano. Parallelizzazione dell'algoritmo C4.5 per la costruzione di alberi decisionali. Tesi di Laurea, Università Ca' Foscari, Venezia, 2000.

[48] Jennifer M. Schopf. A general architecture for scheduling on the grid. Journal of Parallel and Distributed Computing, special issue on Grid Computing, 2002.

[49] D. B. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7, 1999.

[50] Anurag Srivastava, Eui-Hong Han, Vipin Kumar, and Vineet Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 3(3):237-261, 1999.

[51] K. Stoffel and A. Belkoniene. Parallel k-means clustering for large datasets. In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, editors, Euro-Par'99 Parallel Processing, Lecture Notes in Computer Science, No. 1685. Springer-Verlag, 1999.

[52] T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, 1999.

[53] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7, October 1999.

[54] M. J. Zaki, S. Parthasarathy, and W. Li. A localized algorithm for parallel association mining. In ACM Symposium on Parallel Algorithms and Architectures, pages 321-330, 1997.

[55] Mohammed J. Zaki, Srinivasan Parthasarathy, Wei Li, and Mitsunori Ogihara. Evaluation of sampling for data mining of association rules. In 7th International Workshop on Research Issues in Data Engineering, 1997.