High Performance Distributed Systems for Data Mining
Paolo Palmerini
Università Ca’ Foscari, Venice, Italy
[email protected]
November 19, 2002
Abstract

The set of algorithms and techniques used to extract interesting patterns and trends from huge data repositories is called Data Mining. Due to the typical complexity of the computations and the amount of data handled, controlling the performance of Data Mining algorithms is still an open problem. Despite the many important results that have been obtained so far for specific cases, a more general framework is needed for the development of Data Mining applications whose performance can be controlled effectively.
We propose a research activity aimed at the design of a hardware/software architecture for Data Mining. Such an architecture will be based on the generalization of the results found during the design of High Performance Data Mining algorithms, and will be aware of the current trends in High Performance Large Scale Distributed Computing Platforms, namely Clusters of Workstations and Computational Grids.
Although the thesis will be focused on the architecture design, we also plan to investigate the implementation issues of such a system, by means of simulations and with the deployment of a working small scale instance of the architecture.
Part of this research is done in collaboration with prof. Zaki at the Rensselaer Polytechnic Institute, Troy, NY, USA.

1 Introduction

The ability of extracting useful and non-trivial information from the huge amount of data that it is currently possible to collect and store in many and diverse fields of science and business is one of the challenges that computer science researchers are currently facing. The set of algorithms and techniques that were developed in the last decades to extract interesting patterns from huge data repositories is called Data Mining (DM). Such techniques are part of a bigger framework, referred to as Knowledge Discovery in Databases (KDD), which covers the whole process, from data preparation to knowledge modeling. Within this process, DM techniques and algorithms are the actual tools that analysts have at their disposal to find unknown patterns and correlations in the data.
Typical DM tasks are classification (assign each record of a database to one of a predefined set of classes), clustering (find groups of records that are close according to some defined metric) or association rules (determine implication rules for a subset of record attributes). A considerable number of algorithms have been developed to perform these and other tasks, coming from many fields of science, from machine learning to statistics through neural and fuzzy computing. What was a hand-tailored set of case-specific recipes about ten years ago is now recognized as a proper science [46].
It is sufficient to consider the remarkably wide spectrum of applications where DM techniques are currently being applied to understand the ever growing interest of the research community in this discipline. Among the traditional sciences we mention astronomy [35], high energy physics, biology and
medicine that have always provided a rich source of
applications to data miners. An important field of application for data mining techniques is also the World
Wide Web [38]. The Web provides the ability to access one of the largest data repositories, which in most cases still remains to be analyzed and understood. Recently, Data Mining techniques have also been applied to social sciences, homeland security and counter-terrorism [33].
Due to its relatively recent development, Data Mining still poses many challenges to the research community. New methodologies are needed to mine more interesting and specific information from the data, new frameworks are needed to harmonize more effectively all the steps of the KDD process, and new solutions will have to manage the complex and heterogeneous sources of information that are available to the analysts.
One of the problems that has always been claimed as one of the most important to address, but has never been solved in general terms, is the performance of DM algorithms. As a matter of fact, the complexity of such algorithms depends not only on external properties of the input data, like size, number of attributes, number of records, and so on, but also on internal properties of the data, such as correlations and other statistical features that can only be known at run time. This makes the problem of controlling the performance of DM algorithms extremely difficult.
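As a small illustration of this data dependence, the following sketch (a toy example of ours, not taken from any cited work) counts frequent item pairs in two synthetic datasets with identical external dimensions; the amount of work and the size of the result differ enormously because of internal correlation alone:

```python
import itertools
import random

def frequent_pairs(transactions, min_support):
    """Naive frequent-pair counting: the work grows with the number
    of frequent pairs, an internal property that is invisible from
    the dataset's external dimensions (rows x columns)."""
    counts = {}
    for t in transactions:
        for pair in itertools.combinations(sorted(t), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {p for p, c in counts.items() if c >= min_support}

random.seed(0)
n_rows, n_items, t_len = 1000, 100, 10

# Two datasets with identical external dimensions...
sparse = [random.sample(range(n_items), t_len) for _ in range(n_rows)]
# ...but very different internal correlation: here every transaction
# contains the same 10 items, so every pair becomes frequent.
dense = [list(range(t_len)) for _ in range(n_rows)]

print(len(frequent_pairs(sparse, 50)))  # few or no frequent pairs
print(len(frequent_pairs(dense, 50)))   # all 45 pairs are frequent
```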
This thesis is focused on the design of a High Performance and Distributed System for Data Mining. We will distinguish between a DM algorithm, which is a single DM kernel, a DM application, which is a more complex element whose execution in general involves the execution of several DM algorithms, and a DM System, which is the framework within which DM applications are executed. A DM System is therefore composed of a software environment that provides all the functionalities to compose DM applications, and a hardware back-end onto which the DM applications are executed.
In the rest of this document, a more precise definition of the characteristics of the proposed Data Mining System is given, together with a motivation for its realization and a survey of the main recent results obtained in this field.
2 State of the Art and Open Problems

2.1 Parallel and Distributed Data Mining
The performance of DM algorithms has always been a main concern for data miners. Just to mention one example, consider the remarkable number of algorithms that exist for solving one popular DM problem, the so-called Frequent Set Counting problem. Since its first introduction in 1993 by Agrawal [2], who contextually proposed the popular Apriori algorithm, a number of other algorithms have come to populate the scientific literature [42], [12], [29], [54], [36], [44], [3], [39].
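To fix ideas, the level-wise candidate-generation scheme introduced by Apriori can be sketched as follows (a deliberately naive rendering of ours, not the optimized formulation of [2] or of its successors):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Simplified Apriori: level-wise search for frequent itemsets.
    At level k, only candidates whose (k-1)-subsets are all frequent
    are counted against the database (the Apriori pruning property)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Pruning plus support counting.
        current = {c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))
                   and sum(c <= t for t in transactions) >= min_support}
        frequent |= current
        k += 1
    return frequent

txns = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
print(sorted(tuple(sorted(s)) for s in apriori(txns, 3)))
```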
As far as performance is concerned, parallel High Performance Computing platforms have always constituted a natural target architecture for such algorithms. In [53] M. J. Zaki reviews most of the efforts in the field of parallel association mining, by analyzing different approaches and strategies for parallelization. Decision tree construction is in general not a trivial task to parallelize. Notable results are reported in [50] and [32]. Clustering algorithms generally present a structure which is easier to parallelize. Several parallelizations of the popular k-means algorithm have been proposed on distributed memory architectures [19], on large PC clusters [51], and on clusters of SMPs [5].
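The reason k-means parallelizes so naturally is that each iteration reduces to independent partial sums over local data blocks plus one small global reduction, as in the distributed-memory formulations cited above. A minimal sketch of one such iteration (the function names and data layout are ours, purely illustrative):

```python
def assign_and_partial_sums(block, centroids):
    """Work done independently on each node's local data block:
    assign points to the nearest centroid and accumulate per-cluster
    sums and counts. Only these small partial results need to be
    exchanged among nodes, never the data itself."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in block:
        j = min(range(k), key=lambda c: sum(
            (p[d] - centroids[c][d]) ** 2 for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def kmeans_step(blocks, centroids):
    """Global reduction: merge the partial sums from all blocks and
    recompute the centroids (a single MPI_Allreduce-style collective
    in a real distributed implementation)."""
    k, dim = len(centroids), len(centroids[0])
    tot_sums = [[0.0] * dim for _ in range(k)]
    tot_counts = [0] * k
    for block in blocks:  # each block is processed in parallel on a COW
        sums, counts = assign_and_partial_sums(block, centroids)
        for j in range(k):
            tot_counts[j] += counts[j]
            for d in range(dim):
                tot_sums[j][d] += sums[j][d]
    new_centroids = []
    for j in range(k):
        if tot_counts[j]:
            new_centroids.append([s / tot_counts[j] for s in tot_sums[j]])
        else:
            new_centroids.append(list(centroids[j]))  # keep empty clusters
    return new_centroids

blocks = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
print(kmeans_step(blocks, [(0.0, 0.0), (10.0, 10.0)]))
```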
Regardless of this considerable amount of work related to the performance and the efficiency of DM algorithms, very few general results can be outlined so far. As in the case of the cited FSC problem, there is no evident best solution. Rather, different algorithms perform differently depending on the input data, the target architecture and the user defined parameter values.
There is the need for detailed analytical performance models that take into account all the factors that have an influence on resource usage (CPU, memory and disks), in order to devise adaptive techniques that allow the adoption of the best known solution for the specific case.
Some work has already been started along the lines of performance modeling. In [49], David Skillicorn argues that benchmarking and implementations are very expensive approaches to performance debugging of parallel DM algorithms. He proposes a number of cost-effective alternative measures (counting computations, data accesses, and communication). These measures can provide a reasonably accurate picture of an application’s performance. In [31] and [11] there is an analysis of resource usage and workload characterization for DM algorithms.
The observation that many different Data Mining algorithms share common structure and properties has been pointed out in many works [30, 49, 17, 7]. Nevertheless, a unification of the partial results found on single algorithms is still an open problem.
The main lines of research were conducted at the language level, as in [43], where Parthasarathy and Subramonian present a language construct (a SIMD DOALL) for the design of parallel Data Mining programs. In [22] and [21], Saltz et al. introduce a set of language extensions and a prototype compiler for supporting high-level object-oriented programming of data intensive reduction operations over multidimensional data, using a run-time system called Active Data Repository.
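The flavor of such a data-parallel reduction construct can be conveyed by a small sketch; `doall_reduce` and its signature are hypothetical, invented here for illustration, and are not the actual construct of [43] nor the ADR interface:

```python
def doall_reduce(data, chunks, local_op, merge_op, init):
    """DOALL-style reduction skeleton: apply local_op independently
    to each data chunk (the part that runs in parallel), then merge
    the partial results with merge_op. Names and signature are
    illustrative only."""
    size = max(1, len(data) // chunks)
    partials = [local_op(data[i:i + size], init())
                for i in range(0, len(data), size)]
    out = init()
    for p in partials:
        out = merge_op(out, p)
    return out

# Example use: distributed support counting for single items.
def count_items(block, acc):
    for t in block:
        for i in t:
            acc[i] = acc.get(i, 0) + 1
    return acc

def merge_counts(a, b):
    for i, c in b.items():
        a[i] = a.get(i, 0) + c
    return a

db = [{1, 2}, {1, 3}, {1, 2, 3}, {2}]
print(doall_reduce(db, chunks=2, local_op=count_items,
                   merge_op=merge_counts, init=dict))
```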
Some further efforts have also been devoted to the definition of complete DM systems that include all the aspects related to distributed knowledge discovery: from handling distributed data, to applying parallel processing for pattern identification.
Papyrus [28] is the most complete such project. Clusters of data and compute hosts form the whole system, which is distributed over a wide area network. Performance characteristics of inter-cluster and intra-cluster communications are considered in determining whether data, models or results are to be transferred in order to achieve high efficiency. On top of the clustered hosts, a layered architecture is built, composed of a set of tools devised to facilitate the local Data Mining and wide area combining process.
A different approach, more domain specific, is the one of the SUBDUE system. SUBDUE [27] is a Knowledge Discovery System for structural databases. Parallel and distributed implementations of the system are described and discussed in [27]. As for the investigation of architectures for DM, we can mention [18].
It is worth mentioning that all these projects seem to have been abandoned in the last years, while a great deal of attention is still paid to performance and generalization of results. We think that one limitation of such previous projects was not to have properly considered the current trends in distributed architectures. Therefore, many solutions had to be found from scratch, resulting in a global lack of effectiveness of the proposed systems. This is probably the case for the Papyrus system, whose layered architecture presents more than one point of similarity with computational Grids (see below for an introduction to Grids).
We argue that awareness of the features of modern HPC and large scale distributed architectures, together with an in-depth analysis of the performance costs of DM algorithms, will lead to the realization of a general DM system, able to actually scale to arbitrary data sizes, adaptive to different hardware characteristics, and effective in handling inherently distributed data.
2.2 The architectural framework
Clusters of workstations (COWs) are now a widespread platform for High Performance Computing [13] [52]. Due to the performance achieved by commodity hardware components and open source operating systems, it is not an exception to find Linux based COWs among the top ten most powerful machines on earth [34]. Our concern with performance will therefore lead us to the development of solutions specifically targeted at COW-based platforms.
Another crucial architectural constraint imposed by data mining applications is the inherently distributed nature of data. Such data cannot in general, either for privacy or feasibility reasons, be gathered at a single site. Therefore the natural architecture for the development of DM applications would be a distributed one. Large scale distributed computing platforms have recently been described within a unified paradigm called the Grid [24]. In the words of I.
Foster, one of the fathers of the Grid concept [26]:

The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations (VO). The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering.

Among such strategies, Data Mining is one of the most challenging.
A number of middleware systems for the deployment of actual Grids have been developed in the last five years. Among them, the most successful are the Globus Toolkit [23], Nexus [25] and Condor [20].
We can consider Parallel DM (PDM) and Distributed DM (DDM), and in particular Grid-aware DDM, as results of the natural evolution of DM technologies.
Recently, a framework for such applications on Grid platforms has been proposed as the Knowledge Grid (K-Grid) [16]. The K-Grid is a middleware for distributed KDD. It is composed of two layers. At the bottom there is the layer of core services, implemented over standard grid middleware, like Globus. A set of higher level services provides specialized functions for the Knowledge Discovery process. These services can be used to construct complex Problem Solving Environments, which exploit Data Mining kernels as basic software components that can be applied one after the other, in a modular way [15].
One important issue concerning grid computing is resource management. Current grid technology [23] simply provides the tools to implement the management of resources. On top of these tools, some prototype resource brokers have been implemented [14] [4], but the efforts in this direction still need to be continued. A general DM task on the K-Grid can therefore be described as a Directed Acyclic Graph (DAG) whose nodes are the DM algorithms being applied, and whose links represent data dependencies among the components. K-Grid users interact with the K-Grid by composing and submitting DAGs, i.e. the application of a set of DM kernels to a set of datasets. For example, we can perform an initial clustering on a given dataset in order to extract groups of homogeneous records, and then look for association rules within each cluster.
In this scenario, one important service of the K-Grid is the one in charge of mapping task requests onto physical resources. The user will in fact have a transparent view of the system and possibly little or no knowledge of the physical resources where the computations will be executed, nor of where the data actually reside. The only thing the user must be concerned with is the semantics of the application, i.e. what kind of analysis he or she wants to perform and on which data.
Many efforts have already been devoted to the problem of scheduling distributed jobs on Grid platforms. Some of such schedulers are Nimrod-g [1], Condor [20] and AppLeS [8]. Recently, a general architecture for grid schedulers has been outlined in [48] by J. Schopf. She describes three main phases of the activity of a grid scheduler. The first phase is devoted to resource discovery: a set of candidate machines where the application can be executed is built, by selecting the machines to which the user has sufficient access privileges and which at the same time satisfy some minimum requirements expressed by the user. In the second phase one specific resource is selected among the ones determined in the previous phase. This choice is performed based on information about system status - e.g. machine loads, network traffic - and again possible user requirements in terms of execution deadline or limited budget. Finally, the third phase is devoted to actual job execution, from reservation to completion, through submission and monitoring.
The second phase described above is the most challenging, since it is strictly application dependent. Many of the schedulers mentioned above propose their own solution to the problem. Nevertheless, there are some characteristics of scheduling DM tasks that make the previous approaches inadequate.
First of all, we lack an accurate analytical cost model for DM tasks. In the case of the Nimrod-g
system, the parametric, exactly known cost of each job allows the system to foresee with a high degree of accuracy the execution time of each job. This does not hold for DM, where the execution time of an algorithm in general depends on the input parameters in a non-linear way, and also on the internal correlations of the dataset, so that, given the same algorithm, the same set of parameters and two datasets of identical dimensions, the execution time can vary by orders of magnitude. The same can be said for other performance metrics, such as memory requirements and I/O activity.
The other characteristic is that scheduling a DM task in general implies scheduling both computation and data transfer. Traditional schedulers typically only address the first problem, i.e. scheduling computations. In the case of DM, since the datasets are typically big, it is also necessary to properly take into account the time needed to transfer data, and to consider when and whether it is worth moving data to a different location in order to optimize resource usage or overall completion time.
To summarize, the design of a Grid scheduler specific for Data Mining applications is related to the ability of modeling the cost of Data Mining algorithms and the ability to take into proper consideration the communications needed to handle really huge datasets.

3 Design of a High Performance and Distributed Data Mining Server

In this thesis we will focus on the performance issues of Parallel and Distributed Data Mining applications. Our goal is to obtain the design specification of a hardware/software architecture for DM applications.
Our work will lead to an innovative system with respect to the following features:

• Generality. The system will not be targeted at one specific DM algorithm or application.

• Scalability. The system will be able to handle data volumes of the order of Terabytes.

• Adaptability. DM algorithms will be able to adapt to variable resource availability (memory and disk space).

• Distributed. DM algorithms will handle datasets whose location is by default distributed across several sites. Grid technologies will provide the necessary framework for the management of distributed resources.

Our work will be articulated in the lines of investigation illustrated in the following sections.

3.1 Data Mining Algorithms and Cost Models

During the last years we have studied the performance of several DM algorithms ([5], [39], [40], [6], [47]). Most notably, an in-depth analysis of an efficient algorithm for association mining led to the realization of the DCI algorithm, which at the moment of writing is one of the fastest such algorithms [39]. The main features of the DCI algorithm are that it is adaptive to the actual resource availability (like the amount of memory available to the application), it is scalable, so that it can handle datasets whose size far exceeds that of the physical memory, and it is rather general, in that it efficiently mines frequent patterns in datasets with different internal properties.
The experience gained during the development of the DCI algorithm led to more general results about resource usage and requirements of DM algorithms. We are currently working on the definition of empirical cost models for DM algorithms. We recently proposed a methodology [41] aimed at obtaining experimental cost models by means of sampling, a technique traditionally adopted to obtain knowledge models at reduced computational cost [55], [45]. The idea is to apply the DM algorithm to a small sample of the input dataset, in order to get a hint of the algorithm’s performance, both in terms of the quality of the results found and in terms of resource usage. The main problem is that, as already pointed out, DM algorithm performance depends on unknown internal properties of the data, and this also holds for the small sample. More specifically, we do not have any a priori reason to think that the performance will scale linearly with the sample size. Our intuition is to find a statistical characterization of the sampled dataset (based on entropy calculation) that should allow us to define when a sample is good, i.e. when it maintains the properties of the actual dataset, and when it is not¹.
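A minimal sketch of this idea, using the entropy of the per-item frequency distribution as the statistical fingerprint (a simplifying assumption of ours; the actual characterization under study may differ):

```python
import math
import random

def item_entropy(transactions, n_items):
    """Empirical entropy of the per-item occurrence frequencies: a
    cheap statistical fingerprint of a dataset's internal structure
    (a simplified stand-in for the characterization studied in the
    sampling methodology)."""
    counts = [0] * n_items
    total = 0
    for t in transactions:
        for i in t:
            counts[i] += 1
            total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

random.seed(1)
full = [random.sample(range(50), 5) for _ in range(20000)]
sample = random.sample(full, 500)

# A sample is "good" when it preserves the statistical properties of
# the full dataset; here the two entropies are close, so performance
# (and result quality) measured on the sample is a credible hint.
print(abs(item_entropy(full, 50) - item_entropy(sample, 50)) < 0.05)
```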
3.2 Scheduling Data Mining Jobs on HPDC Architectures

This line of research is devoted to the definition of effective scheduling policies for mapping DM tasks with no dependencies onto the K-Grid, using the cost models obtained previously. We will focus both on local optimization strategies, at the level of a single cluster of workstations, and on global optimizations, at the inter-cluster level.
We expect hardware resource usage optimizations, like memory and disks, to have a bigger impact on the local policies, whereas communications will play a more important role in the global ones.
The main tool of investigation along this line will be simulation [37].

3.3 A Data Mining Server

This part of the research is performed in collaboration with prof. M. Zaki at the Rensselaer Polytechnic Institute, Troy, NY, USA. Its goal is the realization of a complete Data Mining Server that will provide efficient access to data and a language framework for the implementation of Data Mining algorithms:

• Development of data models suited for data mining algorithms. We plan to adopt an approach similar to the one adopted in the Monet DBMS [9], where a highly fragmented data model based on Binary Association Tables [10] is adopted. This model allows the integration of the standard relational model with other models more effective for DM, like the vertical model for association mining².

• Language structures and constructs for generalized Data Mining algorithms. Within the context of OO languages, we will provide a set of templates and classes for the implementation of generic DM algorithms. For example, for Association Mining we will provide a Pattern class, which can be instantiated on specific databases (for example of retail data, or of sequences of trees and more complex structures). Then the classical operations of association mining (like support counting, subpattern generation, intersections, etc.) can be provided to the programmer in a standardized way.

• Local scheduler for the actual execution of the algorithms developed within the DM Server on a COW architecture. The algorithms developed within this DM Server will actually be transformed into small job requests to a scheduler which manages the local resources (typically a COW) onto which the Server is running.

4 Conclusions

Data Mining applications still pose many problems to the scientific community. Among them, issues related to the performance of such algorithms still limit their ability to effectively handle really huge data volumes, possibly distributed across several sites and with variable resource availability.
In this thesis we claim that a detailed study of the performance requirements of Data Mining algorithms can lead to a quite general framework within which High Performance and Distributed DM applications can be developed.
Our research is focused on the definition of general cost models for DM algorithms, which will allow us to devise scheduling strategies for DM applications able to optimize resource usage. Our work will refer to commonly recognized main trends in HPC and large scale distributed computing, namely Clusters of Workstations and Computational Grids.
Although the main goal of the thesis will be an architectural design of a Data Mining System where DM algorithms can be executed, we also plan to obtain a small scale working example of such a system.

¹ Updated status of this research can be checked out at http://miles.cnuce.cnr.it/ palmeri/datam/sampling/simul
² In association mining the two common data representations used are the so-called horizontal one, where rows indicate records and columns indicate items, and the vertical one, where for each item we store the list of rows in which the item appears.

Appendix: Study Plan

Summary of the courses attended and the exams taken by the end of the second year of the PhD program:

• Towards an Infrastructure for Pervasive Computing, course on Large Scale Distributed Systems, prof. F. Panzieri (University of Bologna, Italy).

• An Architecture for Web Usage Mining, course on Knowledge Discovery and Datamining, by prof. D. Pedreschi (University of Pisa, Italy), F. Giannotti (CNUCE-CNR, Italy), prof. J. Han (Simon Fraser University, Canada).

• Resource Management of Distributed Resources on Grids, course on Parallel Computing, prof. S. Orlando (University of Venice, Italy).

• Fingerprinting Techniques, course on Probabilistic Algorithms, prof. A. Clementi (University Tor Vergata, Rome, Italy).

• Simulation course, prof. L. Donatiello (University of Bologna, Italy), prof. S. Balsamo (University of Venice, Italy).

• Lambda Calculus, prof. A. Salibra (University of Venice, Italy).

For all the courses the exams were taken according to the modalities established by the lecturer, except for the Lambda Calculus course. I could not give the concluding seminar of the Lambda Calculus exam because of my departure for the United States. In agreement with the course lecturer, I will give the seminar upon my return to Italy.
The material presented for the conclusion of each course is available at http://www.dsi.unive.it/ palmeri

References

[1] D. Abramson, J. Giddy, I. Foster, and L. Kotler. High performance parametric modeling with Nimrod/G: Killer application for the global grid? In International Parallel and Distributed Processing Symposium, Cancun, Mexico, 2000.

[2] R. Agrawal, T. Imielinski, and A. Swami. Mining Associations between Sets of Items in Massive Databases. In Proc. of the ACM-SIGMOD 1993 Int’l Conf. on Management of Data, pages 207–216, Washington D.C., USA, 1993.

[3] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996.

[4] G. Aloisio, M. Cafaro, P. Falabella, C. Kesselman, and R. Williams. Grid computing on the web using the Globus toolkit. In Proc. HPCN Europe 2000, Amsterdam, Netherlands, Lecture Notes in Computer Science, N. 1823, pages 32–40. Springer-Verlag, 2000.

[5] R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation issues in the design of I/O intensive data mining applications on clusters of workstations. In Proc. of the 3rd Workshop on High Performance Data Mining, Cancun, Mexico. Springer-Verlag, 2000.

[6] R. Baraglia and P. Palmerini. Suggest: A web usage mining system. In Proceedings of the IEEE International Conference on Information Technology: Coding and Computing, 2002.

[7] P. Becuzzi, M. Coppola, and M. Vanneschi. Mining of association rules in very large databases: a structured parallel approach. In Proc. of Euro-Par, 1999.
[8] Francine Berman, Richard Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application level scheduling on distributed heterogeneous networks. In Proceedings of Supercomputing 1996, 1996.

[9] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. PhD thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.

[10] P. A. Boncz and M. L. Kersten. MIL Primitives for Querying a Fragmented World. The VLDB Journal, 8(2):101–119, October 1999.

[11] J. P. Bradford and J. Fortes. Performance and memory access characterization of data mining applications. In Proceedings of the Workshop on Workload Characterization: Methodology and Case Studies, 1998.

[12] Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, volume 26,2 of SIGMOD Record, pages 255–264, New York, May 13–15 1997. ACM Press.

[13] Rajkumar Buyya, editor. High Performance Cluster Computing. Prentice Hall PTR, 1999.

[14] Rajkumar Buyya, David Abramson, and Jonathan Giddy. Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid. In The 4th International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2000), Beijing, China. IEEE Computer Society Press, USA, 2000.

[15] M. Cannataro, D. Talia, and P. Trunfio. Design and development of distributed data mining applications on the knowledge grid. In Proceedings of High Performance and Distributed Computing, 2002.

[16] M. Cannataro and D. Talia. Knowledge grid: An architecture for distributed knowledge discovery. Communications of the ACM, 2002.

[17] J. Darlington, M. Ghanem, Y. Guo, and H. W. To. Performance models for co-ordinating parallel data classification. In Proc. of the Seventh International Parallel Computing Workshop, 1997.

[18] Umeshwar Dayal, Qiming Chen, and Meichun Hsu. Large-scale data mining applications: Requirements and architectures. In Proceedings of the Workshop on Large-Scale Parallel KDD Systems, August 15th, 1999, San Diego, CA, USA, 1999.

[19] I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory machines. In Proc. of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

[20] D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne. A worldwide flock of Condors: Load sharing among workstation clusters. Journal on Future Generations of Computer Systems, 12, 1996.

[21] Renato Ferreira, Gagan Agrawal, and Joel H. Saltz. Compiling object-oriented data intensive applications. In International Conference on Supercomputing, pages 11–21, 2000.

[22] Renato Ferreira, Tahsin M. Kurc, Michael Beynon, Chialin Chang, Alan Sussman, and Joel H. Saltz. Object-relational queries into multidimensional databases with the active data repository. Parallel Processing Letters, 9(2):173–195, 1999.

[23] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Intl. J. Supercomputer Applications, 11(2):115–128, 1997.

[24] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann, 1999.

[25] I. Foster, C. Kesselman, and S. Tuecke. The Nexus task-parallel runtime system. In Proc. 1st Intl. Workshop on Parallel Processing, pages 457–462. Tata McGraw Hill, 1994.

[26] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Intl. J. Supercomputer Applications, 15(3), 2001.

[27] G. Galal, D. J. Cook, and L. B. Holder. Exploiting parallelism in knowledge discovery systems to improve scalability. In Proc. of the 31st Hawaii International Conference on System Sciences, 1998.

[28] Robert Grossman, Stuart Bailey, Balinder Mali Ramau, and Andrei Turinsky. The preliminary design of Papyrus: A system for high performance, distributed data mining over clusters. In Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, 2000.

[29] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1–12, Dallas, Texas, USA, 2000.

[30] R. Jin and G. Agrawal. A middleware for developing parallel data mining applications. In Proc. of the 1st SIAM Conference on Data Mining, 2000.

[31] Jin-Soo Kim, Xiaohan Qin, and Yarsun Hsu. Memory characterization of a parallel data mining workload. In Proceedings of the Workshop on Workload Characterization: Methodology and Case Studies, 1998.

[32] M. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proceedings of IPPS/SPDP’98, 1998.

[33] Proceedings of the First SIAM Workshop on Data Mining and Counter Terrorism, 2002.

[34] http://www.top500.org.

[35] Chandrika Kamath. Data mining for science and engineering applications. In Proceedings of the First SIAM Conference, 2001.

[36] Junqiang Liu, Yunhe Pan, Ke Wang, and Jiawei Han. Mining frequent item sets by opportunistic projection. In SIGKDD, Edmonton, July 2002.

[37] M. Marzolla and P. Palmerini. Simulation of a grid scheduler for data mining. PhD course report, Università Ca’ Foscari, Venezia, 2002.

[38] Jesus Mena. Data Mining Your Website. Digital Press, United States of America, 1999.

[39] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Enhancing the Apriori algorithm for frequent set counting. Submitted to the SIAM conference, 2001.

[40] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. An efficient parallel and distributed algorithm for counting frequent sets. In Proceedings of the 5th International Conference on Vector and Parallel Processing, 2002.

[41] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Scheduling high performance data mining tasks on a data grid environment. In Proceedings of Euro-Par, 2002.

[42] J. S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM SIGMOD International Conference on Management of Data, pages 175–186, San Jose, California, 1995.

[43] Srinivasan Parthasarathy and Ramesh Subramonian. Facilitating data mining on a network of workstations. In Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, 2000.

[44] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proc. 2001 Int. Conf. on Data Mining (ICDM’01), November 2001.

[45] Foster J. Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In Knowledge Discovery and Data Mining, pages 23–32, 1999.

[46] N. Ramakrishnan and A. Y. Grama. Data Mining: From Serendipity to Science. IEEE Computer, 32(8):34–37, 1999.

[47] F. Romano. Parallelization of the C4.5 algorithm for the construction of decision trees. Laurea thesis, Università Ca’ Foscari, Venezia, 2000.

[48] Jennifer M. Schopf. A general architecture for scheduling on the grid. Journal of Parallel and Distributed Computing, special issue on Grid Computing, 2002.

[49] D. B. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7, 1999.

[50] Anurag Srivastava, Eui-Hong Han, Vipin Kumar, and Vineet Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery, 3(3):237–261, 1999.

[51] K. Stoffel and A. Belkoniene. Parallel k-means clustering for large datasets. In P. Amestoy, P. Berger, M. Daydé, I. Duff, V. Frayssé, L. Giraud, and D. Ruiz, editors, EuroPar’99 Parallel Processing, Lecture Notes in Computer Science, No. 1685. Springer-Verlag, 1999.

[52] T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, 1999.

[53] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7, Oct. 1999.

[54] M. J. Zaki, S. Parthasarathy, and W. Li. A localized algorithm for parallel association mining. In ACM Symposium on Parallel Algorithms and Architectures, pages 321–330, 1997.

[55] Mohammed J. Zaki, Srinivasan Parthasarathy, Wei Li, and Mitsunori Ogihara. Evaluation of sampling for data mining of association rules. In 7th International Workshop on Research Issues in Data Engineering, 1997.