TECHNIA
International Journal of Computing Science and Communication Technologies, VOL. 3, NO. 1, July 2010. (ISSN 0974-3375)
A framework for optimizing the performance of
peer-to-peer distributed data mining algorithms
E. Anupriya1 & N.Ch.S.N.Iyengar2
1 School of Computing Sciences and Engineering,
VIT University, Vellore, TN 632 014, India
[email protected], [email protected]
Abstract

Peer-to-Peer distributed data mining is an emerging paradigm in distributed computing. The objective of this paper is to present a broad study of existing peer-to-peer data mining algorithms and computational challenges, and to identify the optimization factors. In particular, we focus on reducing the data set size on every peer using feature selection, one of the core optimization factors for reducing application run time. In connection with this, we propose a preliminary architecture, the Peer Optimized Data Mining (PODM) architecture. Here, data reduction is viewed as a separate, significant process to reduce the data set size at each peer node. The Pre-processing Manager (PPM) in the architecture works towards automatic feature selection and forwards the data to the Data Reduct Manager (DR). We show that the percentage of features reduced has a direct impact on the percentage of comparisons made in the data set.
Keywords: peer-to-peer distributed data mining (P2PDDM), Peer Optimized Data Mining (PODM), Feature Selection.

I. Introduction

The advent of new technologies like pervasive, distributed and ubiquitous computing has made data accessible virtually from anywhere, at any time. Analysis of such highly distributed data and discovery of data patterns is a highly complex task.

Distributed Data Mining (DDM) deals with the analysis of data patterns in environments with distributed data, computing nodes and users. The advent of high-speed networks and inexpensive hardware devices has contributed to the fast growth of server-less Peer-to-Peer (P2P) networks. Together, these peers store large volumes of data collected from different sources. This distributed data, upon mining, may yield useful data patterns or results. The primary objective of Peer-to-Peer (P2P) data mining is to perform the data mining functionality collaboratively among the distributed peers, rather than transferring the data to a centralized site and performing the mining there. The data mining functionality may be classification, clustering or association rule mining, depending on the application domain. Peer-to-Peer (P2P) DDM spawns a new breed of applications such as collaborative decision making, social network analysis and surveillance using sensor networks.

In this paper, we present a broad study of Peer-to-Peer (P2P) DDM algorithms and focus on optimization factors and methods to improve peer-to-peer distributed data mining. Section 2 describes the various peer-to-peer computational challenges. Section 3, related work, discusses the existing peer-to-peer distributed data mining algorithms. Section 4 deals with the identification and enumeration of optimization factors; data reduction and feature extraction are also dealt with in this section. Section 5 explains the Peer Optimized Data Mining architecture. Section 6 concludes our work.

II. Peer-to-Peer (P2P) Computational Environment Challenges

Tremendous work has been carried out in the distributed data mining area, but those algorithms are designed for stable networks and data. Hence, extending DDM algorithms to suit the P2P computational environment requires careful modification, as peers can join or leave the peer group at any time, or hold different data caused by failure and recovery of peers. Peer-to-Peer (P2P) DM algorithms impose the need for the following operational characteristics.
Scalable
Peer-to-Peer (P2P) DM algorithms should be scalable in
terms of handling varying data size or varying number of
peers as the peers may join or leave the peer group at any
time.
Minimal Communication Overhead
Peer-to-Peer (P2P) DM algorithms should communicate efficiently with short messages, to minimize the overhead of communication failures and of large data or message transfers.
Incremental
Peer-to-Peer (P2P) DM algorithms should be
incremental to handle variations in data set rather than
starting the mining process from beginning.
Minimal Data Exchange
Algorithms running on peer nodes should exchange minimal data to keep the algorithms localized, as global synchronization is impossible in systems as large as P2P networks.
Fault Tolerance
Peer-to-Peer (P2P) DM algorithms should be fault tolerant, as failure of peers, loss of data, or changes in data due to failures are quite common in a highly dynamic P2P environment.
III. Related Works
The analysis starts from how data are stored and managed in P2P systems. The Local Relational Model (LRM) [6] is a data model specifically designed for P2P applications. Since most real-world database systems are either relational or object-relational in nature, the LRM assumes, for simplicity, that all nodes in the P2P network are relational databases. LRM aims to allow inconsistent databases and to support interoperability in the absence of a global schema by establishing coordination among the peers. Coordination formulas define semantic dependencies between two databases. However, it fails to address the underlying communication protocol, automatic derivation of domain relations, and the domain mapping logic.
Data Mining functionalities include Characterization,
Discrimination, Association Rule mining, Classification
and Clustering. The problem of association rule mining in P2P systems is challenging due to the dynamic nature of P2P networks [3]. The LSD-ARM (Large-Scale Distributed Association Rule Mining) algorithm comprises two independent components. The first is an ARM algorithm which traverses the local database and maintains the current result. The second is a majority voting protocol in which each node participates in the voting process. All rules with confidence above the threshold are discovered and combined.
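The majority-voting step can be illustrated with a small sketch. This is not the actual LSD-ARM protocol — the distributed vote collection is simplified here to a list of per-peer booleans, one per rule:

```python
def majority_vote(rule_votes):
    """Accept a rule if a strict majority of peers voted for it.

    rule_votes maps rule -> list of booleans, one vote per peer
    (True when the rule's local confidence exceeds the threshold).
    """
    accepted = []
    for rule, votes in rule_votes.items():
        if sum(votes) * 2 > len(votes):  # strict majority
            accepted.append(rule)
    return accepted

votes = {
    "bread->butter": [True, True, False],   # 2 of 3 peers agree
    "milk->eggs":    [False, True, False],  # only a minority agrees
}
print(majority_vote(votes))  # ['bread->butter']
```

In the real protocol the votes arrive asynchronously over the network and the result converges over time; only the final decision rule is shown here.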
A central problem in data mining is classification. The Newscast Model of Computation [5] is proposed for P2P overlay networks, typically for information dissemination and file sharing. This distributed computation model allows effective calculation of basic statistics such as basic averaging, weighted averaging and cumulative averaging, and the paper demonstrates how to implement data mining algorithms based on these techniques. Naïve Bayes is used to illustrate distributed classification, and the authors note that calculating ratios of counts is sufficient to obtain the conditional probabilities.
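The count-ratio observation can be sketched as follows. The dictionary-based count exchange is an illustrative assumption, not the Newscast implementation — it only shows that per-peer counts can be summed and that conditional probabilities fall out as ratios:

```python
from collections import defaultdict

def aggregate_counts(peer_counts):
    """Sum per-peer (class, feature-value) counts into global counts."""
    total = defaultdict(int)
    for counts in peer_counts:
        for key, n in counts.items():
            total[key] += n
    return total

def conditional_prob(total, cls, feature_value, class_totals):
    """P(feature_value | cls) as a ratio of aggregated counts."""
    return total[(cls, feature_value)] / class_totals[cls]

# two peers count occurrences of feature value 'x' within class 'A'
peer_counts = [{("A", "x"): 3}, {("A", "x"): 1}]
total = aggregate_counts(peer_counts)
class_totals = {"A": 8}  # total class-'A' instances across all peers
print(conditional_prob(total, "A", "x", class_totals))  # 0.5
```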
The ensemble paradigm for distributed classification in P2P networks [1] discusses building local classifiers and integrating the results globally. Under this paradigm, each peer builds its local classifiers on the local data, and the results from all local classifiers are then combined by plurality voting. To build local classifiers, the authors adopt the learning algorithm of pasting bites to generate multiple local classifiers on each peer based on the local data. To combine local results, they propose a general form of Distributed Plurality Voting (DPV) protocol for dynamic P2P networks.
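A minimal sketch of the combination rule: the real DPV protocol runs distributed and tolerates peer churn, whereas this only shows the plurality decision applied once all local predictions for one instance have been gathered:

```python
from collections import Counter

def plurality_vote(local_predictions):
    """Combine per-peer class predictions for one instance by plurality."""
    counts = Counter(local_predictions)
    winner, _ = counts.most_common(1)[0]
    return winner

# three peers classify the same instance with their local classifiers
print(plurality_vote(["spam", "ham", "spam"]))  # spam
```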
Decentralization of algorithms and distributed data mining applications in the context of P2P networks is an interesting direction [8]. The work describes both exact and approximate distributed data mining algorithms that work in a decentralized manner. In particular, it illustrates these approaches for the problem of computing and monitoring clusters in the data residing at the different nodes of a Peer-to-Peer network, and it takes the lead in approximately solving the k-means clustering problem in a Peer-to-Peer network.
HP2PC: Scalable Hierarchically-Distributed Peer-to-Peer Clustering [9] is an architecture based on a multi-layer overlay network of peer neighborhoods. Peers at a certain level of the hierarchy cooperate within their respective neighborhooods to perform clustering. Using this model, the clustering problem can be partitioned in a modular way, solving each part individually and then successively combining clusterings up the hierarchy, where increasingly global solutions are computed.
IV. Optimization
The inherent dynamic nature of P2P networks demands efficiency and performance optimization of Peer-to-Peer distributed data mining algorithms. Optimization here means the ability to achieve reasonably good performance, particularly with respect to the large data sets residing on peers. The objective is to identify and modify the factors affecting performance in order to optimize Peer-to-Peer distributed data mining algorithms while keeping the constraints intact. Optimization is a trade-off between cost and accuracy.

In general, the principle that governs the optimization process is the need to reduce the time taken to perform a distributed data mining task [10] by:

1. Identifying factors that affect the performance of distributed data mining (such as data sets, communication and processing)

2. Assigning costs (which bear an inverse relationship with performance) to those factors for alternate scenarios or strategies
3. Choosing a strategy that involves the least cost
and thereby optimizes the performance
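The three steps above amount to a least-cost choice over candidate strategies, which can be sketched as follows. The strategy names, factor names and cost values are illustrative, not taken from [10]:

```python
def choose_strategy(strategy_costs):
    """Step 3: pick the strategy with the least total estimated cost.

    strategy_costs maps strategy name -> dict of per-factor costs
    (the factors identified in step 1, costed in step 2).
    """
    totals = {name: sum(costs.values()) for name, costs in strategy_costs.items()}
    return min(totals, key=totals.get)

# hypothetical costings for two alternate scenarios
costs = {
    "centralize":   {"communication": 9.0, "processing": 2.0},
    "mine-locally": {"communication": 1.5, "processing": 4.0},
}
print(choose_strategy(costs))  # mine-locally
```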
Table 1: P2P distributed data mining algorithms vs. optimization factors

[6] Data Management — factors addressed: Semantic Interoperability; factors addressable: Dimensionality Reduction, DM Query
[3] Association Rule Mining — factors addressed: Minimal Communication Overhead, Tolerance, Cost; factors addressable: Incremental mining cost
[5] Classification — factors addressed: Scalability, Convergence Speed, Communication Optimality; factors addressable: Incremental mining cost of data
[1] Classification — factors addressed: Topology changes, Data changes, Scalability, Speed; factors addressable: DM Query
[8] Clustering — factors addressed: Flexibility, Scalability; factors addressable: Convergence, Approximation of results
[9] Clustering — factors addressed: Stability, Speed; factors addressable: Clustering time

Table 1 describes the data mining functionalities or algorithms adopted in the P2P environment, along with the optimization factors addressed and addressable. From Table 1 it is obvious that when Semantic Interoperability exists, it minimizes the time taken to select relevant data from peers for analysis. For example, if all relations containing legal document information are associated with a keyword or semantics, relations containing the attribute value "Legal" may be considered as a single logical data set for mining (Figure 1).

Figure 1: Same type of relation existing on two different peers.
Peer A: doc_id = L001, doc_type = Legal_doc, description = Land bought
Peer B: doc_id = L189, doc_type = Legal, description = Land acquisition

Dimensionality reduction is the selection of relevant attributes to improve data mining performance, i.e. only attributes with a high impact on the decision attributes are included in the mining process. Convergence and approximation, in turn, denote the speed of arriving at a result with reasonable accuracy. Incremental mining minimizes the computational cost by processing only newly added data instead of computing from scratch.

The intent behind P2PDDM optimization is to reduce the response time by choosing an appropriate strategy with minimal computational cost. The time required to perform a P2PDDM task relies on four factors: (a) communication time, (b) data mining task time, (c) knowledge integration time, and (d) the number of peers involved.

1. Communication time: the time taken to initially interact with the peers and agree upon service levels (tcn).

2. Data mining time: the time taken to mine data on the distributed peers. Peer nodes are heterogeneous in nature, so the time taken to perform the data mining task differs from one peer node to another; hence the maximum time taken by any of the participating peer nodes is used (tmax(dmi)).

3. Knowledge integration time: the time taken to integrate the knowledge from the participating peers (tcb).

tp2pdm = tcn + tmax(dmi) + tcb        (E-1)
(where dmi runs over the participating peer nodes)
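Equation (E-1) can be computed directly; the timing values below are illustrative:

```python
def p2pdm_time(t_cn, t_dm_per_peer, t_cb):
    """Total P2PDDM response time per equation (E-1):
    t = t_cn + max over peers of t_dm + t_cb.
    The max term reflects that the slowest peer gates the task."""
    return t_cn + max(t_dm_per_peer) + t_cb

# negotiation takes 2s, the slowest of three peers mines in 10s,
# and combining the peers' knowledge takes 3s
print(p2pdm_time(2.0, [7.0, 10.0, 4.0], 3.0))  # 15.0
```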
On the whole, from Table 1, the factors that can be optimized to achieve reasonable performance in peer-to-peer data mining include speed, robustness, scalability and interpretability. These factors may vary from one data mining task to another.
A. Data Reduction

The core factor that affects the peer-to-peer data mining task is the time taken to perform the data mining task at each peer. In turn, this computational cost depends directly on the size of the data set. In reality, the data set is likely to be very large, and complex data analysis and mining on huge amounts of data can take a long time or may become infeasible in certain situations. Data reduction techniques can be used to reduce the size of the data set. Data reduction strategies include data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction, and finally discretization and concept hierarchy generation. Among these strategies, attribute subset selection, or feature extraction, though the oldest, is still an active research area. The following subsection gives an overview of feature extraction techniques.
B. Feature Extraction

Feature extraction is the process of detecting and eliminating irrelevant, weakly relevant or redundant attributes or dimensions in a given data set. The goal of feature selection is to find a minimal subset of attributes such that the resulting probability distribution of the data classes is close to the original distribution obtained using all attributes. Comparison is one of the expensive operations involved in a data mining task. In general, the computational cost on a data set D is O(n × |D| × log(|D|)), where n is the number of attributes and |D| the number of instances. The number of comparisons required for m attributes and n instances is m × n². Table 2 shows artificially generated data exhibiting the number of comparisons required for the given numbers of attributes and instances. For a given data set D with 23 attributes and 400 instances, 3,680,000 comparisons are required. Hence, reducing the data set size locally at each peer node would optimize the data mining process. Table 3 shows different percentages of attribute reduction and the respective reduction in comparisons. For the data set D with 23 attributes and 400 instances, suppose the number of attributes after reduction is 5: the number of comparisons required is then 800,000, the percentage of attributes reduced is 78%, and the percentage of comparisons reduced is also 78%. Therefore, a given percentage of attribute reduction directly reduces the comparisons by the same percentage, with no change in the number of instances in D.
Table 2: Required number of comparisons

#attrib.    #inst.    #comp.
4           50        10000
7           100       70000
10          200       400000
14          200       560000
22          400       3520000
23          400       3680000
45          500       11250000
56          600       20160000
74          200       2960000
89          200       3560000
100         100       1000000
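The counts in Table 2, and the reduction percentages discussed above, can be reproduced with a short script implementing the m × n² comparison model:

```python
def comparisons(m_attributes, n_instances):
    """Comparison count m * n^2 used in Tables 2 and 3."""
    return m_attributes * n_instances ** 2

def reduction_pct(m_before, m_after):
    """Percentage of attributes (and hence comparisons) removed,
    assuming the instance count stays the same."""
    return 100 * (m_before - m_after) / m_before

print(comparisons(23, 400))          # 3680000
print(comparisons(5, 400))           # 800000
print(round(reduction_pct(23, 5)))   # 78
```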
C. Feature Selection Techniques
For a data set D with n attributes, 2^n subsets are possible. Searching for an optimal subset is highly expensive, especially as n and the number of data classes increase; sometimes it is infeasible. Therefore, most feature selection techniques are heuristic methods. These heuristics are greedy in nature and try to explore a reduced search space. Feature selection techniques fall into two categories: feature ranking techniques and feature subset selection techniques. In the former, all features are ranked by a metric such as information gain or chi-square, and features that do not achieve an adequate score are eliminated. In the latter, the search is for an optimal subset of features that would be equivalent to the original set of features. The subsets of features are most commonly evaluated using distance metrics such as Euclidean or Hamming distance, or filter metrics such as entropy or probabilistic distance. Common search approaches include greedy forward attribute selection, backward attribute selection, simulated annealing, and genetic algorithms. Various feature selection techniques [12], [13], [14], [15] are shown in Table 4.
Table 3: Reduced #attributes with the reduced #comparisons

#attrib.    #inst.    #comp.
2           50        5000
3           100       30000
3           200       120000
5           200       200000
8           400       1280000
5           400       800000
9           500       2250000
20          600       7200000
23          200       920000
31          200       1240000
40          100       400000

Figure 2: #Comparisons chart before and after attribute reduction
Table 4: Feature Selection Techniques

1. Linear principal component analysis
2. Auto-associative networks
3. Genetic algorithms
4. Sensitivity analysis using neural networks
5. Rough sets
6. Swarm-based rough set reduction
7. Fuzzy-based rough set reduction
8. Simulated annealing
9. Support vector machines
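Greedy forward attribute selection, one of the search approaches mentioned above, can be sketched as follows. The evaluation function here is a toy stand-in for a real subset-quality measure (e.g. cross-validated accuracy or an entropy-based filter score):

```python
def greedy_forward_selection(features, evaluate, max_features=None):
    """Greedy forward attribute selection: repeatedly add the feature
    that most improves the evaluation score, stopping when no single
    feature improves the current subset.

    `evaluate(subset)` returns a quality score for a feature subset;
    higher is better. Ties are broken by feature name via tuple order.
    """
    selected, remaining = [], list(features)
    best_score = evaluate(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best_f = max(scored)
        if score <= best_score:
            break  # no candidate feature improves the subset
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = score
    return selected

# toy score: the subset {'a', 'c'} is optimal; extra features are penalized
def toy_score(subset):
    return len(set(subset) & {"a", "c"}) - 0.1 * len(set(subset) - {"a", "c"})

print(sorted(greedy_forward_selection(["a", "b", "c"], toy_score)))  # ['a', 'c']
```

Because the search is greedy it explores O(n²) subsets instead of 2^n, at the cost of possibly missing the globally optimal subset.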
V. PODM architecture
The inherent nature of P2P systems is highly dynamic. Therefore, the underlying database systems will be heterogeneous, commonly called a multi-database system. The intention behind the P2P paradigm is to enable direct communication among the peer nodes; hence, the participating peers can exchange data and services with other peers directly. In this paper, we focus particularly on P2P distributed data mining of data residing in peers of common interest. Every peer may have some data either to share or to take from other peers. This data residing in various peers, if mined, may yield useful information. Data of this kind may also have semantic inter-dependencies, schema differences, and domain variations. Peer coordination and schema matching are discussed to some extent in [6]. Peer coordination may remain the same, as most P2P systems use relational databases.
To support this, we present a preliminary architecture which reduces the data set to be mined at each peer locally. The distributed data mining process in a P2P network may therefore be optimized with respect to the application run time at each peer. We assume that the peers work in coordination with a specified, agreed level of information exchange among them, and that the peers participate in the task as long as their interest remains common.

The Peer Optimized Data Mining (PODM) architecture (Figure 2) is a two-layer architecture which includes seven modules. Layer 1 is the Peer Coordination Layer (PCL), responsible for coordination with other peer nodes. Layer 2 is the Local Coordination Layer (LCL), meant for local communication. PODM comprises a User Interface (UI), Data Mining Query Manager (DMQM), Semantic Manager (SM), WRAPPER, Data Mining Task Manager (DMTM), Data Reduct Manager (DR) and Pre-processing Manager (PPM). The UI receives the user's data mining query, passes it to the DMQM and displays the results back to the user. The DMQM parses the data mining query issued by the user, interacts with the SM for semantics and issues instructions to the DMTM through the Wrapper. The Wrapper maps the parsed query to the DMTM to initiate the respective data mining task on the reduced data set available with the DR. The PPM pre-processes the data and directs it to the DR. The DR is responsible for the reduced data set, obtained either locally or from other peers. The PCL is a middleware implementation and can therefore communicate within a peer and with other peers only through XML. The LCL is node dependent.

Figure 2: PODM Layer (PCL: UI, DMQM, SM, WRAPPER; LCL: DMTM, DR, PPM; backed by the Peer Database)

It is better to optimize locally rather than to look for a global solution. The PODM layer on each peer optimizes the data mining process locally (Figure 3).

Figure 3: Distributed Computing Architecture using the Peer-to-Peer paradigm (User → PODM layer on Peer A ↔ Peers B and C)

In our work, the Pre-processing Manager (PPM) plays a vital role in producing the reduced data set; the Feature Extractor is embedded within the PPM.

Figure 4: Feature Extraction Process in PPM (Multivariate data → Clean → Normalize → Transform → Feature Extractor → Feature Selector → Validate → Reduced Data Set)
VI. Conclusion

The tremendous growth of P2P networks has spawned a new breed of applications. Peer-to-Peer distributed data mining is an emerging paradigm in distributed computing. The large volumes of data residing on these peers would, on mining, yield fruitful data patterns; but data mining in such a highly dynamic environment is challenging and time consuming. Feature selection can be viewed as one means to reduce data locally on each peer, and the percentage of features reduced has a direct impact on the percentage of comparisons made in the data set. The Peer Optimized Data Mining architecture is a preliminary architecture which views pre-processing as a separate, significant process for reducing the data set size using feature selection.
VII. References

[1] P. Luo, H. Xiong, K. Lü, Z. Shi, "Distributed Classification in Peer-to-Peer Networks", Proceedings of ACM SIGKDD, 2007.
[2] H. Baazaoui Zghal, S. Faiz, H. Ben Ghezala, "A Framework for Data Mining Based Multi-Agent: An …", World Academy of Science, Engineering and Technology.
[3] R. Wolff, A. Schuster, "Association Rule Mining in Peer-to-Peer Systems", IEEE Transactions on Systems, Man and Cybernetics, Part B, 34(6), 2004.
[4] P. …, "Classification and Prediction Algorithms for Data …", 117, 2002.
[5] W. Kowalczyk, M. Jelasity, A. E. Eiben, "Towards Data Mining in Large and Fully Distributed Peer-to-Peer Overlay Networks", pp. 203-210, 2003.
[6] P. A. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, I. Zaihrayeu, "Data Management for Peer-to-Peer Computing: A Vision", Proceedings of WebDB 2002, 2002.
[7] J. C. da Silva, C. Giannella, R. Bhargava, H. Kargupta, M. Klusch, "Distributed Data Mining and Agents", Engineering Applications of Artificial Intelligence, 2005.
[8] S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks", IEEE Internet Computing, special issue on Distributed Data Mining, 2006.
[9] K. M. Hammouda, M. S. Kamel, "HP2PC: Scalable Hierarchically-Distributed Peer-to-Peer Clustering", Proceedings of the SIAM International Conference on Data Mining, 2007.
[10] S. Krishnaswamy, S. Wai Loke, A. Zaslavsky, "… of distributed data mining … by predicting …", Enterprise Information Systems, ISBN 1-4020-1086-9, 2003.
[11] J. Han, M. Kamber, "Data Mining: Concepts and Techniques", 2nd Edition, Morgan Kaufmann, 2006.
[12] C. Dong, D. Wu, J. …, "… Evaluation Dataset Based on Genetic Algorithm and …", IEEE, 2008.
[13] C.-D. Huang, C.-T. Lin, N. R. Pal, "Hierarchical Learning Architecture with Automatic Feature Selection for Multiclass Protein Fold Classification", IEEE Transactions on NanoBioscience, vol. 2, no. 4, December 2003.
[14] "Comparison of Feature Extraction and Selection Techniques …", ND000909, 2001.
[15] T.-H. Cheng, C.-P. Wei, V. S. Tseng, "Feature Selection for Medical Data Mining: Comparisons of Expert Judgment and Automatic Approaches", Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06), 2006.