TECHNIA International Journal of Computing Science and Communication Technologies, VOL. 3, NO. 1, July 2010. (ISSN 0974-3375)

A Framework for Optimizing the Performance of Peer-to-Peer Distributed Data Mining Algorithms

E. Anupriya and N.Ch.S.N. Iyengar
School of Computing Sciences and Engineering, VIT University, Vellore, TN 632 014, India
[email protected], [email protected]

Abstract

The advent of new technologies such as pervasive, distributed and ubiquitous computing has made data accessible virtually from anywhere, at any time. Analysing such highly distributed data and discovering patterns in it is a highly complex task. Distributed Data Mining (DDM) deals with the analysis of data patterns in environments with distributed data, computing nodes and users. High-speed networks and inexpensive hardware devices have contributed to the rapid growth of server-less Peer-to-Peer (P2P) networks. Together, the peers in such networks store large volumes of data collected from different sources, and mining this distributed data may yield useful patterns or results. The primary objective of P2P data mining is to perform the mining collaboratively among the distributed peers, rather than transferring the data to a centralized site and mining it there. Depending on the application domain, the mining functionality may be classification, clustering or association rule mining.

Keywords: peer-to-peer distributed data mining (P2PDDM), Peer Optimized Data Mining (PODM), Feature Selection.

I. Introduction

Peer-to-Peer distributed data mining is an emerging paradigm in distributed computing. The objective of this paper is to present a broad study of existing peer-to-peer data mining algorithms and their computational challenges, and to identify the relevant optimization factors. In particular, we focus on reducing the data set size on every peer using feature selection, one of the core optimization factors for reducing application run time. To this end, we propose a preliminary architecture, the Peer Optimized Data Mining (PODM) architecture, in which data reduction is treated as a separate, significant process that shrinks the data set at each peer node. The Pre-processing Manager (PPM) in the architecture performs automatic feature selection and forwards the data to the Data Reduct Manager (DR). We show that the percentage of features reduced has a direct impact on the percentage of comparisons made over the data set.

P2P DDM applications spawn a new breed of applications such as collaborative decision making, social network analysis and surveillance using sensor networks. In this paper, we present a broad study of P2P DDM algorithms and focus on the factors and methods that improve peer-to-peer distributed data mining. Section 2 describes the various P2P computational challenges. Section 3 discusses related work on existing P2P distributed data mining algorithms. Section 4 identifies and enumerates the optimization factors, and also covers data reduction and feature extraction. Section 5 explains the Peer Optimized Data Mining architecture, and Section 6 concludes our work.

II. Peer-to-Peer (P2P) Computational Environment Challenges

Tremendous work has been carried out in the distributed data mining area, but most algorithms are designed for stable networks and data. Extending DDM algorithms to suit a P2P computational environment therefore requires careful modification, as peers can join or leave the peer group at any time, or hold differing data caused by the failure and recovery of peers. P2P DM algorithms must exhibit the following operational characteristics.

Scalable: P2P DM algorithms should be scalable in terms of handling varying data sizes and varying numbers of peers, as peers may join or leave the peer group at any time.
Minimal communication overhead: P2P DM algorithms should communicate efficiently with short messages, to minimize both the risk of communication failures and the overhead of large data or message transfers.

Incremental: P2P DM algorithms should be incremental, handling variations in the data set rather than restarting the mining process from the beginning.

Minimal data exchange: algorithms running on peer nodes should exchange minimal data so that they remain localized, since global synchronization is impossible in systems as large as P2P networks.

Fault tolerance: P2P DM algorithms should be fault tolerant, as failure of peers, loss of data, and changes in data due to failures are quite common in a highly dynamic P2P environment.

III. Related Works

The analysis starts from how data are stored and managed in P2P systems. The Local Relational Model (LRM) [6] is a data model designed specifically for P2P applications. Since most real database systems are relational or object-relational in nature, LRM assumes for simplicity that all nodes in the P2P network are relational databases. LRM aims to allow inconsistent databases and to support interoperability in the absence of a global schema by establishing coordination among the peers; coordination formulas define semantic dependencies between two databases. However, it fails to address the underlying communication protocol, the automatic derivation of domain relations, and the domain mapping logic.

Data mining functionality includes characterization, discrimination, association rule mining, classification and clustering. The problem of association rule mining in P2P systems is challenging due to the dynamic nature of P2P networks [3].
A new algorithm, LSD-ARM (Large-Scale Distributed Association Rule Mining), comprises two independent components. The first is an ARM algorithm that traverses the local database and maintains the current result. The second is a majority voting protocol in which each node participates. All rules with confidence above the threshold are discovered and combined.

Classification is a central problem in data mining. The Newscast model of computation [5] is proposed for P2P overlay networks, typically used for information dissemination and file sharing. This distributed computation model allows the effective calculation of basic statistics such as basic averaging, weighted averaging and cumulative averaging, and the authors demonstrate how to implement data mining algorithms on top of these techniques. Naive Bayes is used to illustrate distributed classification, with the observation that ratio calculation is sufficient to compute the conditional probabilities.

The ensemble paradigm for distributed classification in P2P networks [1] discusses building local classifiers and integrating their results globally. Under this paradigm, each peer builds classifiers on its local data, and the results from all local classifiers are then combined by plurality voting. To build the local classifiers, the authors adopt the learning algorithm of "pasting bites" to generate multiple local classifiers on each peer from the local data. To combine local results, they propose a general form of Distributed Plurality Voting (DPV) protocol for dynamic P2P networks.

Decentralization of algorithms and distributed data mining applications in the context of P2P networks is an interesting direction [8]. That work describes both exact and approximate distributed data mining algorithms that operate in a decentralized manner, and in particular illustrates these approaches on the problem of computing and monitoring clusters in the data residing at the different nodes of a peer-to-peer network.
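The plurality-voting combination step can be sketched in a few lines. This is a single-process illustration of the voting idea only, not the DPV protocol itself; the class labels are invented for the example:

```python
from collections import Counter

def plurality_vote(local_predictions):
    """Combine per-peer predictions for one sample by plurality vote."""
    counts = Counter(local_predictions)
    # The winning label is the one proposed by the most local classifiers.
    return counts.most_common(1)[0][0]

# Illustrative: three peers classify the same sample with their local models.
votes = ["spam", "ham", "spam"]
print(plurality_vote(votes))  # spam
```

In the real protocol the vote counts are aggregated over the network rather than in one process, but the decision rule at the end is the same.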
It takes the lead in approximately solving the k-means clustering problem in a peer-to-peer network. HP2PC, a scalable hierarchically distributed peer-to-peer clustering architecture [9], is based on a multi-layer overlay network of peer neighborhoods. Peers at a given level of the hierarchy cooperate within their respective neighborhoods to perform clustering. Using this model, the clustering problem can be partitioned in a modular way: each part is solved individually, and the cluster solutions are then successively combined up the hierarchy, where increasingly global solutions are computed.

IV. Optimization

The inherently dynamic nature of P2P networks demands efficient, optimized peer-to-peer distributed data mining algorithms. Optimization is the ability to achieve reasonably good performance, particularly with respect to the large data sets residing on peers. The objective is to identify and modify the factors affecting performance so as to optimize P2P DDM algorithms while keeping the constraints intact. Optimization is a trade-off between cost and accuracy. In general, the principle that governs the optimization process is the need to reduce the time taken to perform a distributed data mining task [10] by:

1. Identifying the factors that affect performance in distributed data mining (such as data sets, communication and processing);
2. Assigning costs (which have an inverse relationship with performance) to those factors for alternative scenarios or strategies;
3. Choosing the strategy that involves the least cost, thereby optimizing performance.
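The three-step principle above can be sketched as a minimal cost-based strategy chooser; the strategy names and cost figures below are purely illustrative assumptions, not values from the paper:

```python
def choose_strategy(strategies):
    """Pick the strategy whose total assigned cost is least (step 3)."""
    # Each candidate maps its performance factors to costs (step 2);
    # lower total cost stands in for better expected performance.
    return min(strategies, key=lambda s: sum(s["costs"].values()))

# Hypothetical candidates with costs for the factors identified in step 1.
candidates = [
    {"name": "centralize", "costs": {"communication": 9.0, "processing": 2.0}},
    {"name": "mine-locally", "costs": {"communication": 1.0, "processing": 4.0}},
]
print(choose_strategy(candidates)["name"])  # mine-locally
```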
The intent behind P2PDDM optimization is to reduce the response time by choosing an appropriate strategy with minimal computational cost. The time required to perform a P2PDDM task relies on four factors: (a) communication time, (b) data mining task time, (c) knowledge integration time, and (d) the number of peers involved.

1. Communication time: the time taken to initially interact with the peers and agree upon service levels (tcn).
2. Data mining time: the time taken to mine the data on the distributed peers. Peer nodes are heterogeneous in nature, so the time taken to perform the data mining task differs from one peer node to another; hence the maximum time taken by any of the participating peer nodes is used (tmax(dmi)).
3. Knowledge integration time: the time taken to integrate the knowledge from the participating peers (tcb).

tp2pdm = tcn + tmax(dmi) + tcb    (E-1)

Table 1: P2P distributed data mining algorithms vs. optimization factors

Paper | DM Functionality | Optimization factors addressed | Optimization factors addressable
[6] | Data Management | Semantic Interoperability | Dimensionality Reduction, DM Query
[3] | ARM | Minimal Communication Overhead, Tolerance, Cost | Incremental mining cost
[5] | Classification | Scalability, Convergence Speed, Communication Optimal, Topology changes, Data changes | Incremental mining cost of data
[1] | Classification | Scalability, Speed, Flexibility | DM Query
[8] | Clustering | Scalability, Convergence, Approximation of results |
[9] | Clustering | Stability, Speed, Clustering time |

Figure 1: The same type of relation existing on two different peers (Peer A: doc_id L001, doc_type Legal_doc, description "Land acquisition"; Peer B: doc_id L189, doc_type Legal, description "Land bought").

Table 1 describes the data mining functionalities or algorithms adopted in the P2P environment, together with the optimization factors addressed and addressable. From Table 1 it is obvious that where semantic interoperability exists, it minimizes the time taken to select relevant data from peers for analysis.
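The response-time model tp2pdm = tcn + tmax(dmi) + tcb can be expressed as a small sketch; the timing figures are invented for illustration:

```python
def p2pdm_response_time(t_cn, dm_times, t_cb):
    """Total P2PDDM task time: negotiation + slowest peer's mining + integration."""
    # The slowest participating peer dominates the mining phase, since
    # knowledge integration cannot start until every peer has finished.
    return t_cn + max(dm_times) + t_cb

# Illustrative figures (seconds) for three heterogeneous peers.
print(p2pdm_response_time(2.0, [10.0, 14.5, 8.0], 3.0))  # 19.5
```

The max over per-peer mining times is why reducing the data set at the slowest peers matters most: shaving time off any peer other than the slowest does not change tp2pdm at all.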
For example, if all relations containing legal document information are associated with a keyword or semantics, say relations containing the attribute value "Legal", they may be considered as a single logical data set for mining. Dimensionality reduction is likewise the selection of relevant attributes to improve data mining performance; that is, only attributes with a high impact on the decision attributes are included in the mining process. Convergence and approximation, in turn, concern the speed of arriving at a result of reasonable accuracy. Incremental mining minimizes the computational cost by processing only newly added data instead of recomputing from scratch. On the whole, from Table 1, the factors that can be optimized to achieve reasonable performance in peer-to-peer data mining include speed, robustness, scalability and interpretability. These factors may vary from one data mining task to another.

A. Data Reduction

The core factor affecting a peer-to-peer data mining task is the time taken to perform the mining at each peer. This computational cost depends directly on the size of the data set, which in reality is likely to be very large. Complex data analysis and mining on huge amounts of data can take a long time, or may become infeasible in certain situations. Data reduction techniques can be used to reduce the size of the data set. Data reduction strategies include data cube aggregation, attribute subset selection, dimensionality reduction, numerosity reduction, and finally discretization with concept hierarchy generation. Among these strategies, attribute subset selection (feature selection), though the oldest, is still an active research area. The following subsection gives an overview of feature extraction techniques.

B. Feature Extraction

Feature extraction is the process of detecting and eliminating irrelevant, weakly relevant or redundant attributes or dimensions in a given data set.
The goal of feature selection is to find the minimal subset of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

Comparison is one of the most expensive operations involved in a data mining task. In general, the computational cost over a data set D is O(n × |D| × log |D|), where n is the number of attributes and |D| the number of instances. The number of comparisons required for m attributes and n instances is m × n². Table 2 presents artificially generated data exhibiting the number of comparisons required for given numbers of attributes and instances. For a data set D with 23 attributes and 400 instances, 3,680,000 comparisons are required. Hence reducing the data set size locally at each peer node would optimize the data mining process. Table 3 shows different percentages of attribute reduction and the corresponding reduction in comparisons. For the same data set D with 23 attributes and 400 instances, if the number of attributes after reduction is 5, the number of comparisons required drops to 800,000: the percentage of attributes reduced is 78%, and the percentage of comparisons reduced is also 78%. Therefore a given percentage of attribute reduction directly yields the same percentage reduction in comparisons, with no change in the number of instances in D.

Table 2: Required number of comparisons

# attrib. | # inst. | # comp.
4 | 50 | 10,000
7 | 100 | 70,000
10 | 200 | 400,000
14 | 200 | 560,000
22 | 400 | 3,520,000
23 | 400 | 3,680,000
45 | 500 | 11,250,000
56 | 600 | 20,160,000
74 | 200 | 2,960,000
89 | 200 | 3,560,000
100 | 100 | 1,000,000

C. Feature Selection Techniques

For a data set D with n attributes, 2^n subsets are possible. Searching for an optimal subset would be highly expensive, especially as n and the number of data classes increase, and may sometimes be infeasible. Therefore most feature selection techniques are heuristic methods.
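The m × n² comparison counts and reduction percentages in Tables 2 and 3 are easy to reproduce; a short sketch using the paper's 23-attribute, 400-instance example:

```python
def comparisons(m, n):
    """Pairwise-comparison count for m attributes over n instances (m * n^2)."""
    return m * n * n

def reduction_pct(before, after):
    """Percentage reduction, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)

full = comparisons(23, 400)    # 3,680,000 comparisons with all attributes
reduced = comparisons(5, 400)  # 800,000 after keeping only 5 attributes
# Attribute reduction and comparison reduction give the same percentage,
# because the instance count (and hence the n^2 factor) is unchanged.
print(reduction_pct(23, 5), reduction_pct(full, reduced))  # 78 78
```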
These heuristic methods are greedy in nature and explore a reduced search space. Feature selection techniques fall into two categories: feature ranking techniques and feature subset selection techniques. In the former, all features are ranked by a metric such as information gain or chi-square, and features that do not achieve an adequate score are eliminated. In the latter, the search is for an optimal subset of features that is equivalent to the original set of features. Feature subsets are evaluated most commonly using distance metrics such as Euclidean or Hamming distance, or filter metrics such as entropy or probabilistic distance. Common search approaches include greedy forward attribute selection, backward attribute elimination, simulated annealing, and genetic algorithms. Various feature selection techniques [12], [13], [14], [15] are shown in Table 4.

Table 3: Reduced #attributes with #reduced comparisons

# attrib. | # inst. | # comp.
2 | 50 | 5,000
3 | 100 | 30,000
3 | 200 | 120,000
5 | 200 | 200,000
8 | 400 | 1,280,000
5 | 400 | 800,000
9 | 500 | 2,250,000
20 | 600 | 7,200,000
23 | 200 | 920,000
31 | 200 | 1,240,000
40 | 100 | 400,000

Figure 2: #Comparisons chart before and after attribute reduction.

Table 4: Feature Selection Techniques

S.No. | Feature Selection Technique
1 | Linear principal component analysis
2 | Auto-associative networks
3 | Genetic algorithms
4 | Sensitivity analysis using neural networks
5 | Rough sets
6 | Swarm-based rough set reduction
7 | Fuzzy-based rough set reduction
8 | Simulated annealing
9 | Support vector machines

V. PODM Architecture

The inherent nature of P2P systems is highly dynamic; the underlying database systems will therefore be heterogeneous, commonly called a multi-database system. The intention behind the P2P paradigm is to enable direct communication among the peer nodes. Hence, participating peers can exchange data and services with other peers directly.
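As a concrete instance of the feature ranking category, the sketch below ranks two discrete features of a toy data set by information gain; the data and feature names are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Information gain of one discrete feature over the class labels."""
    base = entropy(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[feature], []).append(y)
    # Expected entropy after splitting on the feature's values.
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in split.values())
    return base - remainder

# Toy data: 'outlook' predicts the class perfectly, 'noise' not at all.
rows = [{"outlook": "sun", "noise": 0}, {"outlook": "sun", "noise": 1},
        {"outlook": "rain", "noise": 0}, {"outlook": "rain", "noise": 1}]
labels = ["yes", "yes", "no", "no"]
ranked = sorted(["outlook", "noise"],
                key=lambda f: info_gain(rows, labels, f), reverse=True)
print(ranked)  # ['outlook', 'noise']
```

A ranking scheme like this would then drop every feature whose score falls below a chosen threshold, which is exactly the elimination step described above.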
In this paper, we focus particularly on P2P distributed data mining of data residing in peers with a common interest. Every peer may have some data either to share or to obtain from other peers, and this data, if mined, may yield useful information. Data of this kind may also have semantic inter-dependencies, schema differences and domain variations. Peer coordination and schema matching are discussed to some extent in [6]; peer coordination may remain the same, as most P2P systems use relational databases.

The PODM layer comprises a User Interface (UI), Data Mining Query Manager (DMQM), Semantic Manager (SM), Wrapper, Data Mining Task Manager (DMTM), Data Reduct (DR) and Pre-processing Manager (PPM). The UI receives the user's data mining query, passes it to the DMQM and displays the results back to the user. The DMQM parses the data mining query issued by the user, interacts with the SM for semantics, and issues instructions to the DMTM through the Wrapper. The Wrapper maps the parsed query to the DMTM to initiate the respective data mining task on the reduced data set available with the DR. The PPM pre-processes the data and directs it to the DR. The DR is responsible for the reduced data set, whether obtained locally or from other peers. The PCL layer is a middleware implementation and can therefore communicate within and with other peers only through XML; the LCL layer is node dependent.

Figure 2 - PODM Layer (User; PCL: UI, DMQM, SM, Wrapper, DMTM, DR; LCL: PPM — clean, normalize, transform — over the peer database)

Figure 3: Distributed Computing Architecture using the Peer-to-Peer paradigm (Peers A, B and C)

To support this, we have presented a preliminary architecture that reduces the data set to be mined at each peer locally, so that the distributed data mining process in the P2P network may be optimized with respect to the application run time at each peer.
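The PPM flow (clean, normalize, then feature selection producing a reduced data set) might be sketched as follows; the specific cleaning and scaling policies shown are illustrative assumptions, not the paper's specification:

```python
def clean(rows):
    """Drop instances with missing values (one simple cleaning policy)."""
    return [r for r in rows if all(v is not None for v in r)]

def normalize(rows):
    """Min-max scale each numeric column to [0, 1]."""
    scaled = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        scaled.append([(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col])
    return [list(r) for r in zip(*scaled)]

def select_features(rows, keep):
    """Project every instance onto the selected attribute indices."""
    return [[r[i] for i in keep] for r in rows]

def ppm(rows, keep):
    # Clean -> normalize -> feature selection, mirroring the PPM pipeline.
    return select_features(normalize(clean(rows)), keep)

# Illustrative multivariate data; the second instance has a missing value.
data = [[1.0, 10.0, 5.0], [3.0, None, 7.0], [2.0, 30.0, 9.0]]
print(ppm(data, keep=[0, 2]))  # reduced data set keeping 2 of 3 attributes
```

The indices passed as `keep` would in practice come from the Feature Extractor embedded in the PPM, for example from a ranking such as the information-gain scheme discussed earlier.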
We assume that the peers work in coordination, with a specified, agreed level of information exchange among them, and that peers participate in the task as long as their interest remains common. The Peer Optimized Data Mining (PODM) architecture (Figure 2) is a two-layer architecture comprising seven modules. Layer 1 is the Peer Coordination Layer (PCL), responsible for coordination with other peer nodes. Layer 2 is the Local Coordination Layer (LCL), meant for local communication.

Figure 4: Feature Extraction Process in PPM (multivariate data → feature selector → validation → reduced data set)

It is better to optimize locally rather than look for a global solution. The PODM layer on each peer optimizes the data mining process locally (Figure 3). In our work, the Pre-processing Manager (PPM) plays a vital role in producing the reduced data set; the Feature Extractor is embedded within the PPM.

VI. Conclusion

The tremendous growth of P2P networks has spawned a new breed of applications. Peer-to-peer distributed data mining is an emerging paradigm in distributed computing. Mining the large volumes of data residing on these peers would yield fruitful data patterns, but data mining in such a highly dynamic environment is challenging and time consuming. Feature selection can be viewed as one means of reducing data locally on each peer; the percentage of features reduced has a direct impact on the percentage of comparisons made over the data set. The Peer Optimized Data Mining architecture is a preliminary architecture that treats pre-processing as a separate, significant process for reducing the data set size using feature selection.

VII. References
[1] P. Luo, H. Xiong, K. Lü and Z. Shi, "Distributed Classification in Peer-to-Peer Networks", Proceedings of ACM SIGKDD, 2007.
[2] H. Baazaoui Zghal, S. Faiz and H. Ben Ghezala, "A Framework for Data Mining Based Multi-Agent: An Application to Spatial Data", World Academy of Science, Engineering and Technology.
[3] R. Wolff and A. Schuster, "Association Rule Mining in Peer-to-Peer Systems", IEEE Transactions on Systems, Man and Cybernetics, Part B, 34(6), 2004.
[4] P. …, "Classification and Prediction Algorithms for Data …", 117, 2002.
[5] W. Kowalczyk, M. Jelasity and A. E. Eiben, "Towards Data Mining in Large and Fully Distributed Peer-to-Peer Overlay Networks", pp. 203-210, 2003.
[6] P. A. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini and I. Zaihrayeu, "Data Management for Peer-to-Peer Computing: A Vision", Proceedings of WebDB 2002, 2002.
[7] J. C. da Silva, C. Giannella, R. Bhargava, H. Kargupta and M. Klusch, "Distributed Data Mining and Agents", 2005.
[8] S. Datta, K. Bhaduri, C. Giannella, R. Wolff and H. Kargupta, "Distributed Data Mining in Peer-to-Peer Networks", IEEE Internet Computing, special issue on Distributed Data Mining, 2006.
[9] K. M. Hammouda and M. S. Kamel, "HP2PC: Scalable Hierarchically-Distributed Peer-to-Peer Clustering", Proceedings of the SIAM International Conference on Data Mining, 2007.
[10] S. Krishnaswamy, S. Wai Loke and A. Zaslavsky, "… of distributed data mining by predicting …", Enterprise Information Systems (ISBN 1-4020-1086-9), 2003.
[11] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edition, 2006.
[12] C. Dong, D. Wu and J. …, "… Evaluation Dataset Based on Genetic Algorithm and …", IEEE (978-0-7695-3336-0/08), 2008.
[13] C.-Der Huang, C.-Teng Lin and N. Ranjan Pal, "Hierarchical Learning Architecture With Automatic Feature Selection for Multiclass Protein Fold Classification", IEEE Transactions on Nanobioscience, vol. 2, no. 4, December 2003.
[14] "Comparison of Feature Extraction and Selection Techniques …", ND000909, 2001.
[15] T.-Hsiang Cheng, C.-Ping Wei and Vincent S. Tseng, "Feature Selection for Medical Data Mining: Comparisons of Expert Judgment and Automatic Approaches", Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06), 2006.