* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download MULTI-LAYERED FRAMEWORK FOR DISTRIBUTED DATA MINING
Clusterpoint wikipedia , lookup
Operational transformation wikipedia , lookup
Data Protection Act, 2012 wikipedia , lookup
Data center wikipedia , lookup
Data analysis wikipedia , lookup
Forecasting wikipedia , lookup
Information privacy law wikipedia , lookup
Data vault modeling wikipedia , lookup
MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING Masum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov, William Perrizo North Dakota State University Computer Science Department Outline Introduction to Distributed Data Mining • • • Importance of a Layered Architecture A Prototype System • • Demands Existing Projects Architecture System Architecture (layered) • • • Server Communication Client & GUI System Characteristics Conclusion Demands of distributed data mining Large dataset size Diversity of data Geographic distribution of users and resources Computationally intensive result generation Large scale distributed data mining project Kensington project • Mining enterprise data distributed across the internet. Papyrus project • Based on mobile agents implemented using java. PaDDMAS • A component based tool set that integrates pre-developed or custom packages JAM • Agent based distributed system that has been developed to mine stored in different sites. BODHI • Collective data mining with stress on the learning vertically partitioned data. Architecture Client-Server • Advantage: • Able to use high performance computing on the server side to do the data mining. Agent based Hybrid Importance of a Layered Architecture Layered framework helps to manage complexity. Provides the flexibility to add/remove/modify layer and components of a layer Allows for a better tracking of progress of large, complex projects. Human input is required to tune the data and the algorithms to suite the need (Mix of greyware versus software can be changed over time). System Architecture DataMIMETM developed as proofof-concept. Based on patent pending, “P-tree technology” Efficient and scalable system. Flexible plug-ins. Conceptual view of the system Mine on DataMIME™ Capture dataset to DataMIMET M Integrate data (synchronize to existing) Internet System performanc e ananlysis Client Side Server Side One of the Slave Servers Master Server Server Architecture Data capture and integration layer (DCI/DII) Data mining interface (DMI) Distributed Ptree Management Interface (DPMI) • Room for new feeder Already Plugged Algorithm Plugs for new algorithms DMA Layer DCI/DII Layer DMI Layer Uniform data structure Data mining algorithms (DMA) Client-server communication Client interface DPMI: Distributed Ptree Management Interface Distributed Ptree database The Distributed P-tree Database The DPD collects all data in vertical format (as opposed to the ubiquitous horizontal (record-based) data structure used in DBMSs), as Predicatetrees (P-trees) based on the patent pending P-tree technology). P-trees can be 0-dimensional, 1-dimensional, 2-dimensional, etc. Next slide shows the detailed construction of 1-D P-trees from a generic horizontal table of data. Room for new feeder Already Plugged Algorithm Plugs for new algorithms DMA Layer DCI/DII Layer DMI Layer DPMI: Distributed Ptree Management Interface Distributed Ptree database (DPD) Current practice: Structure data into horizontal records. Process vertically (scans) R(A1 Horizontally structured records Scanned vertically 2 6 2 2 5 2 7 7 Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree. e.g., compression of R11 into P11 goes as follows: R[A1] R[A2] R[A3] R[A4] A2 A 3 A4 ) 7 7 7 7 2 2 0 0 6 6 5 5 1 1 1 1 1 0 1 7 4 5 4 4 = 010 011 010 010 101 010 111 111 111 111 110 111 010 010 000 000 110 110 101 101 001 001 001 001 001 000 001 111 100 101 100 100 0 0 0 0 1 0 1 1 R11 pure1? false=0 pure1? true=1 Top-down construction of the 1-dimensional Ptree representation of R11, denoted, P11, is built by recording the truth of the universal predicate “pure 1” in a tree recursively on halves, until purity is achieved. pure1? false=0 pure1? false=0 pure1? false=0 010 011 010 010 101 010 111 111 111 111 110 111 010 010 000 000 110 110 101 101 001 001 001 001 001 000 001 111 100 101 100 100 R11 R12 R13 R21 R22 R23 R31 R32 R33 0 0 0 0 1 0 1 1 1 1 1 1 0 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 R41 R42 R43 0 0 0 1 1 1 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 Horizontally AND basic Ptrees 1. Whole is pure1? false 0 P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43 2. Left half pure1? false 0 P11 And it’s pure0000so00 10 0 00 0 1 00 1 00 0 00 1 00 0 00 0 01 0 01 0 00 00 0 3. Right half pure1? false 0 10 10 10 01 01 01 0100 0 01 1 01 0001 branch ends 10 0 ^ 01 ^ 10 ^ ^ ^ 01 ^ ^^ ^ ^ 10 01 01 01 01 10 4. Left half of rt half ? false0 7 0 1 4 0 0 5. Rt half of right half? true1 01 To count occurrences of 7,0,1,4 use pure111000001100: 0 23-level 6. Lf half of lf of rt? true1 But it is pure 1 10 P11^P12^P13^P’21^P’22^P’23^P’31^P’32^P33^P41^P’42^P’43 = 0 0 22-level =2 (pure0) so this ^ 21-level 01 7. Rt half of lf of rt? false0 branch ends 2-D P-tree Data Structure m 11 11 11 11 11 11 11 01 11 11 11 11 11 11 11 11 11 10 11 11 11 11 11 11 00 00 00 10 11 11 11 11 1 m m m 0 1 m 1 1 m 1 1 1 1 1 0 0 0 1 0 1 1 0 1 Peano or Z-ordering Pure (Pure-1/Pure-0/Mixed) quadrants Root Count (count of 1s in the tree) Provides an efficient format for ANDing, ORing and Complementing. Lossless, compressed, countcomputation_ready representations. DCI/DII (Data Capture and Data Integration Interface) layer Allows user to capture and to integrate data to the required format (p-tree format). The main component of this layer is the feeder. An individual feeder can process a particular format of incoming data. User can write his own feeder and plug it very easily in this architecture. DCI/DII Layer DMI (Data Mining Interface) layer DMI does counting, the most important operation for data mining provided by P-trees, including: • basic P-trees • value P-trees • tuple P-trees • Interval P-trees • Cube P-trees DMI also provide the P-tree algebra, which has four operations: • AND • OR • NOT (complement) and • XOR, to implement the point wise logical operations on P-trees for (Data Mining Algorithms) DMA. D M I Distributed Ptree Management Interface (DPMI) Layer The DPMI layer provides: • access • location • and concurrency transparency by hiding the fact that: • data representation may differ • resources may be located in different places • resources may be shared by several competitive users. By resource we meant data and its converted form Ptree. DMA (Data Mining Algorithms) layer This layer is a collection of data mining tools (algorithms). Upon receiving a request from the client side an algorithm will be fired up for mining. This layer depends on the DMI for accessing meta-info and required counts needed in: • • • • Ptree based K Nearest Neighbor PKNN Podium Incremental Neighbor Evaluator PINE P-BAYESIAN Etc. The architecture has the flexibility to plug-in any new algorithm on this layer. DMA Layer Communication The communication between different layers is designed in such a way that it minimizes the data flow over the network. In the DCI and the DMA communication protocols a client will create a connection, send a request, receive a response and close the connection. A client will send only one request in a single threaded connection. The response for a request is a line with a message indicating the outcome of the request. A DMA protocol request has a similar structure : header and an optional set of binary files with checksums. The header in the DMA protocol is a set of key / value pairs (properties. Response to the DMA protocol request also contains key / value pairs. Client Structure The two main functionalities are: • Capture: Which sends datasets along with their meta information (description of the data) to the DII/DCI layer of the server for capturing. Meta Data • Mining: This sends requests to the DMA layer for applying data mining applications on previously captured datasets and the presentation of the results. Prediction Model Data Unclassified data DMA Client side DMA Client Side DCI DCI Data Meta-data generator Client Side DCI Visualization Tool Client and GUI Data Capturing Data Mining In the client side DataMIMETM has a graphical user interface (GUI) to visually interact with a user (http://midas.cs.ndsu.nodak.edu/~datasurg/datamime ) System Characteristics Ability to handle formatted record-based, relational-like data with numerical and/or categorical attributes. The data could be in text format, relational format, or TIFF image format. Easy conversion from any other machine readable format can be provided through customized data feeders. Users can do any data analysis and mining on data sets in the system, or on any new data they capture or integrate into the system. Capable of handling large quantities of data and mines them in scalable time. Clients of the system can run on UNIX and Microsoft Windows platform with the server designed to be a UNIXbased system. System Characteristics (cont.) Supports major RDBMS platforms. The server engine can be run on a single machine or distributed across multiple computers for better scalability and efficiency. The system has an open architecture provides high degree of software extensibility and integration capabilities. The system provides high level of asynchronous background operations, performing most data intensive operations in the background or offline and allowing users to continue their work. The system minimizes the flow of data across the network. Conclusion We have shown the importance of having a layered architecture for a distributed data mining system. Key elements were identified in deciding on the different layers. Able to identify a unique efficient vertical data structure at the lowest layer that can take advantages of the latest hardware. To facilitate the data distribution a management layer is also recognized. Two other layers are defined: data capture and data mining layer. A prototype system was developed as a proof-of-concept to show the feasibility of the approach.