Computer Science and Engineering

FREERIDE-G: Framework for Developing Grid-Based Data Mining Applications
L. Glimcher, R. Jin, G. Agrawal
Presented by: Leo Glimcher {[email protected]}
ICPP’06 FREERIDE-G: Enabling Remote Data Mining Leonid Glimcher

Distributed Data-Intensive Science
[Diagram: a User, a Compute Cluster, and a Data Repository Cluster, with a "?" between them]

Challenges for Application Development
• Analysis of large amounts of disk-resident data
• Incorporating parallel processing into analysis
• Processing needs to be independent of other elements and easy to specify
• Coordination of storage, network and computing resources required
• Transparency of data retrieval, staging and caching is desired

FREERIDE-G Goals
• Support High-End Processing
  – Enable efficient processing of large-scale data mining computations
• Ease Use of Parallel Configurations
  – Support shared and distributed memory parallelization starting from a common high-level interface
• Hide Details of Data Movement and Caching
  – Data staging and caching (when feasible/appropriate) need to be transparent to the application developer

Presentation Road Map
• Motivation and goals
• System architecture and overview
• Applications used for evaluation
• Experimental evaluation
• Related work in distributed data-intensive science
• Conclusions and future work

FREERIDE-G Architecture
[Diagram: each Data Repository node runs a data server with Data Retrieval, Data Distribution, and Communication components; each Compute Node in the user cluster has Data Processing, Caching, Retrieval, and Communication components]

Data Server Functionality
• Data retrieval:
  – data chunks read from repository disks
• Data distribution:
  – each chunk assigned a destination processing node in the user cluster
• Data communication:
  – each chunk forwarded to its destination processing node
A data server runs on every on-line data repository node, automating data delivery to the end user.

Compute Node Functionality
• Data communication:
  – data chunks received from the corresponding data node
• Computation:
  – application-specific processing performed on each chunk
• Data caching & retrieval:
  – for multi-pass algorithms, data is cached locally on the 1st pass and retrieved locally for subsequent passes
A compute server runs on every processing node to receive data and process it in an application-specific way.

Processing Structure of FREERIDE-G
Built on FREERIDE
Key observation: most algorithms follow a canonical loop
Middleware API:
• Subset of data to be processed
• Reduction object
• Local and global reduction operations
• Iterator
Supports:
• Disk-resident datasets
• Shared & distributed memory
Canonical loop:
    While( ) {
        forall(data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
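
The canonical loop above can be illustrated with a minimal Python sketch of a generalized reduction. This is not the FREERIDE-G API: the names `local_reduction` and `global_reduction`, the dictionary used as the reduction object, and the toy data are all assumptions made for illustration. The sketch also assumes `op` is associative and commutative and can combine partial results as well as raw instances (true for a sum).

```python
from collections import defaultdict
from operator import add


def local_reduction(chunk, process, op, identity):
    """Fold each data instance of one chunk into a fresh reduction object."""
    reduction = defaultdict(lambda: identity)
    for d in chunk:
        i = process(d)                       # I = process(d)
        reduction[i] = op(reduction[i], d)   # R(I) = R(I) op d
    return dict(reduction)


def global_reduction(partials, op, identity):
    """Combine the per-node reduction objects, as a master node would."""
    merged = defaultdict(lambda: identity)
    for partial in partials:
        for i, value in partial.items():
            merged[i] = op(merged[i], value)
    return dict(merged)


# Hypothetical example: sum integers per bucket of width 10.
chunks = [[1, 12, 15], [3, 27, 11]]          # one chunk per compute node
partials = [local_reduction(c, lambda d: d // 10, add, 0) for c in chunks]
totals = global_reduction(partials, add, 0)
print(totals)
```

Because each chunk is reduced independently, the local step can run on any number of nodes; only the small reduction objects need to be gathered for the global step.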

Summary of Implementation Issues
• Managing and communicating remote data:
  – 2-way coordination required
• Load distribution:
  – needed if the compute cluster is bigger than the data cluster
• Parallel processing on compute cluster:
  – FREERIDE-G supports generalized reductions
• Caching:
  – benefits multi-pass algorithms

Remote Data Issues
• Managing data communication:
  – ADR library used for scheduling and performing data retrieval at the repository site
  – communication timing coordinated between source and destination
• Caching:
  – local file system used for caching
  – avoids redundant communication of data for (P-1)/P of the iterations

Parallel Data Processing Issues
• Load distribution:
  – Needed when more compute nodes are available than data nodes
  – Hashing on unique chunk ID
• Parallel processing on compute cluster:
  – After data is distributed, local reduction performed on every node
  – Reduction object gathered at Master node
  – Global combination (reduction) performed on Master node

Application Summary
• Data Mining:
  – K-Nearest Neighbor search
  – K-means clustering
  – EM clustering
• Scientific Feature Mining:
  – Vortex detection in the fluid flow dataset
  – Molecular defect detection in the molecular dynamics dataset
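
Returning to the load-distribution step from the parallel data processing slide: the slides say only that chunks are assigned by hashing on a unique chunk ID. The sketch below assumes the simplest possible hash, the chunk ID modulo the number of compute nodes; the function name `assign_chunks` and the integer chunk IDs are hypothetical.

```python
def assign_chunks(chunk_ids, num_compute_nodes):
    """Map each chunk to a compute node by hashing its unique ID.

    Assumption: the hash is chunk_id % num_compute_nodes; the actual
    hash function used by FREERIDE-G is not given in the slides.
    """
    assignment = {node: [] for node in range(num_compute_nodes)}
    for chunk_id in chunk_ids:
        assignment[chunk_id % num_compute_nodes].append(chunk_id)
    return assignment


# Hypothetical example: 8 chunks spread over 4 compute nodes.
plan = assign_chunks(range(8), 4)
print(plan)
```

With consecutive chunk IDs this spreads chunks evenly, so a data node can feed several compute nodes when the compute cluster is the larger of the two.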

Goals for Experimental Evaluation
• Evaluating parallel scalability of the applications developed:
  – Numbers of data and compute nodes kept equal with variable parallel configurations
• Evaluating scalability of compute nodes:
  – Number of compute nodes kept independent of number of data nodes
• Evaluating benefits of caching:
  – Multi-pass algorithms evaluated

Evaluating Overall Scalability
Equal number of repository and compute nodes
• Cluster of 700 MHz Pentiums
• Connected through Myrinet LANai 7.0 (no access to high bandwidth network)
[Figure: Vortex detection application. Execution time vs. parallel configuration for 260 MB, 710 MB, and 1.85 GB datasets]
[Figure: Defect detection application. Execution time vs. parallel configuration for 130 MB, 450 MB, and 1.8 GB datasets]

Overall Scalability
[Figure: K-means clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets]
[Figure: EM clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets]
• All 5 applications tested:
  – High parallel efficiency
  – Good scalability with respect to:
    • problem size
    • number of processing nodes

Evaluating Scalability of Compute Nodes
Compute cluster size is greater than data repository cluster size.
Applications (single pass only):
1. kNN search,
2. molecular defect detection,
3. vortex detection (next slide).
Parallel configurations:
• Data nodes: 1 to 8
• Compute nodes: 1 to 16
[Figure: KNN search application. Execution time vs. data node # (1, 2, 4, 8) for 1, 2, 4, 8, and 16 compute nodes (cn)]
[Figure: Defect detection application. Execution time vs. data node # (1, 2, 4, 8) for 1, 2, 4, 8, and 16 compute nodes (cn)]

Compute Node Scalability
[Figure: Vortex detection application. Execution time vs. data node # (1, 2, 4, 8) for 1, 2, 4, 8, and 16 compute nodes (cn)]
• Only data processing work is parallelized
• Data retrieval and communication times are not affected
• Speedups are sub-linear
Better resource utilization leads to a decrease in analysis time.

Evaluating Effects of Caching
• Network bandwidth simulated: 500 KB/sec
• Caching vs. non-caching versions compared
Comparing data communication times (P passes):
  – factor of P decrease from caching
Caching benefit depends on:
• application
• network bandwidth
[Figure: K-means clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets, caching (c) and non-caching (nc) versions]
[Figure: EM clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets, caching (c) and non-caching (nc) versions]

Related Work
• Support for grid-based data mining:
  – Knowledge Grid toolset
  – Grid-Miner toolkit
  – Discovery Net layer
  – DataMiningGrid framework
  No interface for easing parallelization and abstracting data movement
• GRIST – support for astronomy-related mining on the grid
  Specific to the astronomical domain
FREERIDE-G is built directly on top of FREERIDE.
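
The "factor of P" claim from the caching slide can be checked with simple arithmetic. The sketch below is an idealized model, not a measurement: it counts only network transfer time and ignores local cache reads and computation. The function name, the choice of P = 5 passes, and the convention 1 MB = 1000 KB are assumptions; the 350 MB dataset size and 500 KB/sec bandwidth come from the slides.

```python
def communication_time(dataset_kb, bandwidth_kb_per_s, passes, cached):
    """Idealized network transfer time in seconds for a multi-pass algorithm.

    Without caching the dataset crosses the network on every pass; with
    caching it crosses once and later passes read the local copy
    (local read cost is ignored in this model).
    """
    per_pass = dataset_kb / bandwidth_kb_per_s
    transfers = 1 if cached else passes
    return per_pass * transfers


# 350 MB dataset (taken as 350,000 KB), 500 KB/sec, hypothetical P = 5.
no_cache = communication_time(350_000, 500, passes=5, cached=False)
with_cache = communication_time(350_000, 500, passes=5, cached=True)
print(no_cache, with_cache, no_cache / with_cache)
```

Under these assumptions the non-caching version spends 3500 s on communication and the caching version 700 s, a ratio equal to P; equivalently, communication is avoided for (P-1)/P of the passes.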

Conclusions
• FREERIDE-G supports remote data analysis from a high-level interface
• Evaluated on a variety of algorithms
• Demonstrated scalability in terms of:
  – Even data-compute scale-up
  – Compute node scale-up (only processing time)
• Multi-pass algorithms benefit from data caching

Continuing Work on FREERIDE-G
• High bandwidth network evaluation
• Performance prediction based resource selection
• Resource allocation
• More sophisticated caching and data communication mechanisms (SRB)
• Data format issues: wrapper integration
• Higher-level front-end to further ease development of data analysis tools for the grid