FREERIDE: System Support for High Performance Data Mining
Ruoming Jin
Leo Glimcher
Xuan Zhang
Ge Yang
Gagan Agrawal
Department of Computer and Information Sciences
Ohio State University
Motivation: Data Mining Problem
• Datasets available for mining are often large
• Our understanding of what algorithms and parameters will give the desired insights is limited
• The time required for creating scalable implementations of different algorithms and running them with different parameters on large datasets slows down the data mining process
Project Overview
• FREERIDE (Framework for Rapid Implementation of Datamining Engines) as the base system
• Demonstrated for a variety of standard mining algorithms
• FREERIDE offers:
  – The ability to rapidly prototype a high-performance mining implementation
  – Distributed memory parallelization
  – Shared memory parallelization
  – Ability to process large and disk-resident datasets
  – Only modest modifications to a sequential implementation for the above three
Key Observation from Mining Algorithms
• Popular algorithms have a common canonical loop
• This loop can be used as the basis for supporting a common middleware

While( ) {
  forall( data instances d) {
    I = process(d)
    R(I) = R(I) op d
  }
  …….
}
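The canonical loop can be made concrete with a short sketch: `process` maps a data instance to the index of the reduction element it updates, and `op` folds the instance into that element. The function and the histogram example below are illustrative, not FREERIDE's actual API.

```python
# A minimal sketch of the canonical mining loop: each data instance is
# mapped to one reduction element, which is updated with an associative,
# commutative operation. Names are illustrative, not FREERIDE's API.

def reduction_pass(data, num_elements, process, op, init):
    # R is the reduction object: one accumulator per reduction element.
    R = [init] * num_elements
    for d in data:              # forall( data instances d)
        i = process(d)          # I = process(d)
        R[i] = op(R[i], d)      # R(I) = R(I) op d
    return R

# Example: a histogram over 4 buckets (process = bucket index,
# op = increment a count).
data = [0.1, 0.4, 0.4, 0.9, 0.2]
counts = reduction_pass(
    data, 4,
    process=lambda d: min(int(d * 4), 3),
    op=lambda acc, d: acc + 1,
    init=0,
)
```

Because `op` is associative and commutative, the `forall` iterations can run in any order, which is what makes the loop parallelizable.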
Shared Memory Parallelization Techniques
• Full Replication: create a copy of the reduction object for each thread
• Full Locking: associate a lock with each element
• Optimized Full Locking: put the element and its corresponding lock on the same cache block
• Fixed Locking: use a fixed number of locks
• Cache-Sensitive Locking: one lock for all elements in a cache block
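As a minimal illustration of the first scheme, full replication can be sketched with Python threads (FREERIDE itself is a C++ runtime; all names here are illustrative): each thread updates a private copy of the reduction object, so the pass needs no locks, and the copies are merged in a final combination phase.

```python
import threading

# Sketch of full replication: each thread accumulates into a private
# copy of the reduction object; the copies are merged after all threads
# finish. Illustrative only; FREERIDE itself is a C++ runtime.

def replicated_reduction(data, num_elements, num_threads, process):
    copies = [[0] * num_elements for _ in range(num_threads)]

    def worker(tid):
        local = copies[tid]
        # Each thread processes a disjoint slice of the data.
        for d in data[tid::num_threads]:
            local[process(d)] += 1   # private copy: no locking needed

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Global combination phase: merge the per-thread copies.
    return [sum(c[i] for c in copies) for i in range(num_elements)]

counts = replicated_reduction(list(range(100)), 4, 3,
                              process=lambda d: d % 4)
```

The trade-off, discussed below, is memory: the reduction object is duplicated once per thread.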
Memory Layout for Various Locking Schemes
(Figure: memory layouts for Full Locking, Optimized Full Locking, Fixed Locking, and Cache-Sensitive Locking; the legend distinguishes locks from reduction elements.)
Trade-offs between Techniques
• Memory requirements: high memory requirements can cause memory thrashing
• Contention: if the number of reduction elements is small, contention for locks can be a significant factor
• Coherence cache misses and false sharing: more likely with a small number of reduction elements
Combining Shared Memory and Distributed Memory Parallelization
• Distributed memory parallelization by replication of the reduction object
• Naturally combines with full replication on shared memory
• For locking with non-trivial memory layouts, two options:
  – Communicate the locks also
  – Copy the reduction elements to a separate buffer
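The second option can be sketched as follows, with the interleaved (lock, element) layout simulated as a flat list; a real implementation would gather from raw memory and perform the inter-node reduction with MPI.

```python
# Sketch of option 2: gather the reduction elements out of an
# interleaved (lock, element) layout into a contiguous buffer before
# the inter-node reduction, then scatter the combined values back.
# The layout is simulated with a flat list; a real implementation
# would operate on raw memory and communicate via MPI.

def gather_elements(layout):
    # layout = [lock0, elem0, lock1, elem1, ...]
    return layout[1::2]

def scatter_elements(layout, buf):
    layout[1::2] = buf

# Two simulated "nodes", each with an interleaved layout
# (locks shown as None since they are never communicated).
node_a = [None, 3, None, 5, None, 0]
node_b = [None, 1, None, 2, None, 7]

buf_a, buf_b = gather_elements(node_a), gather_elements(node_b)
combined = [x + y for x, y in zip(buf_a, buf_b)]   # the all-reduce step
scatter_elements(node_a, combined)
```

This avoids shipping the locks over the network at the cost of an extra copy on each node.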
Apriori Association Mining
(Chart: relative performance of full replication (fr), optimized full locking (ofl), and cache-sensitive locking (csl) at support levels of 0.10%, 0.05%, 0.03%, and 0.02%; time in seconds. 500 MB dataset, N2000, L20, 4 threads.)
K-means Shared Memory Parallelization
(Chart: execution time of full replication, optimized full locking, and cache-sensitive locking with 1, 4, and 16 threads.)
Performance on Cluster of SMPs
(Chart: Apriori association mining execution time with 1, 2, and 3 threads per node on 1, 2, 4, and 8 nodes.)
Results from EM Clustering Algorithm
• EM is a popular data mining algorithm
• Can we parallelize it using the same support that worked for another clustering algorithm (k-means) and for algorithms for other mining tasks?
(Chart: execution time with 1, 2, 3, and 4 threads per node on 1, 2, 4, and 8 nodes.)
Results from FP-tree
• FP-tree: 800 MB dataset, 20 frequent itemsets
(Chart: execution time with 1, 2, and 3 threads on 1, 2, 4, and 8 nodes.)
A Case Study: Decision Tree Construction
• Question: can we parallelize decision tree construction using the same framework?
• Most existing parallel algorithms have a fairly different structure (sorting, writing back …)
• Being able to support decision tree construction will significantly add to the usefulness of the framework
Approach
• Implemented the RainForest framework (Gehrke)
• Current focus is on RF-read
• Overview of the algorithm:
  While the stop condition is not satisfied:
    – read the data
    – build the AVC-groups for the nodes
    – choose the splitting attributes to split the nodes
    – select a new set of nodes to process, as long as main memory can hold them
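AVC-group construction (AVC = Attribute-Value-Class label counts, from the RainForest work) fits the canonical reduction loop: every record increments one counter per attribute. A minimal sketch, assuming a simplified record layout of (attribute values, class label):

```python
from collections import defaultdict

# Sketch of building an AVC-group for one tree node: for each
# attribute, count occurrences of each (attribute value, class label)
# pair. The record layout (values tuple, class label) is a
# simplifying assumption for illustration.

def build_avc_group(records, num_attributes):
    avc = [defaultdict(int) for _ in range(num_attributes)]
    for values, label in records:
        for a in range(num_attributes):
            avc[a][(values[a], label)] += 1   # the reduction update
    return avc

records = [((1, 'x'), 'yes'), ((1, 'y'), 'no'), ((2, 'x'), 'yes')]
avc = build_avc_group(records, 2)
```

Because each update is a commutative increment, the same replication and locking schemes described earlier apply directly.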
Parallelization Strategies
• Pure approach: apply only one of full replication, optimized full locking, and cache-sensitive locking
• Vertical approach: use replication at the top levels of the tree, locking at the lower levels
• Horizontal approach: use replication for attributes with a small number of distinct values, locking otherwise
• Mixed approach: combine the above two
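The horizontal approach amounts to a per-attribute choice: attributes with few distinct values have small AVC sets that are cheap to replicate per thread, while the rest use locking. The threshold and attribute names below are illustrative, not values from the work described here.

```python
# Sketch of the horizontal strategy: pick replication for attributes
# with few distinct values (small AVC sets, cheap to copy per thread)
# and locking for the rest. The threshold is an illustrative parameter.

def assign_strategies(distinct_counts, threshold=32):
    return {attr: ('replication' if n <= threshold else 'locking')
            for attr, n in distinct_counts.items()}

plan = assign_strategies(
    {'age': 100, 'gender': 2, 'salary': 5000, 'state': 50},
    threshold=64,
)
```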
Results
(Chart: execution time of the pure versions (fr, ofl, csl) on 1, 2, 4, and 8 nodes. 1.3 GB dataset with 32 million records in the training set, function 7, decision tree depth = 16.)
Results
(Chart: execution time of the horizontal, vertical, and mixed strategies on 1, 2, 4, and 8 nodes, combining full replication and full locking.)
Results
(Chart: execution time of the horizontal, vertical, and mixed strategies on 1, 2, 4, and 8 nodes, combining full replication and cache-sensitive locking.)
Combining Distributed Memory and Shared Memory Parallelization for Decision Tree
• The key problem: the large size of the AVC-groups means a very high communication volume
• This results in very limited speedups
• Can we modify the algorithm to reduce the communication volume?
SPIES On (a) FREERIDE
• Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
• Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
• Does not require sorting of data, or partitioning and writing-back of records
• Paper in the SDM regular program
(Chart: execution time with 1, 2, and 3 threads per node on 1 node and on 8 nodes.)
Applying FREERIDE for Scientific Data Mining
• Joint work with Machiraju and Parthasarathy
• Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
• A feature is a region of interest in a dataset
• A suite of algorithms for extracting and tracking features
Summary
• Demonstrated a common framework for parallelization of a wide range of mining algorithms:
  – Association mining: Apriori and FP-tree
  – Clustering: k-means and EM
  – Decision tree construction
  – Nearest neighbor search
• Both shared memory and distributed memory parallelism
• A number of advantages:
  – Eases parallelization
  – Supports higher-level interfaces