Data-Intensive Computing: From Clouds to GPUs
Gagan Agrawal
May 22, 2017

Motivation
• Growing need for analysis of large-scale data: scientific and commercial
• Data-Intensive Supercomputing (DISC)
• Map-Reduce has received a lot of attention in both the database/data mining and high-performance computing communities

Motivation (2)
• Processor architecture trends
  - Clock speeds are not increasing
  - Trend towards multi-core architectures
  - Accelerators like GPUs are very popular
  - Clusters of multi-cores / GPUs are becoming common
• Trend towards the cloud
  - Use storage/computing/services from a provider (how many of you prefer gmail over a cse/osu account for e-mail?)
  - Utility model of computation
  - Need for high-level APIs and adaptation

My Research Group
Data-intensive theme at multiple levels:
• Parallel programming models
• Multi-cores and accelerators
• Adaptive middleware
• Scientific data management / workflows
• Deep web integration and analysis

This Talk
• Parallel programming API for data-intensive computing
  - An alternate API and system for Google's MapReduce
  - An actual comparison
• Fault-tolerance for data-intensive computing
• Data-intensive computing on accelerators
  - Compilation for GPUs

Map-Reduce
A simple API for (data-intensive) parallel programming. The computation:
• Applies map to each input data element
• Produces (key, value) pair(s)
• Sorts the pairs by key
• Applies reduce to each set of values with a distinct key

Map-Reduce Execution
[Figure: Map-Reduce execution flow]

Map-Reduce: Positives and Questions
• Positives
  - Simple API: functional-language based, very easy to learn
  - Support for fault-tolerance: important for very large-scale clusters
• Questions
  - Performance? Comparison with other approaches
  - Suitability for different classes of applications?

Class of Data-Intensive Applications
• Many different types of applications
  - Data-center kind of applications: data scans, sorting, indexing
  - More "compute-intensive" data-intensive applications: machine learning, data mining, NLP; Map-Reduce / Hadoop are widely used for this class
  - Standard database operations: a SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
• What is Map-Reduce suitable for? What are the alternatives? (MPI/OpenMP/Pthreads: too low level?)

FREERIDE: Goals
Framework for Rapid Implementation of Data mining Engines, developed 2001-2003 at Ohio State:
• The ability to rapidly prototype a high-performance mining implementation
• Distributed-memory parallelization
• Shared-memory parallelization
• Ability to process disk-resident datasets
• Only modest modifications to a sequential implementation for the above three

FREERIDE – Technical Basis
• Popular data mining algorithms share a common canonical loop (generalized reduction)
• This loop can be used as the basis for a common API
• Demonstrated for popular data mining and scientific data processing applications

    while( /* not converged */ ) {
      forall (data instances d) {
        I = process(d)
        R(I) = R(I) op f(d)
      }
      ...
    }
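To make the canonical loop concrete, here is a minimal sketch of one k-means iteration written in this generalized-reduction form. It is an illustration, not FREERIDE's actual interface: the ReductionObject layout and the closest() helper are assumptions introduced here.

    #include <cfloat>
    #include <vector>

    // Sketch only (not the FREERIDE API): the reduction object R holds,
    // per cluster, running coordinate sums and a member count.
    struct ReductionObject {
        std::vector<double> sums;   // k * dim coordinate sums
        std::vector<long>   counts; // k member counts
    };

    // process(d): index of the centroid closest to point d (assumed helper)
    static int closest(const double *d, const std::vector<double> &centroids,
                       int k, int dim) {
        int best = 0;
        double bestDist = DBL_MAX;
        for (int c = 0; c < k; ++c) {
            double dist = 0.0;
            for (int j = 0; j < dim; ++j) {
                double diff = d[j] - centroids[c * dim + j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // The canonical loop body: I = process(d); R(I) = R(I) op f(d)
    void kmeans_iteration(const std::vector<double> &data, int n, int dim,
                          const std::vector<double> &centroids, int k,
                          ReductionObject &R) {
        for (int i = 0; i < n; ++i) {
            const double *d = &data[i * dim];
            int I = closest(d, centroids, k, dim);  // I = process(d)
            for (int j = 0; j < dim; ++j)
                R.sums[I * dim + j] += d[j];        // R(I) = R(I) op f(d)
            R.counts[I] += 1;
        }
    }

Because every update goes through the reduction object R, the same loop can be parallelized by either locking R or replicating it per thread, which is what lets one sequential-looking implementation serve both shared-memory and distributed-memory parallelization.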
Comparing Processing Structure
• The two processing structures are similar, but with subtle differences
[Figure: side-by-side comparison of the Map-Reduce and FREERIDE processing structures]

Processing Structure for FREERIDE
• Basic one-stage dataflow
[Figure: FREERIDE one-stage dataflow]

Observations on Processing Structure
• Map-Reduce is based on a functional idea
  - Does not maintain state
  - This can lead to sorting overheads
• The FREERIDE API is based on a programmer-managed reduction object
  - Not as "clean"
  - But avoids sorting
  - Can also help shared-memory parallelization
  - Helps better fault recovery

Experiment Design
• Tuning parameters in Hadoop: input split size, max number of concurrent map tasks per node, number of reduce tasks
• For comparison, we used four applications
  - Data mining: k-means, kNN, Apriori
  - A simple data-scan application: wordcount
• Experiments on a multi-core cluster with 8 cores per node (8 map tasks)

Results – Data Mining
[Chart: k-means, varying number of nodes (4, 8, 16); dataset 6.4 GB, k = 1000, dim = 3; avg. time per iteration (sec), Hadoop vs. FREERIDE]

Results – Data Mining (II)
[Chart: Apriori, varying number of nodes (4, 8, 16); dataset 900 MB, support level 3%, confidence level 9%; avg. time per iteration (sec), Hadoop vs. FREERIDE]

Results – Data Mining (III)
[Chart: kNN, varying number of nodes (4, 8, 16); dataset 6.4 GB, k = 1000, dim = 3; avg. time per iteration (sec), Hadoop vs. FREERIDE]

Results – Datacenter-like Application
[Chart: wordcount, varying number of nodes (4, 8, 16); dataset 6.4 GB; total time (sec), Hadoop vs. FREERIDE]

Scalability Comparison
[Chart: k-means, varying dataset size (800 MB to 12.8 GB); k = 100, dim = 3, on 8 nodes; avg. time per iteration (sec), Hadoop vs. FREERIDE]

Scalability – Word Count
[Chart: wordcount, varying dataset size (800 MB to 12.8 GB), on 8 nodes; total time (sec), Hadoop vs. FREERIDE]

Observations
• Performance issues with Hadoop are now well known
• How much of a factor is the API? Java and the file system invite an API comparison on the same platform
• This motivates the design of MATE: Map-reduce with an AlternaTE API

Basis: Phoenix Implementation
• A shared-memory map-reduce implementation from Stanford, C based
• An efficient runtime that handles parallelization, resource management, and fault recovery
• Extended here to support a FREERIDE-like API

Functions
APIs provided by the runtime (R = required, O = optional):

    int mate_init(scheduler_args_t *args)                               R
    int mate_scheduler(void *args)                                      R
    int mate_finalize(void *args)                                       O
    void reduction_object_pre_init()                                    R
    int reduction_object_alloc(int size)    (returns the object id)     R
    void reduction_object_post_init()                                   R
    void accumulate/maximal/minimal(int id, int offset, void *value)    O
    void reuse_reduction_object()                                       O
    void *get_intermediate_result(int iter, int id, int offset)         O
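For illustration, a driver for an iterative application might call these functions roughly as sketched below. The call sequence follows the table above, but the "mate.h" header name, the callback registration, the offset units, and the helper functions are assumptions, not the documented MATE interface.

    #include "mate.h"   /* assumed header exposing the runtime API above */

    static int robj;    /* reduction object id */
    static int k, dim;  /* problem parameters */

    /* Per-element reduction callback (registration mechanism assumed).
     * Layout: for cluster c, slots c*(dim+1) .. c*(dim+1)+dim-1 hold
     * coordinate sums; slot c*(dim+1)+dim holds the member count.
     * Offsets are treated as element slots (an assumption). */
    void kmeans_reduce(void *elem) {
        double *pt = (double *) elem;
        int c = closest_centroid(pt);                 /* assumed helper */
        for (int j = 0; j < dim; j++)
            accumulate(robj, c * (dim + 1) + j, &pt[j]);
        double one = 1.0;
        accumulate(robj, c * (dim + 1) + dim, &one);
    }

    int run_kmeans(scheduler_args_t *args, int max_iters) {
        mate_init(args);                 /* args describe input and splits */

        reduction_object_pre_init();
        robj = reduction_object_alloc(k * (dim + 1) * sizeof(double));
        reduction_object_post_init();

        for (int iter = 0; iter < max_iters; iter++) {
            mate_scheduler(args);        /* applies kmeans_reduce to all elements */
            /* read sums/counts via get_intermediate_result(), update centroids */
            reuse_reduction_object();    /* reset for the next iteration */
        }
        return mate_finalize(NULL);
    }

Note how the reduction object is updated in place through accumulate(): no (key, value) pairs are emitted or sorted, which is exactly the API difference measured in the comparison above.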
Experiments: K-means
[Chart: k-means with a 400 MB dataset of 3-dimensional points, k = 100, on one AMD node with 16 cores]

Fault-Tolerance in FREERIDE/MATE
• Map-reduce supports fault-tolerance by replicating files
  - Storage and processing-time overheads
• The FREERIDE/MATE API offers another option
  - The reduction object is a low-cost application-level checkpoint
  - Can support efficient recovery
  - Can also allow redistribution of work to other nodes

Fault Tolerance Results
[Chart: fault-tolerance results]

This Talk
• Parallel programming API for data-intensive computing
  - An alternate API and system for Google's MapReduce
  - An actual comparison
• Fault-tolerance for data-intensive computing
• Data-intensive computing on accelerators
  - Compilation for GPUs

Background – GPU Computing
• Many-core architectures/accelerators are becoming more popular
• GPUs are inexpensive and fast
• CUDA is a high-level language for GPU programming

CUDA Programming
• A significant improvement over the use of graphics libraries, but:
  - Requires detailed knowledge of the GPU architecture and a new language
  - Must specify the grid configuration
  - Must deal with memory allocation and movement
  - Requires explicit management of the memory hierarchy

Parallel Data Mining
• Common structure of data mining applications (FREERIDE):

    /* outer sequential loop */
    while( /* not converged */ ) {
      /* reduction loop */
      foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
      }
    }

Porting on GPUs
• High-level parallelization is straightforward
• The details matter: data movement, the impact of thread count on reduction time, and the use of shared memory

Architecture of the System
• User input: variable information, reduction functions, and optional functions
• Variable analyzer: derives the variable access pattern and combination operations
• Code analyzer (in LLVM)
• Code generator: emits the host program, the kernel functions, and the grid configuration and kernel invocation, which are compiled into the executable

User Input
• Variables to be used in the reduction function
• The value of each variable, or the size of each array
• A sequential reduction function
• Optional functions (initialization function, combination function, ...)

Analysis of Sequential Code
• Get the access features of each variable
• Determine the data to be replicated
• Get the operator for the global combination
• Identify variables for shared memory

Memory Allocation and Copy
[Diagram: alternative layouts A, B, C mapping threads T0..T63 to copies of the reduction data]
• A copy is needed for each thread
• The updates are copied back to host memory after the kernel reduction function returns

Generating CUDA Code and the C++/C Code Invoking the Kernel Function
• Memory allocation and copy
• Thread grid configuration (block number and thread number)
• Global function
• Kernel reduction function
• Global combination
(A hand-written sketch of such a generated kernel appears after the summary.)

Optimizations
• Using shared memory
• Providing user-specified initialization and combination functions
• Specifying variables that are allocated only once

Applications
• K-means clustering
• EM clustering
• PCA

Results
[Charts: k-means results and speedups; speedup of EM; speedup of PCA]

Summary
• Data-intensive computing is of growing importance
• One size doesn't fit all
• Map-reduce has many limitations
• Accelerators can be promising for achieving performance
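Backup: Sketch of a Generated Kernel
As a concrete illustration of the code-generation strategy described above, the CUDA kernel below hand-writes what the generator might emit for the k-means reduction loop; it is not actual generated output, and the names and data layout are assumptions. Each thread updates a private copy of the reduction object in shared memory, and thread 0 combines the copies, matching the replication scheme sketched in the Memory Allocation and Copy slide.

    #include <cfloat>

    // Each thread keeps a private copy of the reduction object -- per-cluster
    // coordinate sums plus a count -- in shared memory. Note that
    // blockDim.x * k * (dim + 1) floats must fit in shared memory, which is
    // the thread-count trade-off noted earlier.
    __global__ void kmeans_reduce_kernel(const float *data, int n, int dim,
                                         const float *centroids, int k,
                                         float *block_results) {
        extern __shared__ float robj[];      // blockDim.x private copies
        const int tid = threadIdx.x;
        const int robj_size = k * (dim + 1);
        float *my = &robj[tid * robj_size];
        for (int i = 0; i < robj_size; ++i) my[i] = 0.0f;
        __syncthreads();

        // Grid-stride loop over the input: (i, val) = process(e)
        for (int e = blockIdx.x * blockDim.x + tid; e < n;
             e += gridDim.x * blockDim.x) {
            const float *d = &data[e * dim];
            int best = 0;
            float bestDist = FLT_MAX;
            for (int c = 0; c < k; ++c) {
                float dist = 0.0f;
                for (int j = 0; j < dim; ++j) {
                    float diff = d[j] - centroids[c * dim + j];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            // Reduc(i) = Reduc(i) op val, on this thread's private copy
            for (int j = 0; j < dim; ++j)
                my[best * (dim + 1) + j] += d[j];
            my[best * (dim + 1) + dim] += 1.0f;
        }
        __syncthreads();

        // Combine the per-thread copies and copy this block's result back;
        // the host (or a second kernel) performs the global combination.
        if (tid == 0) {
            for (int t = 1; t < blockDim.x; ++t)
                for (int i = 0; i < robj_size; ++i)
                    robj[i] += robj[t * robj_size + i];
            for (int i = 0; i < robj_size; ++i)
                block_results[blockIdx.x * robj_size + i] = robj[i];
        }
    }

A launch such as kmeans_reduce_kernel<<<blocks, threads, threads * k * (dim + 1) * sizeof(float)>>>(...) supplies the dynamic shared-memory size, and the host then combines the per-block results into the final reduction object.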