Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
High Level Abstractions for Data-Intensive Computing Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain University of Notre Dame 4/30/2008 Christopher Moretti – University of Notre Dame 1 Computing’s central challenge, “How not to make a mess of it,” has not yet been met. -Edsger Dijkstra 4/30/2008 Christopher Moretti – University of Notre Dame 2 Overview Many systems today give end users access to hundreds or thousands of CPUs. But, it is far too easy for the naive user to create a big mess in the process. Our Solution: Deploy high-level abstractions that describe both data and computation needs. Some examples of current work: All-Pairs: An abstraction for biometric workloads. Distributed Ensemble Classification DataLab: A system and language for data-parallel computation. 4/30/2008 Christopher Moretti – University of Notre Dame 3 Distributed Computing is Hard! Which resources ? How do I fit my workload into jobs? What about job input data? How can I measure job stats? 4/30/2008 What is Condor? How Many? What happens when things fail? How long will it take? What do I do with the results? Christopher Moretti – University of Notre Dame 5 Distributed Computing is Hard! Which resources ? How do I fit my workload into jobs? What about job input data? How can I measure job stats? What is Condor? How Many? What happens when things fail? ARGH! 4/30/2008 How long will it take? What do I do with the results? Christopher Moretti – University of Notre Dame 6 The All-Pairs Problem S2 1 All-Pairs( Set S1, Set S2, Function F ) yields a matrix M: Mij = F(S1i,S2j) S2 2 S2 3 S2 4 S2 5 S2 6 S2 7 S1 1 S1 2 S1 3 S1 4 S1 5 S1 6 60K 20KB images >1GB 3.6B comparisons @ 50/s = 2.3 CPUYrs x 8B output = 29GB 4/30/2008 S1 7 Christopher Moretti – University of Notre Dame 11 Biometric All-Pairs Comparison 1 .8 .1 0 0 .1 1 0 .1 .1 0 1 0 .1 .7 1 F 1 0 .1 1 4/30/2008 Christopher Moretti – University of Notre Dame 12 Naïve Mistakes Computing Problem: Even expert users don’t know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, network, or resource manager. For all $X : For all $Y : cmp $X to $Y Batch System CPU 4/30/2008 CPU CPU CPU Each CPU reads 10TB! file server Christopher Moretti – University of Notre Dame 13 Consequences of Naïve Mistakes 4/30/2008 Christopher Moretti – University of Notre Dame 14 All Pairs Abstraction set S of files binary function F F invocation M = AllPairs(F,S) 4/30/2008 Christopher Moretti – University of Notre Dame 15 All-Pairs Production System at Notre Dame Web Portal F 1 - Upload F and S into web portal. S G H T 300 active storage units 500 CPUs, 40TB disk 4 – Choose optimal partitioning and submit batch jobs. F F F F F F 2 - AllPairs(F,S) All-Pairs Engine 6 - Return result matrix to user. 4/30/2008 5 - Collect and assemble results. 3 - O(log n) distribution by spanning tree. Christopher Moretti – University of Notre Dame 17 4/30/2008 Christopher Moretti – University of Notre Dame 18 4/30/2008 Christopher Moretti – University of Notre Dame 19 4/30/2008 Christopher Moretti – University of Notre Dame 20 Returning the Result Matrix 4.37 Too many files. 6.01 Hard to do prefetching. 2.22 4.37 7.13 8.94 6.72 1.34 … … … 0.98 Too large files. Must scan entire file. Row/Column ordered. How can we build it? 4/30/2008 Christopher Moretti – University of Notre Dame 21 Result Storage by Abstraction Chirp_array allows users to create, manage, modify large arrays without having to realize underlying form. Operations on chirp_array: create a chirp_array open a chirp_array set value A[i,j] get value A[i,j] X X get row A[i] get column A[j] set row A[i] set column A[j] 4/30/2008 Christopher Moretti – University of Notre Dame CPU CPU CPU Disk Disk Disk 22 Result Storage with chirp_array chirp_array_get(i,j) 4/30/2008 Christopher Moretti – University of Notre Dame CPU CPU CPU Disk Disk Disk 23 Result Storage with chirp_array chirp_array_get(i,j) 4/30/2008 Christopher Moretti – University of Notre Dame CPU CPU CPU Disk Disk Disk 24 Result Storage with chirp_array chirp_array_get(i,j) 4/30/2008 Christopher Moretti – University of Notre Dame CPU CPU CPU Disk Disk Disk 25 Data Mining on Large Data Sets Problem: Supercomputers are expensive, not all scientists have access to them for completing very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright. 4/30/2008 Christopher Moretti – University of Notre Dame 26 Data Mining Using Ensembles training data partitioning/sampling (optional) algorithm 1 algorithm n classifier 1 classifier n test instance voting classification (From Steinhaeuser and Chawla, 2007) 4/30/2008 Christopher Moretti – University of Notre Dame 27 Data Mining Using Ensembles training data partitioning/sampling (optional) algorithm 1 algorithm n classifier 1 classifier n test instance voting classification (From Steinhaeuser and Chawla, 2007) 4/30/2008 Christopher Moretti – University of Notre Dame 28 Abstraction for Ensembles Using Natural Parallelism Here are my algorithms. Here is my data set. Here is my test set. Abstraction Engine Choose optimal partitioning and submit batch jobs. CPU CPU CPU CPU Local Votes Return local votes for tabulation and final prediction. 4/30/2008 Christopher Moretti – University of Notre Dame 29 DataLab Abstractions file system tcsh emacs perl distributed data structures set S A B file F function evaluation Y = F(X) C job_start job_commit job_wait job_remove parrot chirp server chirp server chirp server unix filesys unix filesys unix filesys 4/30/2008 chirp server Christopher Moretti – University of Notre Dame chirp server F X Y 30 DataLab Language Syntax apply F on S into T set S A chirp server 4/30/2008 B set T F C A chirp server chirp server chirp server F F F Christopher Moretti – University of Notre Dame B C chirp server 31 For More Information Christopher Moretti [email protected] Douglas Thain [email protected] Cooperative Computing Lab http://cse.nd.edu/~ccl 4/30/2008 Christopher Moretti – University of Notre Dame 32