Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to Map/Reduce Data Transformations Tasso Argyros CTO and Co-Founder Aster Data Systems [email protected] A Brief History of MapReduce 2004: Google publishes MapReduce paper at OSDI 2 2006: Hadoop becomes the first OpenSource implementation of MapReduce Confidential and proprietary. Copyright © 2008 Aster Data Systems 2008: Aster and Greenplum become the first database vendors to incorporate MapReduce functionality into their products What is MapReduce? It’s the simplest API you have ever seen It has just two functions 1. Map() and 2. Reduce() Plus: it’s language independent (Java, Perl, Python, …) 3 Confidential and proprietary. Copyright © 2008 Aster Data Systems Why is MapReduce Useful? It simplifies distributed applications… …by abstracting the details of data distribution (where is the data I need?) and process distribution (where should I run this process?)… …behind two simple functions. But let’s see an example 4 Confidential and proprietary. Copyright © 2008 Aster Data Systems The quick brown fox jumps over the lazy dog. The world only needs five computers. MapReduce is a very powerful programming paradigm. In-Database MapReduce is the future. Server A Hello world. Server B Switch 5 To be or not to be: that is the question. Confidential and proprietary. Copyright © 2008 Aster Data Systems Server C Server D Goal We Want to Count the # of Times Each Word Occurs 6 Confidential and proprietary. Copyright © 2008 Aster Data Systems 1st Approach1st Approach No MapReduce No MapReduce 7 Confidential and proprietary. Copyright © 2008 Aster Data Systems The quick brown fox jumps over the lazy dog In-Database MapReduce is the future. Server A the quick brown fox jumps over the lazy dog The world only needs five computers. the world only needs five computers Hello world. in database mapreduce is the future concept Server B Confidential and proprietary. Copyright © 2008 Aster Data Systems To be or not to be: that is the question. MapReduce is mapreduce a very is powerful a concept. very powerful Switch 8 hello world Server C Server D to be or not to be that is the question the quick brown fox jumps over the lazy dog in database mapreduce is the future the world only needs five computers hello world mapreduce is a very powerful concept to be or not to be that is the question Server 4 Final Result File 9 the 5 is 3 mapreduce 2 … … Confidential and proprietary. Copyright © 2008 Aster Data Systems What Did We Do? 1. Write a script to parse the documents and output word lists 2. FTP all the word lists to server 4 3. Write another script to count each word on Server 4 Problem: (2) and (3) do not scale! 10 Confidential and proprietary. Copyright © 2008 Aster Data Systems 2nd Approach No MapReduce Fully Distributed 11 Confidential and proprietary. Copyright © 2008 Aster Data Systems The quick brown fox jumps over the lazy dog In-Database MapReduce is the future. Server A the quick brown fox jumps over the lazy dog The world only needs five computers. the world only needs five computers Hello world. hello world To be or not to be: that is the question. MapReduce is mapreduce a very is powerful a concept. very powerful in database mapreduce is the future concept Server B the the world the world the powerful the lazy database brown database Confidential and proprietary. Copyright © 2008 Aster Data Systems future 12 Switch Server C Server D mapreduce mapreduce be be to jumps computers hello is is is question over a that to be or not to be that is the question Server 1 Final Result File the 5 … …. Server 2 Final Result File world 2 … …. Server 3 Final Result File mapreduce 2 … …. Server 4 Final Result File 13 is 3 … …. Confidential and proprietary. Copyright © 2008 Aster Data Systems 2nd Approach: No MapReduce, Distributed 1. Write a script to parse the documents and output word lists 2. FTP all the word lists to a single node 3. Write another script to count each word 4. Write another script to break each word list to four files: f1, f2, f3, f4 Note: for this to work the same word needs to be in the same file across all servers. How do I do that? Answer: use some form of hashing 5. FTP all f1s to Server 1, all f2s to Server 2, etc n(n-1) transfers 6. Write another script to merge all fXs on each Server 7. Write another script to count each word 14 Confidential and proprietary. Copyright © 2008 Aster Data Systems Does it work? Yes Does it take lots of time? Yes! Is it a pain? Yes!! Would you do it? No!!! 15 Confidential and proprietary. Copyright © 2008 Aster Data Systems Moreover… Who will manage your files? What if nodes fail? What if you want to add more nodes? What if… What if… What if… 16 Confidential and proprietary. Copyright © 2008 Aster Data Systems 17 Input Any file (e.g. documents) Input All <key, value> pairs with the same key grouped (e.g. all <word, count> pairs where word = “the”) Map() Reduce() Output Stream of <key, value> pairs (e.g. <word, count> pairs) Output Anything (e.g. sum of counts for a specific word) Confidential and proprietary. Copyright © 2008 Aster Data Systems The quick brown fox jumps over the lazy dog Map() In-Database MapReduce is the future. <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database,1> <database,1> <future,1> 18 <in, 1> <database, 1> <mapreduce,1> <is,1> <the,1> <future,1> Map() Server A Server B <world,1> <world,1> <powerful,1> <lazy,1> <brown,1> Map() and Redistributi on Phase <the, 1> <quick, 1> <brown,1> <fox,1> <jumps,1> <over,1> <the,1> <lazy,1> <dog,1> Switch Confidential and proprietary. Copyright © 2008 Aster Data Systems Server C Server D <mapreduce,1> <mapreduce,1> <be,1> <be,1> <to,1> <jumps,1> <computers,1> <hello,1> <is,1> <is,1> <is,1> <question,1> <over,1> <a,1> <that,1> Grouping and Reduce() Phase (on Server 1) <the, <the, <the, <the, <the, <the, 1> <the, 1> <the, 1> <the, 1> <the, 1> <database,1> <database,1> <future,1> 1> 1> 1> 1> 1> Server 1 Final Result File <database,1> <database,1> <future,1> 19 Reduce() Confidential and proprietary. Copyright © 2008 Aster Data Systems Reduce() Reduce() the 5 database 2 future 1 What Just Happened? By writing two small scripts with a few lines of code… … we achieved exactly the same result! Plus, our code did not have to care about: •the # of servers on the system (4 or 400?) • which server to send each word • any network communication aspects • any fault tolerance aspects •… 20 Confidential and proprietary. Copyright © 2008 Aster Data Systems Word Count was Only an Example! Google does all web indexing on MapReduce “The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3,800 lines of C++ code to approximately 700 lines when expressed using MapReduce.” Google 2004 MapReduce paper 21 Confidential and proprietary. Copyright © 2008 Aster Data Systems Word Count was Only an Example! Published work from Stanford University showed that even extremely complex Data Mining algorithms can fit in this very simple model “We adapt Google’s MapReduce paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN).” Stanford 2006 AI Lab paper 22 Confidential and proprietary. Copyright © 2008 Aster Data Systems Result? MapReduce makes writing parallel programs extremely easy… …and can accommodate from trivial to very complex algorithms… …thus enabling the processing of petabytes of data with a few lines of code! 23 Confidential and proprietary. Copyright © 2008 Aster Data Systems But… Today MapReduce is used only by hardcore coders/programmers/hackers Changes in MapReduce queries require changes in the MapReduce code itself • Constantly keep coding Using MapReduce with database data is hard and cumbersome… …when most of the structured data in the enterprise are stored in databases! 24 Confidential and proprietary. Copyright © 2008 Aster Data Systems Beyond SQL and MapReduce 25 Confidential and proprietary. Copyright © 2008 Aster Data Systems SQL vs MapReduce: Two different worlds? SQL 26 MapReduce Declarative Procedural • Specifies what needs to happen • Specifies how it needs to happen Execution plans optimized Code compiled once; dynamically MapReduce plans are static Input/output is Input/output is structured unstructured Data redistribution inferred Data redistribution based from SQL statement (in on <keys> in Reduce() MPP Databases) phase Confidential and proprietary. Copyright © 2008 Aster Data Systems Implementing MR in the Database Uses Polymorphic SQL operators to embed MapReduce functions to SQL Introduces a “PARTITION BY” clause to specify data redistribution Introduces a “SEQUENCE BY” clause to specify ordering of data flows to the MR functions Best of both worlds • Planning is still dynamic • MapReduce functions can be used like custom SQL operators • MapReduce functions can implement any algorithm or transformation • Code Once – Use Many (through SQL) model 27 Confidential and proprietary. Copyright © 2008 Aster Data Systems The SQL/MR Process 28 Write a SQL/MR function using Perl, Java, Python… Upload the script or .jar file to the database (one-time) (one-time) Confidential and proprietary. Copyright © 2008 Aster Data Systems Invoke SQL/MR function through SQL (infinite times) SQL/MR Function: Syntax SELECT… FROM (5) Select output (eg. count) MR_Function ( ON source_data [ PARTITION BY column ] [ ORDER BY column ] [Function Arguments] ) WHERE … GROUP BY … HAVING … ORDER BY … LIMIT …; 29 Optional conditions & filters Confidential and proprietary. Copyright © 2008 Aster Data Systems (4) Java/Python/… MR function (1) Source table or sub-select (2) <key> for data redistribution (3) Sort before the MR function Optional MR_Function Arguments Example 1: Tokenization Demo #1: Only Map (Tokenization) in SQL/MR SELECT word, count(*) AS wordcount FROM Tokenize( ON blogs ) GROUP BY word ORDER BY wordcount DESC LIMIT 20; Demo #2: Map (Tokenization) and Reduce (WordCount) in SQL/MR SELECT key AS word, value AS wordcount FROM WordCountReduce ( ON Tokenize ( ON blogs ) PARTITION BY key ) ORDER BY wordcount DESC LIMIT 20; Demo #3: Why do Reduce when you have SQL? SELECT word, count(*) AS wordcount FROM Tokenize( ON blogs ) GROUP BY word ORDER BY wordcount DESC LIMIT 20; 30 Confidential and proprietary. Copyright © 2008 Aster Data Systems Example 2: Sessionization What Is Sessionize? An example Aster SQL/MR function Leverages Aster’s Java library API What Does It Do? User specified a column (eg. timestamp) and a session timeout value (in seconds) Spits out unique session identifiers (sessionid column) Usage CREATE TABLE sessionized_clicks AS SELECT ts, userid, sessionid, ... FROM Sessionize( ON clicks PARTITION BY userid ORDER BY ts TIMEOUT 60 ); 31 Confidential and proprietary. Copyright © 2008 Aster Data Systems Example 2: Sessionization Clickstream 00:58:24 PrezBus h 10:00:24 Shawn1 02:30:33 PrezBus h 10:01:23 Shawn1 SQL/MR timesta userid mp 10:00:00 Shawn1 timesta userid mp 10:00:00 Shawn1 sessioni d 0 10:00:24 Shawn1 0 10:01:23 Shawn1 0 10:02:40 Shawn1 1 Session Timeout = 60 seconds timesta userid sessioni mp d 00:58:24 PrezBush 0 10:02:40 Shawn1 02:30:33 PrezBush 1 INPUT 32 Confidential and proprietary. Copyright © 2008 Aster Data Systems OUTPUT Slide 32 MR Applications in the Database ELT Text and data transformations, in-parallel, in-database Queries that become too complex for SQL E.g. Sessionize(), customer segmentation, predictive analytics, … Queries that SQL inherently cannot handle well Time series analytics Aster has a set of pre-defined SQL/MR functions for this Data structures that do not fit well the relational model Time series (again) Graphs, spatial data Any analytical or reporting application that requires more performance and data proximity! 33 Confidential and proprietary. Copyright © 2008 Aster Data Systems Summary Growing challenges in scaling analytical applications and reporting MapReduce is driving a data revolution (see: Google) In-Database MapReduce will open up databases to a host of new applications [email protected] (Questions, Comments) asterdata.com/blog (Lots of technical details) 1.888.Aster.Data (Any other information) 34 Confidential and proprietary. Copyright © 2008 Aster Data Systems