Analyzing big, continuous data

Dionysios Logothetis, Georgos Siganos (Telefonica Research)
In collaboration with: Ken Yocum (UC San Diego), Chris Olston, Ben Reed (Yahoo! Research)

Big data analytics
• Data abound: web content, e-mails, CDRs, financial, retail, and scientific data
• Data drive innovation in science, business, and finance
 – CDRs: detecting calling patterns
 – E-mails: targeted advertising
 – Social media feeds: brand monitoring
• Numerous other uses: fraud detection, product recommendation, content personalization, genomics, weather prediction, …

Big challenges
• Scale: data are growing too fast
 – Facebook: 15 TB/day in 2009, 130 TB/day in 2011
• Rich analytics: need more than relational queries
 – Some operations are not expressible in SQL, e.g. graph mining and machine learning
• Efficiency: do it quickly and economically
 – Data “freshness” is important
 – Computation and network capacity cost money

Data-Intensive Scalable Computing (DISC)
• Goal: allow users to easily analyze big data
 – MapReduce (Google), Hadoop (Yahoo!), Dryad (Microsoft)
• Flexible programming models
 – Allow sophisticated analytics
• Scalable architectures
 – Easy to parallelize analysis
 – Use thousands of cheap PCs

Scaling with massive data-parallelism
• Data sources (e.g. web crawlers) feed input into a compute cluster running data-processing operators
• Bigger data? Throw in more machinery: scale to virtually any data size

Update-driven analytics
• Scalability is not enough: analytics must constantly integrate new data
• Example: web analytics
 – New pages are crawled every hour
 – DISC systems re-compute the results from scratch; as data grow, they add more machines
 – Inefficient: only a fraction of the pages is updated
 – Wastes CPU, network, energy, and money
• We need a new way to program these analytics efficiently

Programming with state
• Must be able to re-use computation
• Incremental: data change across time
 – e.g. new web pages every hour
 – Update the result, do not re-compute it
• Iterative: data change across iterations
 – e.g. machine learning, graph mining
 – Refine the prior analysis result
• State is fundamental for efficiency

Example: incremental web analytics
• 90-node Hadoop cluster, real Yahoo! data set
• Incrementally maintain URL in-link counts (“cnn.com”, 2; “fox.com”, 3; …)
• Continuously:
 – Crawl and process 30 GB of new pages: count URLs
 – Read the old URL counts and increment them
 – Save the new URL counts to HDFS
• [Figure: processing time per increment. Hadoop’s running time is proportional to the state size; ideally it would be proportional to the state updates.]

The problem
• State is outside the scope of the processing system: the application runs on the processing system (e.g. Hadoop), which runs on storage (e.g. HDFS, a DBMS, Bigtable), and state management falls in between
• Programmability
 – Users are forced to manage state manually, using HDFS or local files, Bigtable, or a DBMS
 – Result: custom and fragile code
• Efficiency
 – The system is not aware of state, or treats it like any other input
 – Result: wasted data movement and processing

The Continuous Bulk Processing (CBP) model
• Idea: state as a first-class abstraction
• Abstracting state management makes programming easier
• A state-aware system enables optimizations

Continuous Bulk Processing (CBP)
• Scalable processing on bulk data
 – Groupwise processing primitive
• Sophisticated incremental/iterative analytics
 – Generic stateful operator
• Simplified dataflow programming
 – Primitives for programmatic dataflow control
• Efficient processing
 – Leverages the model to optimize state management

Groupwise processing
• Core concept underlying relational operations, e.g. GROUP BY
• Users specify a grouping key for the input records
• A user function is applied per group, e.g. count
 – The input “cnn.com”, “nbc.com”, “nbc.com”, “fox.com”, “cnn.com”, grouped by key (the URL), yields count(“cnn.com”) = 2, count(“nbc.com”) = 2, count(“fox.com”) = 1
• Allows each group to be processed in parallel

Groupwise processing in CBP
• User-defined Translate() function
 – Multiple input and output flows (Fin1, Fin2, …; Fout1, Fout2, …)
• A RouteBy() function extracts the grouping key
 – One RouteBy function per input flow
• Processing is performed in a stage, which:
 – Groups input records by key
 – Calls Translate(k, F1[], F2[], …) for every key in parallel
 – Runs repeatedly in epochs

Maintaining state in CBP
• State is modeled as a loopback flow (FState)
 – Writes to loopback flows appear in the next epoch
 – The stage groups input records with state records
• Translate uses loopback flows like any other flow; it can:
 – Add/modify state: write a new record
 – Delete state: do not write the record

Allowing selective state updates
• State grows in size, e.g. as new URL counts are added
• New data affect only a fraction of the state, e.g. a single state record
• Current systems do “outer” grouping
 – They call the user function for every key in the input or the state
 – This forces the system to access the whole state
• CBP allows “inner” grouping of input with state
 – Translate() is called only if the key exists in the input
 – This allows the system to access only the necessary state

Implementing the CBP model
• Naïve approach: use Hadoop/MapReduce as a “black box”
 – Leverages its scalability and fault tolerance
 – CBP emulation on MapReduce: RouteBy() runs inside map() and the shuffle groups records; Translate() runs inside reduce()
• Impedance mismatches:
 – Hadoop treats state as any other input, causing unnecessary state processing and movement
 – Hadoop supports only “outer” grouping, so it accesses the whole state even for small updates
• The CBP prototype implements the model together with the optimizations below

Incremental state shuffling
• Hadoop treats state like any other input: it re-maps and re-shuffles state in order to group it, even though the state is already grouped
 – Unnecessary processing and movement
• CBP separates state from other data
 – It stores state in side files and copies it directly to the reducers
 – This avoids re-processing and re-shuffling

Impact of incremental shuffling
• Same setup as before: 90-node Hadoop cluster, real Yahoo! data set, incrementally maintaining URL in-link counts
• Continuously: crawl and process 30 GB of new pages, count URLs, read the old counts and increment them, save the new counts to HDFS
• [Figure: processing time per increment for Hadoop, CBP, and the ideal. Hadoop’s running time is proportional to the state size; CBP takes 50% less time and shows near-constant processing time, proportional to the state updates.]

Supporting inner grouping
• Inner grouping selects a fraction of the state, which requires random access to state
• DISC systems are optimized for sequential access
• Lesson from the database community: maintain an index on the state records

Accessing state randomly
• Idea: put the state in a table-based store
 – Bigtable/Hypertable are scalable table stores that allow random access, but they offer more functionality than we need
• We implemented custom indexed files
 – Each reducer maintains its own indexed file
 – Reducers avoid scanning the whole state

Evaluation
• Validate the benefit of explicit state modeling and the impact of the optimizations on efficiency
 – Processing time
 – Data moved across the network
• Sophisticated applications on real-world data
 – Web analytics
 – PageRank
 – Clustering coefficients

The PageRank algorithm
• Used by Google to deliver search results: given a web graph, it assigns ranks to pages
• Algorithm overview:
 – Assign an initial rank to every page
 – Repeat until the ranks converge: vertices distribute their ranks to their neighbors, then update their own ranks

Solving graph problems
• How we usually do it:
 – Vertices maintain state, e.g. their neighbors and current rank
 – Local computation per vertex (e.g. a rank update) can be performed in parallel
 – Vertices exchange messages (e.g. to distribute ranks), which synchronizes them

Programming graphs in CBP
• A state record corresponds to a vertex; the grouping key is the vertex ID, e.g. the page URL
• The stage calls Translate() for every vertex, which updates the current state and outputs messages to a loopback flow
• The stage groups messages with state (i.e. with the vertex), so vertices “receive” messages
• Leveraging the CBP grouping primitives, inner grouping processes only the vertices with messages, not the whole graph

Incremental PageRank
• PageRank is an iterative algorithm that computes site popularity from the web graph; small changes affect the whole graph
• The incremental version [S. Chien et al.] re-computes only the most affected subgraph and makes use of inner grouping
• The experiment:
 – A real web graph with 7.5M nodes and 109M edges, to which we add 2,800 new edges
 – 16-node cluster

Results
• [Figure: cumulative running time and cumulative data moved per epoch for the CBP emulation, incremental shuffling, and inner grouping, across two experiment phases.]
• Incremental shuffling:
 – Reduces running time by 23%
 – Reduces network load by 46%
• Inner grouping:
 – Only 0.5% of the state needs to be accessed
 – Reduces running time by 53%

Summary of CBP
• Integrates state with groupwise processing: scalable stateful analytics
• An expressive model for sophisticated stateful analytics, used to implement real applications
• Leverages the model to optimize state management
 – Reduces processing times
 – Uses less compute and network resources

Mining large graphs in real time
• A lot of intelligence is hidden inside relations
 – Social graphs: friend/article recommendation, targeted ads
 – Call graphs in telcos: fraud detection, network monitoring
 – “Find what my community read in the last 5 minutes”
• Graphs may change rapidly: phone calls and posts arrive every second
• Challenge: update graph analytics in real time
 – “Detect suspicious calling patterns for every new call”

Incremental graph algorithms
• Real-time operation requires incremental updates; re-computing from scratch is too costly
• CBP supports explicitly incremental programs
 – Easy for some analytics, e.g. a sum
 – But graph algorithms are more complex: clustering, PageRank, pattern matching, etc. How do you update a clustering when the graph changes?
• Designing incremental graph algorithms is hard

Real-time graph mining with GraphInc
• Idea: the same trick, computation re-use
• But users write algorithms as if on static data
 – Much simpler; everything works as before
• The system updates the analysis transparently
 – It automatically detects opportunities for re-use

Example: computing shortest paths
• When a new edge appears between a source S and a destination D, identify the computations that do not change
• When changes happen, execute only the diff
 – Remember the previous state and the messages sent; if they all remain the same, there is no need to execute
• This results in computation and communication savings

Thank you
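To make the stateful groupwise model concrete, here is a minimal single-machine Python sketch of the deck’s running URL in-link counting example. The names route_by, translate, and run_epoch are hypothetical stand-ins for illustration, not the real CBP API; the loop mimics one epoch of a stage doing inner grouping (state records whose key is absent from the input are never read or written).

```python
from collections import defaultdict

def route_by(record):
    """Extract the grouping key from an input record (here each record is a URL)."""
    return record

def translate(key, new_records, state_records):
    """Per-group user function: add the new in-links to the stored count.

    Returning a value plays the role of writing a state record to the
    loopback flow; it will be visible in the next epoch.
    """
    old = state_records[0] if state_records else 0
    return old + len(new_records)

def run_epoch(state, new_urls):
    """One CBP-style epoch: group the input by key, join it with state,
    and call translate() only for keys present in the input (inner
    grouping), so untouched state is never accessed."""
    groups = defaultdict(list)
    for rec in new_urls:
        groups[route_by(rec)].append(rec)
    for key, records in groups.items():
        prior = [state[key]] if key in state else []
        state[key] = translate(key, records, prior)
    return state

state = {}
state = run_epoch(state, ["cnn.com", "nbc.com", "nbc.com", "fox.com", "cnn.com"])
state = run_epoch(state, ["cnn.com"])  # incremental update: only one group is touched
print(state)  # {'cnn.com': 3, 'nbc.com': 2, 'fox.com': 1}
```

The second epoch illustrates the efficiency argument from the slides: only the “cnn.com” group is grouped, joined with its state record, and rewritten; the other counts are carried forward untouched.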
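The vertex-as-state-record pattern from the “Programming graphs in CBP” slide can be sketched the same way. Below is an illustrative single-machine model of single-source shortest paths in the vertex/message style (assumed for illustration; it is not the CBP or GraphInc code): each vertex’s distance is a state record, each epoch groups incoming messages with vertex state, and only vertices that received messages are processed, which is exactly the inner-grouping behavior the deck describes.

```python
def shortest_paths(edges, source):
    """Vertex-centric SSSP: state = current distance, messages = candidate
    distances; an epoch processes only vertices with pending messages."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, [])
    dist = {v: float("inf") for v in graph}  # per-vertex state
    inbox = {source: [0]}                    # initial message to the source
    while inbox:
        outbox = {}
        for v, msgs in inbox.items():        # inner grouping: messaged vertices only
            best = min(msgs)
            if best < dist[v]:               # state changes -> notify neighbors
                dist[v] = best
                for nbr, w in graph[v]:
                    outbox.setdefault(nbr, []).append(best + w)
        inbox = outbox                       # loopback flow: next epoch's input
    return dist

d = shortest_paths([("S", "A", 1), ("A", "B", 2), ("S", "B", 4)], "S")
print(d)  # {'S': 0, 'A': 1, 'B': 3}
```

The `if best < dist[v]` guard is also a toy version of the GraphInc re-use idea: when a vertex’s state would not change, nothing is executed and no messages are sent, so the computation naturally processes only the “diff” after an update.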
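Finally, the PageRank loop summarized in the “algorithm overview” slide (vertices distribute their ranks to neighbors, then update their own ranks) can be written as a short power iteration. This is a generic textbook sketch, not the incremental algorithm evaluated in the talk; the damping factor 0.85 is a common default that the slides do not state, and the sketch assumes every linked page also appears as a key in the graph.

```python
def pagerank(graph, iters=50, d=0.85):
    """Plain power-iteration PageRank over an adjacency dict
    {page: [outgoing links]}. Dangling pages simply lose their mass
    in this simplified sketch."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}           # initial rank
    for _ in range(iters):
        new = {v: (1 - d) / n for v in graph}    # base rank per page
        for v, nbrs in graph.items():
            if nbrs:
                share = d * rank[v] / len(nbrs)  # distribute rank to neighbors
                for u in nbrs:
                    new[u] += share
        rank = new                               # vertices update their own ranks
    return rank

ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["A"]})  # 3-page cycle: each rank = 1/3
```

Re-running this from scratch whenever an edge arrives is exactly the cost the incremental version [S. Chien et al.] avoids by re-computing only the most affected subgraph.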