MalStone: Towards a Benchmark for Analytics on Large Data Clouds

Collin Bennett, David Locke, Robert L. Grossman, Jonathan Seidman, Steve Vejcik
Open Data Group, 400 Lathrop Ave, Suite 90, River Forest, IL 60305

KDD’10, July 25–28, 2010, Washington, DC, USA

OUTLINE
0. ABSTRACT
1. INTRODUCTION
2. Common Elements
3. MalStone A & B
4. MalGen
5. THREE IMPLEMENTATIONS
6. EXPERIMENTAL STUDIES
7. DISCUSSION
8. RELATED WORK
9. SUMMARY

0. ABSTRACT
Terasort
MalStone
MalGen

1. INTRODUCTION
Data mining for clouds: HBase, Apache Pig, Hive and ZooKeeper.
There are no benchmarks similar to Terasort for comparing two large data clouds that support building analytic models on large datasets.
The paper proposes MalStone for this purpose and also describes the implementation of a data generator for MalStone called MalGen.

2. Common Elements
Time stamps
Sites, e.g. Web sites, computers, network devices
Entities, e.g. visitors, users, flows
Log files fill disks; many, many disks
Behavior occurs at all scales
Want to identify phenomena at all scales
Need to group “similar behavior”
Need to do statistics (not just sorting)

Abstract the Problem Using Site-Entity Logs

Example                        Sites                   Entities
Measuring online advertising   Web sites               Consumers
Drive-by exploits              Web sites               Computers (identified by cookies or IP)
Compromised systems            Compromised computers   User accounts

3. MalStone A & B
MalStone Benchmark
Benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
Code to generate the required synthetic data is available from code.google.com/p/malgen
Stylized analytic computation that is easy to implement in MapReduce and its generalizations.

MalStone A computes $\rho_j$ for all sites $j$ in the log files.
MalStone B computes $\rho_{j,t}$ for sites $j$ in the log files.

Let $A_j$ be the set of entities that visit site $j$ during the exposure window.
Let $B_j \subseteq A_j$ be the set of all entities $e_i \in A_j$ that become marked at any time in the monitor window.

Example: the statistic is $|B_j| / |A_j| = (1 + 0 + 0)/(1 + 1 + 0) = 1/2$

4. MalGen
For generating site-entity log files
Tens of millions of sites
Hundreds of millions of entities
Billions of events
Most sites have few events
Some sites have many events
Most entities visit few sites
Some entities visit many sites

5. THREE IMPLEMENTATIONS
HDFS, Hadoop Streams and Python
Hadoop HDFS and MapReduce
Sector and Sphere UDFs (User Defined Functions)

6. EXPERIMENTAL STUDIES
Tests done on the Open Cloud Testbed.

Implementation        MalStone B   # Nodes   # Records    Size of Dataset
Sector/Sphere v1.20   44 min       20        10 billion   1 TB

7. DISCUSSION
Hadoop Streams does not require the map and reduce steps to be written in Java.
Python programs can be invoked by Hadoop Streams.

8. RELATED WORK
In 2008, Hadoop on Terasort: 297 sec.
In 2009, Hadoop on Terasort: 209 sec.
More recently, Terasort has been replaced by Minute Sort: sorting in about 1 min.
[MapReduce for machine learning on multicore] uses MapReduce, but does not describe a computation similar to the MalStone statistic.

9. SUMMARY
MalGen creates large amounts of data.
Performance depends on which cloud middleware is used for the computation.
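
As an illustration of the MalStone A and B statistics from Section 3, the following is a minimal single-machine sketch in Python, not the benchmark's reference code. The record layout `(timestamp, site, entity)`, the `mark_time` dictionary, the function names, and the half-open window convention are all assumptions made for this example. The toy data is chosen to reproduce the 1/2 example: one entity visits and becomes marked, one visits and stays unmarked, and one visits outside the exposure window. The windowing used for the time series in `malstone_b` (both windows run from a fixed start up to each period end) is one plausible reading and is not taken from the slides.

```python
from collections import defaultdict


def malstone_a(visits, mark_time, exposure, monitor):
    """Return {site: rho_j} where rho_j = |B_j| / |A_j|.

    visits    : iterable of (timestamp, site, entity) records (hypothetical layout)
    mark_time : dict mapping entity -> time it became marked (unmarked entities absent)
    exposure  : (start, end) half-open exposure window
    monitor   : (start, end) half-open monitor window
    """
    exp_start, exp_end = exposure
    mon_start, mon_end = monitor

    # A_j: entities that visited site j inside the exposure window.
    a = defaultdict(set)
    for t, site, entity in visits:
        if exp_start <= t < exp_end:
            a[site].add(entity)

    rho = {}
    for site, a_j in a.items():
        # B_j: members of A_j that became marked inside the monitor window.
        b_j = {e for e in a_j
               if e in mark_time and mon_start <= mark_time[e] < mon_end}
        rho[site] = len(b_j) / len(a_j)
    return rho


def malstone_b(visits, mark_time, start, period_ends):
    """Return {t: {site: rho_jt}}, one MalStone A style ratio per period end t.

    Assumption: for each t, both windows run from `start` up to t.
    """
    return {t: malstone_a(visits, mark_time, (start, t), (start, t))
            for t in period_ends}


# Toy log: e1 and e2 visit s1 in the exposure window, e3 visits too late.
visits = [(1, "s1", "e1"), (2, "s1", "e2"), (9, "s1", "e3")]
mark_time = {"e1": 6}  # only e1 becomes marked, at time 6

rho = malstone_a(visits, mark_time, exposure=(0, 5), monitor=(5, 10))
print(rho)  # {'s1': 0.5}

series = malstone_b(visits, mark_time, start=0, period_ends=[5, 10])
```

Sites with no visitors in the exposure window are simply left out of the result, which avoids a division by zero. In a MapReduce setting the same computation would group records by site in the map step and form the two sets in the reduce step.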