MalStone: Towards A Benchmark for Analytics on Large Data Clouds

Collin Bennett, David Locke, Robert L. Grossman, Jonathan Seidman, Steve Vejcik
Open Data Group, 400 Lathrop Ave, Suite 90, River Forest, IL 60305

KDD’10, July 25–28, 2010, Washington, DC, USA

OUTLINE
0. ABSTRACT
1. INTRODUCTION
2. Common Elements
3. MalStone A & B
4. MalGen
5. THREE IMPLEMENTATIONS
6. EXPERIMENTAL STUDIES
7. DISCUSSION
8. RELATED WORK
9. SUMMARY

0. ABSTRACT
- Terasort
- MalStone
- MalGen

1. INTRODUCTION
- Data mining for clouds: HBase, Apache Pig, Hive and ZooKeeper.
- There are no similar benchmarks for comparing two large data clouds that support building analytic models on large datasets.
- The paper proposes MalStone for this purpose and also describes the implementation of MalGen, a data generator for MalStone.

2. Common Elements
- Time stamps
- Sites, e.g. Web sites, computers, network devices
- Entities, e.g. visitors, users, flows
- Log files fill disks; many, many disks
- Behavior occurs at all scales
- Want to identify phenomena at all scales
- Need to group “similar behavior”
- Need to do statistics (not just sorting)

Abstract the problem using site-entity logs:

Example                        Sites                   Entities
Measuring online advertising   Web sites               Consumers
Drive-by exploits              Web sites               Computers (identified by cookies or IP)
Compromised systems            Compromised computers   User accounts

3. MalStone A & B
MalStone Benchmark
- Benchmark developed by the Open Cloud Consortium for clouds supporting data-intensive computing.
- Code to generate the required synthetic data is available from code.google.com/p/malgen
- Stylized analytic computation that is easy to implement in MapReduce and its generalizations.

MalStone A computes $\rho_j$ for all sites $j$ in the log files.
MalStone B computes $\rho_{j,t}$ for sites $j$ in the log files.

Let $A_j$ denote the entities that visit site $j$, and let $B_j \subseteq A_j$ be the set of all entities $e_i \in A_j$ that become marked at any time in the monitor window. The statistic for site $j$ is the ratio $\rho_j = |B_j| / |A_j|$.

Example: the statistic is (1 + 0 + 0)/(1 + 1 + 0) = 1/2.

4. MalGen
- Tens of millions of sites
- Hundreds of millions of entities
- Billions of events
- Most sites have few events
- Some sites have many events
- Most entities visit a few sites
- Some entities visit many sites
- Used for generating site-entity log files

5. THREE IMPLEMENTATIONS
- HDFS, Hadoop Streams and Python
- Hadoop HDFS and MapReduce
- Sector and Sphere UDFs (User Defined Functions)

6. EXPERIMENTAL STUDIES
Tests were done on the Open Cloud Testbed.

System                 Benchmark    Time     # Nodes   # Records    Size of Dataset
Sector/Sphere v1.20    MalStone B   44 min   20        10 billion   1 TB

7. DISCUSSION
- Hadoop Streams does not require the MapReduce framework.
- Python programs can be invoked by Hadoop Streams.

8. RELATED WORK
- In 2008, Hadoop completed Terasort in 297 sec.
- In 2009, Hadoop completed Terasort in 209 sec.
- Terasort has since been replaced by Minute Sort, which runs for about 1 min.
- "MapReduce for machine learning on multicore" uses MapReduce but does not describe a computation similar to the MalStone statistic.

9. SUMMARY
- MalGen creates large amounts of data.
- Performance depends on which cloud middleware is used for the computation.
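
To make the MalStone A statistic from Section 3 concrete, here is a minimal Python sketch, not the benchmark's reference implementation. It assumes each log record has already been reduced to a hypothetical (site_id, entity_id) pair and that the set of entities marked during the monitor window is known in advance. The function name `malstone_a` and the toy data are illustrative only. The data is chosen so that one site yields the ratio 1/2.

```python
from collections import defaultdict
from fractions import Fraction


def malstone_a(visits, marked_entities):
    """Return {site: fraction of its visiting entities that became marked}."""
    visitors = defaultdict(set)  # site -> set of distinct entities that visited it
    for site, entity in visits:
        visitors[site].add(entity)
    return {
        site: Fraction(len(entities & marked_entities), len(entities))
        for site, entities in visitors.items()
    }


# Hypothetical toy log of (site_id, entity_id) visits; repeat visits are allowed.
visits = [("s1", "e1"), ("s1", "e2"), ("s2", "e2"), ("s1", "e1")]
marked = {"e1"}  # assumed: entities that became marked in the monitor window

ratios = malstone_a(visits, marked)
print(ratios["s1"])  # 1/2
print(ratios["s2"])  # 0
```

Entities are collected into sets so that repeat visits by the same entity count once. A MalStone B style variant would additionally key the counts by a time window; that is left out of this sketch.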
