Download S-4 - OSU CSE

On the Role of Interactivity and Data Placement in Big Data Analytics
Srini Parthasarathy, OSU

The Data Deluge: Data, Data Everywhere

Data Storage is Cheap
• $600 buys a disk drive that can store all of the world's music [McKinsey Global Institute Special Report, June '11]

Data does not exist in isolation.
• Data almost always exists in connection with other data; these connections are an integral part of the value proposition.
• Examples: social networks, VLSI networks, protein interactions, the Internet, neighborhood graphs, data dependency graphs

Big Data Problem
• All this data is only useful if we can scalably extract useful knowledge from such complex data

THIS TALK
• The role of data placement in big data systems
• The role of visualization and interaction in big data analysis

GLOBAL GRAPHS
• What?
  – System for deploying applications processing complex data
• Why?
  – Seeks balance between high productivity and high performance
• How?
  – Built on top of PNL's GlobalArrays
  – Trees (GlobalTrees, GlobalForests)
  – Relational Arrays (ArrayDB-GA)
  – Graphs (GlobalGraphs)
• Data Placement is key to high performance

Importance of Data Placement
• Locality
  – Placing related items close to each other so they may be processed together
• Mitigating Impact of Data Skew
  – Reducing load imbalance in a parallel setting
  – Reducing variance in partition samples
• Generating Stratified Samples
  – Improving interactive performance

Key Ideas
• Pivotization
  – Convert data with complex structure into sets
  – Each element of a set captures features of local topology
• Hashing into Strata: hash related sets into similar bins
  – Can employ a sketch-clustering algorithm
• Partitioning: place Strata into partitions for
  – Locality
  – Mitigating Data Skew
  – Samples

[Figure: placement pipeline. DATA (Δ1 to Δ25) is converted by TRANSFORMATIONS into PIVOT SETS (PS-1 to PS-25). MINWISE HASHING on PIVOT SETS yields SKETCHES (SK), e.g. SK-1 = {1050, 2020, 3130, 1800} and SK-25 = {1050, 2020, 7225, 2020}. SKETCHSORT or SKETCHCLUSTER groups these into Strata (S-1 to S-128), e.g. S-4 holds (Δ1, SK-1), (Δ5, SK-5), (Δ12, SK-12), (Δ25, SK-25). PARTITIONING & REPLICATION places strata into partitions P-1 to P-8, e.g. P-2 holds S-4, S-7, S-8, S-12, ..., S-128 and P-8 holds S-3, S-4, S-9, S-12, ..., S127.]

Frequent Tree Mining
• Our proposed approaches show 100X gains

WebGraph Compression
• Linear scaleup with no loss in compression ratio

PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data
• Visualization and Interactivity are key to discovery

PRISM-HD
• What?
  – A novel mechanism for exploring complex data
• Why?
  – User is often overwhelmed with characteristics of data
  – Befuddled on where to start
• How?
  – Given a similarity measure of interest
  – Compute similarity graph at threshold (t)
    • Key: Graphs are dimensionless
  – Provide user graph visualization cues
    • User determines next threshold and repeats

[Figure: similarity graphs at HIGH THRESHOLD, MODERATE THRESHOLD, and LOW THRESHOLD]

Benefits of Knowledge Caching

Benefits of Incremental Processing on Twitter
[Figure: incremental estimates on Twitter, t1 = 0.95]

PRISM-HD and Global Graphs in Context: Leveraging Social Media in Emergency Response

Concluding Remarks
• Data is everywhere
• Data is fraught with complexities
  – Dimensionality, dynamics, structure, massive…
• Both data placement and data interactivity have an important role to play in big data analytics
  – PRISM-HD and GlobalGraphs can help!

Thanks for your attention
Contact: [email protected]
[Figure labels: Mining Simulation Data, Medical Image Analysis, Protein Interaction Network (yeast)]
Acknowledgements: Various NSF, NIH, DOE and industry grants
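
The pipeline on the Key Ideas slide (pivot sets, minwise hashing into sketches, strata, partitions) can be illustrated with a minimal Python sketch. This is not the GlobalGraphs implementation: the function names, the BLAKE2b-based hash family, the rule that records with identical sketches share a stratum (a simple stand-in for SKETCHSORT or SKETCHCLUSTER), and the greedy least-loaded placement are all assumptions made for illustration. Replication is omitted.

```python
import hashlib
from collections import defaultdict


def element_hash(element: str, seed: int) -> int:
    """Deterministic 64-bit hash of one pivot-set element under a seed (assumed hash family)."""
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(pivot_set, num_hashes=4):
    """Minwise-hash a pivot set: keep the minimum hash value under each seed."""
    return tuple(
        min(element_hash(e, seed) for e in pivot_set)
        for seed in range(num_hashes)
    )


def stratify(pivot_sets, num_hashes=4):
    """Group record ids with identical sketches into one stratum.

    Hypothetical simplification: a sketch-clustering step would also
    merge records whose sketches are merely similar.
    """
    strata = defaultdict(list)
    for record_id, pivot_set in pivot_sets.items():
        strata[minhash_sketch(pivot_set, num_hashes)].append(record_id)
    return list(strata.values())


def partition(strata, num_partitions):
    """Place whole strata on the currently least-loaded partition.

    Keeping a stratum together favors locality; placing the largest
    strata first limits load imbalance (an assumed heuristic).
    """
    partitions = [[] for _ in range(num_partitions)]
    for stratum in sorted(strata, key=len, reverse=True):
        target = min(partitions, key=len)
        target.extend(stratum)
    return partitions


# Toy pivot sets: the labels stand in for local-topology features.
example_pivot_sets = {
    "d1": {"A", "B", "C"},
    "d2": {"C", "B", "A"},
    "d3": {"X", "Y"},
}
example_partitions = partition(stratify(example_pivot_sets), num_partitions=2)
```

Because a sketch depends only on the members of a set, records with equal pivot sets always fall into the same stratum. Two sets agree on any single sketch position with probability equal to their Jaccard similarity, which is what lets a clustering step over sketches bring related records together.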
