On the role of Interactivity and Data
Placement in Big Data Analytics
Srini Parthasarathy
OSU
The Data Deluge: Data Data Everywhere
Data Storage is Cheap
$600 to buy a disk drive that can store all of the world’s music
[McKinsey Global Institute Special Report, June ’11]
Data does not exist in isolation.
Data almost always exists in
connection with other data – integral
part of the value proposition.
Social networks
VLSI networks
Protein Interactions
Internet
Neighborhood
Data dependencies
graphs
Big Data Problem: All this data is only
useful if we can scalably extract useful
knowledge from such complex data
THIS TALK
• THE ROLE OF DATA PLACEMENT IN BIG DATA
SYSTEMS
• THE ROLE OF VISUALIZATION AND
INTERACTION IN BIG DATA ANALYSIS
GLOBAL GRAPHS
• What?
– System for deploying applications processing complex data
• Why?
– Seeks balance between high productivity and high performance
• How?
– Built on top of PNL’s GlobalArrays
– Trees (GlobalTrees, GlobalForests)
– Relational Arrays (ArrayDB-GA)
– Graphs (GlobalGraphs)
• Data Placement is key to high performance
Importance of Data Placement
• Locality
– Placing related items close to each other so they may
be processed together
• Mitigating Impact of Data Skew
– Reducing load imbalance in a parallel setting
– Reducing variance in partition samples
• Generating Stratified Samples
– Improving interactive performance
Key Ideas
• Pivotization
– Convert data with complex structure into sets
– Each element of set captures features of local topology
• Hashing into Strata: Hash related sets into similar bins
– Can employ a sketch-clustering algorithm
• Partitioning: Place strata into partitions for
– Locality
– Mitigating data skew
– Samples
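The first two ideas can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the GlobalGraphs implementation: the pivot set of a graph node is taken to be the node plus its neighbors, the sketch is a plain min-wise hash with four seeded hash functions, and `stratum_of` (a hypothetical helper) bins by the first sketch value as a crude stand-in for a sketch-clustering step.

```python
import hashlib

def pivot_set(graph, node):
    """Pivotization (assumed form): a node's local topology as a set,
    here simply the node itself plus its neighbors."""
    return frozenset(graph[node]) | {node}

def _hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_sketch(items, num_hashes=4):
    """Min-wise hashing: keep the minimum hash value per seeded hash function.
    Two sets agree on a given component with probability equal to their
    Jaccard similarity."""
    return tuple(min(_hash(x, seed) for x in items) for seed in range(num_hashes))

def stratum_of(sketch, num_strata=128):
    """Hypothetical binning rule: use the first sketch component.
    A sketch-clustering algorithm would replace this in practice."""
    return sketch[0] % num_strata

# Usage: nodes a and b have the same pivot set, so they share a stratum.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["e"], "e": ["d"]}
strata = {n: stratum_of(minhash_sketch(pivot_set(graph, n))) for n in graph}
```

Identical pivot sets always produce identical sketches, so they always land in the same stratum; similar sets do so with probability that grows with their overlap.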
[Figure: data placement pipeline. DATA (Δ) → PIVOT TRANSFORMATIONS → PIVOT SETS (PS) → MINWISE HASHING on PIVOT SETS → SKETCHES (SK) → SKETCHSORT or SKETCHCLUSTER → Strata (S) → PARTITIONING & REPLICATION → partitions (P). Labels shown: Δ1 … Δ25, PS-1 … PS-25, SK-1 … SK-25, S-1 … S-128, P-1 … P-8. Example: stratum S-4 holds (Δ1, SK-1), (Δ5, SK-5), (Δ12, SK-12), (Δ25, SK-25).]
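The last stage, placing strata into partitions, could be sketched in two ways depending on the goal. Both functions below are hypothetical illustrations, not the method from the talk: `place_for_locality` keeps each stratum whole and greedily assigns it to the least-loaded partition, while `place_for_balance` deals each stratum's items round-robin so every partition holds a share of every stratum. Replication is not modeled.

```python
import heapq
from collections import defaultdict

def place_for_locality(strata, num_partitions):
    """Keep each stratum whole; assign largest strata first to the
    currently least-loaded partition (greedy load balancing)."""
    heap = [(0, p) for p in range(num_partitions)]  # (load, partition id)
    heapq.heapify(heap)
    placement = defaultdict(list)
    for _, items in sorted(strata.items(), key=lambda kv: -len(kv[1])):
        load, p = heapq.heappop(heap)
        placement[p].extend(items)
        heapq.heappush(heap, (load + len(items), p))
    return dict(placement)

def place_for_balance(strata, num_partitions):
    """Deal items round-robin, stratum by stratum, so each partition
    receives a stratified share of the data."""
    placement = defaultdict(list)
    i = 0
    for sid in sorted(strata):
        for item in strata[sid]:
            placement[i % num_partitions].append(item)
            i += 1
    return dict(placement)

# Usage with three strata of unequal size (made-up data).
example = {1: list(range(6)), 2: list(range(10, 14)), 3: list(range(20, 22))}
by_locality = place_for_locality(example, 2)
by_balance = place_for_balance(example, 2)
```

The first policy favors processing related items together; the second reduces skew and makes each partition usable as a stratified sample.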
Frequent Tree Mining
• Our proposed approach shows 100X gains
WebGraph Compression
• Linear Scaleup with no loss in compression ratio
PRISM-HD: PRobing the Intrinsic Structure and Makeup of High-dimensional Data
Visualization and Interactivity are key to
discovery
PRISM-HD
• What?
– A novel mechanism for exploring
complex data
• Why?
– User is often overwhelmed with
characteristics of data
– Befuddled on where to start
• How?
– Given a similarity measure of interest
– Compute similarity graph at threshold (t)
  • Key: Graphs are dimensionless
– Provide user graph visualization cues
  • User determines next threshold and repeats
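One pass of this loop might look like the following minimal Python sketch. It is an illustration under stated assumptions, not PRISM-HD itself: Jaccard similarity stands in for the measure of interest, the graph is built by brute-force pairwise comparison, and the number of connected components serves as a stand-in for a visualization cue. The names `similarity_graph` and `count_components` are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two collections (assumed measure of interest)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_graph(records, t, sim=jaccard):
    """Edges between every pair of records whose similarity is at least t."""
    edges = set()
    for (i, x), (j, y) in combinations(records.items(), 2):
        if sim(x, y) >= t:
            edges.add((i, j))
    return edges

def count_components(nodes, edges):
    """Number of connected components, via a small union-find."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

# Usage: the user inspects the cue at one threshold, then picks the next.
records = {"r1": {1, 2, 3}, "r2": {1, 2, 3, 4}, "r3": {3, 4, 5}, "r4": {9}}
for t in (0.7, 0.3):
    edges = similarity_graph(records, t)
    print(t, len(edges), count_components(records, edges))
```

Lowering the threshold adds edges and merges components, which is the kind of change a user would watch for when choosing the next threshold.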
[Figure panels: HIGH THRESHOLD, MODERATE THRESHOLD, LOW THRESHOLD]
Benefits of Knowledge Caching
Benefits of Incremental Processing
on Twitter
Incremental estimates on Twitter t1 = 0.95
PRISM-HD and Global Graphs in Context:
Leveraging Social Media in Emergency Response
Concluding Remarks
• Data is everywhere
• Data is fraught with complexities
– Dimensionality, dynamics, structure, massive…
• Both data placement and data interactivity
have an important role to play in big data
analytics
– PRISM-HD and GlobalGraphs can help!
Thanks for your attention
Contact: [email protected]
Mining Simulation Data
Medical Image Analysis
Protein Interaction Network (yeast)
Acknowledgements: Various NSF, NIH, DOE and industry grants