Big Data Analytics
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
2
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
3
The Age of Big Data
 28 million Wikipedia pages
 1 billion Facebook users
 6 billion Flickr photos
 72 hours of video uploaded to YouTube each minute
“…growing at 50 percent a year…”
“…data a new class of economic asset, like currency or gold.”
4
Graphs are Everywhere
 Social networks
 Collaborative filtering: users ↔ movies (Netflix)
 Probabilistic analysis
 Text analysis: docs ↔ words (Wiki)
5
Introduction
 Graphs are analyzed in many important contexts:
    Ranking search results based on the hyperlink structure of the web
    Module detection in protein-protein interaction networks
    Social network analysis
 Many graphs of interest are difficult to analyze, often spanning millions of vertices and billions of edges
6
How can we solve the problem?
How do we assign billions of vertices/edges (the graph) to thousands or millions of computers (the distributed system)?
7
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
8
MapReduce: a Powerful Tool for Tackling Large-Data Problems
 Bioinformatics: DNA sequence assembly, protein-protein interaction networks
 Recommendation systems
 Search engines
 Text processing and machine translation (e.g., “How are you?”)
9
What is MapReduce?
 Programming model for data-intensive computing on commodity clusters
 Pioneered by Google
    Processes 20 PB of data per day
 Popularized by the Apache Hadoop project
    Used by Yahoo!, Facebook, Amazon, …
10
What is MapReduce Used For?
• At Google:
  – Index building for Google Search
  – Article clustering for Google News
  – Statistical machine translation
• At Yahoo!:
  – Index building for Yahoo! Search
  – Spam detection for Yahoo! Mail
• At Facebook:
  – Data mining
  – Ad optimization
  – Spam detection
11
What is MapReduce Used For?
 In research:
    Analyzing Wikipedia conflicts (PARC)
    Natural language processing (CMU)
    Climate simulation (Washington)
    Bioinformatics (Maryland)
    Particle physics (Nebraska)
    <Your application here>
12
MapReduce Goals
• Scalability to large data volumes:
  – Scan 100 TB on 1 node @ 50 MB/s = 24 days
  – Scan on 1000-node cluster = 35 minutes
• Cost-efficiency:
  – Commodity nodes (cheap, but unreliable)
  – Commodity network (low bandwidth)
  – Easy to use (fewer programmers)
13
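These scan times follow from simple arithmetic; a quick back-of-the-envelope check, taking 1 TB as $10^{12}$ bytes:

$\frac{100\ \text{TB}}{50\ \text{MB/s}} = \frac{10^{14}\ \text{B}}{5\times 10^{7}\ \text{B/s}} = 2\times 10^{6}\ \text{s} \approx 23\text{-}24\ \text{days}$

$\frac{2\times 10^{6}\ \text{s}}{1000\ \text{nodes}} = 2000\ \text{s} \approx 33\text{-}35\ \text{min}$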
Hadoop – How was it Born?
Hadoop implements the MapReduce computational model as an open-source Apache project.
[Slide also shows related Apache projects: Apache ZooKeeper, Apache Tomcat.]
14
Typical Hadoop Cluster
15
Hadoop Evolution Graph
16
Hadoop Distributed File System (HDFS)
 Files split into 64 MB blocks
 Blocks replicated across several datanodes (often 3)
 Namenode stores metadata (file names, block locations, etc.)
 Optimized for large files and sequential reads
[Diagram: the namenode maps File1 to blocks 1-4; each block is stored on three of the datanodes.]
17
HBase
 A distributed data store that can scale horizontally to 1,000s of commodity servers
 Designed to operate on top of the Hadoop Distributed File System (HDFS)
 Depends on ZooKeeper and by default uses a ZooKeeper instance as the authority on cluster state
18
MapReduce Data Flow (WordCount)
Input splits:   “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
Map output:     (the,1)(quick,1)(brown,1)(fox,1) | (the,1)(fox,1)(ate,1)(the,1)(mouse,1) | (how,1)(now,1)(brown,1)(cow,1)
Shuffle & sort: group values by key
Reduce output:  (brown,2)(fox,2)(how,1)(now,1)(the,3) | (ate,1)(cow,1)(mouse,1)(quick,1)
19
Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
20
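One way to make this pseudocode runnable is Hadoop Streaming. The single-file Python sketch below is illustrative: the script name, HDFS paths, and the streaming-jar location in the comment are assumptions, and Streaming delivers reducer input sorted by key.

#!/usr/bin/env python
# wordcount_streaming.py -- acts as mapper or reducer for Hadoop Streaming.
# Illustrative invocation (jar path and HDFS paths are assumptions):
#   hadoop jar hadoop-streaming.jar \
#     -files wordcount_streaming.py \
#     -mapper "python wordcount_streaming.py map" \
#     -reducer "python wordcount_streaming.py reduce" \
#     -input /data/text -output /data/counts
import sys

def mapper():
    # Emit (word, 1) for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Streaming sorts mapper output by key, so equal words arrive consecutively.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()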
Word Count with Combiner
Input splits:   “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
Map + combine:  (the,1)(quick,1)(brown,1)(fox,1) | (the,2)(fox,1)(ate,1)(mouse,1) | (how,1)(now,1)(brown,1)(cow,1)
Shuffle & sort: group values by key
Reduce output:  (brown,2)(fox,2)(how,1)(now,1)(the,3) | (ate,1)(cow,1)(mouse,1)(quick,1)
21
Word Count
[Slide shows the Map function, the Reduce function, and the driver code to run this program as a MapReduce job.]
22
Example: Search
• Input: (lineNumber, line) records
• Output: lines matching a given pattern
• Map:
    if line matches pattern:
        output(line)
• Reduce: identity function
  – Alternative: no reducer (map-only job)
23
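A matching map-only Streaming mapper might look like the following sketch; the pattern and the decision to run with zero reducers are illustrative.

#!/usr/bin/env python
# grep_mapper.py -- map-only search: emit lines matching a pattern; no reducer needed.
import re
import sys

PATTERN = re.compile(r"ERROR")   # illustrative pattern

for line in sys.stdin:
    if PATTERN.search(line):
        sys.stdout.write(line)   # identity output; the reduce step is omitted entirely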
Hadoop Workflow
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job to the Hadoop cluster
4. Retrieve data from HDFS
24
Architecture overview
[Diagram: the user submits jobs to a master node running the Job Tracker; slave nodes 1…N each run a Task Tracker that manages local workers.]
25
Task Scheduling in MapReduce
 MapReduce adopts a master-slave architecture
 The master node in MapReduce is referred to as the Job Tracker (JT)
 Each slave node in MapReduce is referred to as a Task Tracker (TT)
 MapReduce adopts a pull scheduling strategy rather than a push one
[Diagram: the JT holds a tasks queue (T0, T1, T2, …); TTs with free task slots pull tasks from it.]
26
Fault Tolerance in MapReduce
1. If a task crashes:
   – Retry on another node
2. If a node crashes:
   – Relaunch its current tasks on other nodes
3. If a task is going slowly (a straggler):
   – Launch a second copy of the task on another node
   – Take the output of whichever copy finishes first, and kill the other one
27
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
28
Motivation
29
Pig Latin
Need a high-level, general data flow language
30
Pig
 Started at Yahoo! Research
 Runs about 50% of Yahoo!’s jobs
 Features:
    Expresses sequences of MapReduce jobs
    Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
    Easy to plug in Java functions
31
An Example Problem
Find the top 5 most visited pages by users aged 18-25:
Load Users → Filter by age
Load Pages
Join on name → Group on url → Count clicks → Order by clicks → Take top 5
32
In MapReduce
33
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In Pig Latin
Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
34
Implementation
[Diagram: the user writes SQL or Pig Latin; Pig automatically rewrites and optimizes it into Map-Reduce jobs that run on the Hadoop cluster.]
Sweet spot between map-reduce and SQL
35
Compilation into Map-Reduce
Every group or join operation forms a map-reduce boundary; other operations are pipelined into the map and reduce phases.
[Example plan: Map1 loads Visits and groups by url; Reduce1 generates a count per url; Map2 loads Url Info and joins on url; Reduce2 groups by category; Map3/Reduce3 generate top10(urls) for each category.]
36
Hive
 Developed at Facebook
 Used for most Facebook jobs
 Relational database built on Hadoop
    Maintains table schemas
    SQL-like query language
    Supports complex data types and some query optimization
37
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
38
Twitter network visualization, by Akshay Java, 2009
PROPERTIES OF REAL WORLD GRAPHS
39
Natural Graphs
[Image from WikiCommons]
Properties of Natural Graphs
[Figure: a regular mesh vs. a natural graph; natural graphs have a power-law degree distribution.]
41
PageRank
43
Computing PageRank
 Properties of PageRank
    Can be computed iteratively
    Effects at each iteration are local
 Sketch of algorithm:
    Start with seed PRi values
    Each page distributes PRi “credit” to all pages it links to
    Each target page adds up the “credit” from multiple inbound links to compute PRi+1
    Iterate until values converge
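As a concrete rendering of this sketch, here is a small self-contained Python version; the damping factor, tolerance, and dangling-page handling are assumptions not spelled out on the slide.

def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}            # seed PR_0 values
    for _ in range(max_iter):
        nxt = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                             # dangling page: spread credit evenly
                for q in pages:
                    nxt[q] += damping * pr[page] / len(pages)
            else:
                for q in outlinks:                       # distribute PR_i "credit" over outlinks
                    nxt[q] += damping * pr[page] / len(outlinks)
        if sum(abs(nxt[p] - pr[p]) for p in pages) < tol:    # iterate until values converge
            return nxt
        pr = nxt
    return pr

# Tiny example graph: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))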
Iterative Computation is Difficult
• System is not optimized for iteration:
[Diagram: each iteration runs as a separate job over partitioned data on CPUs 1-3, paying a startup penalty and a disk penalty between iterations.]
44
45
MODEL OF COMPUTATION
 A directed graph is given to Pregel
 It runs the computation at each vertex
 Until all nodes vote to halt
 Then it returns the results
46
Vertex State Machine
 Algorithm termination is based on every vertex voting to halt
 In superstep 0, every vertex is in the active state
 A vertex deactivates itself by voting to halt
 It can be reactivated by receiving an (external) message
Pregel: Bulk Synchronous Parallel
Each superstep: compute → communicate → barrier
47
http://dl.acm.org/citation.cfm?id=1807184
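To make the superstep / vote-to-halt cycle concrete, here is a toy single-machine simulation in Python of a Pregel-style vertex program that propagates the maximum value in the graph. It is an illustrative sketch, not Pregel's or Giraph's actual API.

def run_pregel(graph, values):
    # graph: vertex -> list of out-neighbours; values: vertex -> initial value
    active = set(graph)                     # superstep 0: every vertex is active
    inbox = {v: [] for v in graph}
    superstep = 0
    while active:                           # run until all vertices have voted to halt
        outbox = {v: [] for v in graph}
        for v in list(active):
            new_val = max([values[v]] + inbox[v])
            changed = new_val > values[v]
            values[v] = new_val
            if changed or superstep == 0:   # send the value along out-edges
                for nbr in graph[v]:
                    outbox[nbr].append(new_val)
            if not changed and superstep > 0:
                active.discard(v)           # vote to halt
        inbox = outbox                      # barrier: deliver all messages at once
        for v, msgs in inbox.items():
            if msgs:
                active.add(v)               # an incoming message reactivates a vertex
        superstep += 1
    return values

g = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(run_pregel(g, {1: 3, 2: 6, 3: 2, 4: 1}))   # every vertex ends with 6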
Giraph Dataflow
[Diagram, three phases: (1) Loading the graph: the master assigns input splits (Split 0-4); workers read them through the input format and send graph parts (Part 0-3) to their owners, building the in-memory graph. (2) Compute/Iterate: workers compute on their parts, send messages, send stats to the master, and iterate. (3) Storing the graph: workers write their parts through the output format.]
MapReduce and Partitioning
• Map-Reduce splits the keys randomly between mappers/reducers
• But on natural graphs, high-degree vertices (keys) may have millions of times more edges than the average
• Extremely uneven distribution
• Time of iteration = time of the slowest job
49
Curse of the Slow Job
[Diagram: in each iteration, CPUs 1-3 process their data partitions and then wait at a barrier; one slow CPU delays every iteration.]
50
http://www.www2011india.com/proceeding/proceedings/p607.pdf
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
51
MapReduce’s (Hadoop’s) poor
performance on huge graphs
has motivated the development
of special graph-computation
systems
52
Parallel Computing and ML
Not all algorithms are efficiently data parallel
 Data-parallel: feature extraction, cross validation
 Complex parallel structure: kernel methods, belief propagation, SVM, tensor factorization, sampling, deep belief networks, neural networks, lasso
53
GraphLab
A New Framework for Parallel Machine Learning
 Toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering
 Stack: GraphLab Version 2.1 API (C++) on MPI/TCP-IP, PThreads, and Hadoop/HDFS, running on Linux cluster services (Amazon AWS)
 GraphLab easily incorporates external toolkits
 Automatically detects and builds external toolkits
55
The GraphLab Framework
 Graph-based data representation
 Update functions (user computation)
 Consistency model
58
Data Graph
Data associated with vertices and edges
Graph:
• Social network
Vertex data:
• User profile
• Current interests estimates
Edge data:
• Relationship (friend, classmate, relative)
59
Distributed Graph
Partition the graph across multiple machines.
60
Distributed Graph
Ghost vertices maintain adjacency structure
and replicate remote data.
“ghost” vertices
61
Distributed Graph
Cut efficiently using HPC Graph partitioning
tools (ParMetis / Scotch / …)
“ghost” vertices
62
The GraphLab Framework
 Graph-based data representation
 Update functions (user computation)
 Consistency model
63
Update Functions
A user-defined program, applied to a vertex, that transforms the data in the scope of that vertex. Update functions are applied (asynchronously) in parallel until convergence; many schedulers are available to prioritize computation.

Pagerank(scope) {
    // Update the current vertex data
    vertex.PageRank = a
    ForEach inPage:
        vertex.PageRank += (1 - a) × inPage.PageRank

    // Reschedule neighbors if needed (dynamic computation)
    if vertex.PageRank changes then
        reschedule_all_neighbors;
}

Why dynamic?
64
Shared Memory Dynamic Schedule
[Diagram: a scheduler holds a queue of vertices (a, b, h, …) from the graph; CPU 1 and CPU 2 pull vertices from it, run the update function, and push rescheduled neighbors back.]
The process repeats until the scheduler is empty.
65
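A minimal sketch of such a pull-based scheduler loop in Python; it is illustrative only, since GraphLab's real engines use multiple worker threads and prioritized schedulers.

from collections import deque

def run_dynamic(graph, data, update, seeds):
    """Run update(v, graph, data) on vertices pulled from a work queue.
    update returns the neighbours it wants rescheduled; the loop repeats
    until the scheduler is empty (only changed regions are recomputed)."""
    queue = deque(seeds)
    queued = set(seeds)
    while queue:                         # process repeats until scheduler is empty
        v = queue.popleft()
        queued.discard(v)
        for nbr in update(v, graph, data):
            if nbr not in queued:        # avoid duplicate entries for the same vertex
                queued.add(nbr)
                queue.append(nbr)

# Example: propagate a "visited" flag through a small graph.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
data = {v: False for v in g}

def visit(v, graph, data):
    if data[v]:
        return []                        # already processed: nothing to reschedule
    data[v] = True
    return graph[v]                      # reschedule out-neighbours

run_dynamic(g, data, visit, seeds=["a"])
print(data)                              # all vertices reachable from "a" are True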
Distributed Scheduling
Each machine maintains a schedule over the vertices it owns.
[Diagram: two machines, each with its own queue over its partition of the graph.]
Distributed consensus is used to identify completion.
66
Ensuring Race-Free Code
How much can computation overlap?
67
The GraphLab Framework
 Graph-based data representation
 Update functions (user computation)
 Consistency model
69
Serializability Example
 Edge consistency: update functions one vertex apart can be run in parallel; overlapping regions are only read.
 Stronger / weaker consistency levels are available: user-tunable consistency levels trade off parallelism and consistency.
70
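One simple way to realize edge consistency is to lock a vertex and its neighbours in a fixed global order before running the update. The Python sketch below is illustrative and is not GraphLab's actual locking engine.

import threading

def make_locks(graph):
    # One lock per vertex, created up front.
    return {v: threading.Lock() for v in graph}

def run_with_edge_consistency(v, graph, locks, update):
    # Acquire locks on v and its neighbours in a fixed (sorted) order to avoid
    # deadlock; updates on vertices at least one vertex apart can run in parallel.
    scope = sorted(set([v]) | set(graph[v]))
    for u in scope:
        locks[u].acquire()
    try:
        update(v)
    finally:
        for u in reversed(scope):
            locks[u].release()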
CoEM (Rosie Jones, 2005)
Named entity recognition task: is “Cat” an animal? Is “Istanbul” a place?
Vertices: 2 million; edges: 200 million. Noun phrases (“the cat”, “Australia”, “Istanbul”) are linked to the contexts they appear in (“<X> ran quickly”, “travelled to <X>”, “<X> is pleasant”).
 Hadoop (95 cores): 7.5 hrs
 GraphLab (16 cores): 30 min
 Distributed GraphLab (32 EC2 nodes): 80 secs (0.3% of the Hadoop time)
72
PageRank
 Hadoop: 5.5 hrs
 Twister: 1 hr
 GraphLab: 8 min
40M webpages, 1.4 billion links (100 iterations)
Hadoop results from [Kang et al. '11]
Twister (in-memory MapReduce) [Ekanayake et al. ‘10]
74
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
75
Power-Law Graphs are Difficult to Partition
• Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
• Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]
76
Program for this (the natural graph), run on this (the cluster: Machine 1, Machine 2)
• Split high-degree vertices
• New abstraction → equivalence on split vertices
77
The Graph-Parallel Abstraction
• A user-defined Vertex-Program runs on each vertex
• Graph constrains interaction along edges
– Using messages (e.g. Pregel [PODC’09, SIGMOD’10])
– Through shared state (e.g., GraphLab [UAI’10, VLDB’12])
• Parallelism: run multiple vertex programs simultaneously
78
Challenges of High-Degree Vertices
 Sequentially process edges
 Sends many messages (Pregel)
 Touches a large fraction of the graph (GraphLab)
 Edge meta-data too large for a single machine
 Asynchronous execution requires heavy locking (GraphLab)
 Synchronous execution prone to stragglers (Pregel)
79
Graph Partitioning
• Graph-parallel abstractions rely on partitioning:
  – Minimize communication
  – Balance computation and storage
• Data transmitted across the network is O(# cut edges)
80
Random Partitioning
• Both GraphLab and Pregel resort to random (hashed) partitioning on natural graphs
81
• GAS Decomposition: distribute vertex-programs
  – Move computation to data
  – Parallelize high-degree vertices
• Vertex partitioning:
  – Effectively distribute large power-law graphs
82
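Expressed as code, the GAS pattern splits the PageRank update from earlier into three user functions whose gather results combine with an associative sum, so a high-degree vertex can be processed in pieces on different machines. The Python sketch below is illustrative, not the PowerGraph C++ API; the damping factor and tolerance are assumptions.

DAMPING = 0.85

def gather(src_rank, src_out_degree):
    # Per in-edge contribution; partial sums can be computed on different machines.
    return src_rank / src_out_degree

def apply_fn(gathered_sum, num_vertices):
    # Apply phase: combine the gathered sum into the vertex's new rank.
    return (1.0 - DAMPING) / num_vertices + DAMPING * gathered_sum

def scatter(old_rank, new_rank, tol=1e-4):
    # Scatter phase: decide whether neighbours need to be rescheduled.
    return abs(new_rank - old_rank) > tol

def gas_iteration(in_edges, out_degree, rank):
    """in_edges: vertex -> list of in-neighbours; one synchronous GAS pass."""
    n = len(rank)
    new_rank = {}
    for v, srcs in in_edges.items():
        total = sum(gather(rank[u], out_degree[u]) for u in srcs)       # Gather
        new_rank[v] = apply_fn(total, n)                                # Apply
    activated = {v for v in in_edges if scatter(rank[v], new_rank[v])}  # Scatter
    return new_rank, activated

# Tiny example: A -> B, A -> C, B -> C, C -> A
in_edges = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_degree = {"A": 2, "B": 1, "C": 1}
rank = {v: 1.0 / 3 for v in in_edges}
for _ in range(50):
    rank, active = gas_iteration(in_edges, out_degree, rank)
    if not active:
        break
print(rank)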
Distributed Execution of a PowerGraph Vertex-Program
[Diagram: a vertex Y spanning Machines 1-4 has one master and three mirrors. Gather: partial sums Σ1, Σ2, Σ3, Σ4 are computed locally and combined into Σ at the master. Apply: the master computes the new value Y’. Scatter: Y’ is copied to the mirrors and scatter runs on every machine.]
83
Minimizing Communication in PowerGraph
 Communication is linear in the number of machines each vertex spans
 A vertex-cut minimizes the number of machines each vertex spans
 Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000]
84
New Approach to Partitioning
• Rather than cut edges (an edge-cut forces many edges to be synchronized between CPU 1 and CPU 2)…
• …we cut vertices, so only a single vertex must be synchronized.
• New theorem: for any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage.
85
Topic Modeling
• English-language Wikipedia
  – 2.6M documents, 8.3M words, 500M tokens
  – Computationally intensive algorithm
[Chart: throughput in million tokens per second. Smola et al. on 100 Yahoo! machines (specifically engineered for this task) vs. PowerGraph on 64 cc2.8xlarge EC2 nodes (200 lines of code & 4 human hours).]
86
Example Topics Discovered from Wikipedia
87
Triangle Counting on the Twitter Graph
Identify individuals with strong communities.
Counted: 34.8 billion triangles
 Hadoop [WWW’11]: 1536 machines, 423 minutes
 PowerGraph: 64 machines, 1.5 minutes (282x faster)
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
88
Outline
 Introduction
 MapReduce & Hadoop Architecture
 Higher-level query languages: Pig & Hive
 Pregel
 GraphLab
 PowerGraph
 Spark
89
Spark Motivation
• MapReduce simplified “big data” analysis on large, unreliable clusters
• But as soon as organizations started using it widely, users wanted more:
  – More complex, multi-stage applications
  – More interactive queries
  – More low-latency online processing
90
Spark Motivation
Complex jobs, interactive queries and online processing all need one thing that MR lacks:
Efficient primitives for data sharing
[Diagram: an iterative job with stages 1-3, an interactive-mining session with queries 1-3, and a stream-processing job made of many small jobs.]
91
Spark Motivation
Complex jobs, interactive queries and online processing all need one thing that MR lacks:
Efficient primitives for data sharing
Problem: in MR, the only way to share data across jobs is stable storage (e.g. a file system) -> slow!
92
2. What is Apache Spark?
93
Examples
[Diagram: data sharing in MapReduce. Iterative job: HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → …  Interactive mining: each of query 1, query 2, query 3, … does its own HDFS read of the input to produce result 1, result 2, result 3, …]
94
Goal: In-Memory Data Sharing
[Diagram: iterative job: input → iter. 1 → iter. 2 → … entirely through distributed memory. Interactive mining: one-time processing loads the input into distributed memory, then query 1, query 2, query 3, … run against it.]
10-100× faster than network and disk
95
Solution: Resilient Distributed Datasets (RDDs)
• Partitioned collections of records that can be stored in memory across the cluster
• Manipulated through a diverse set of transformations (map, filter, join, etc.)
• Fault recovery without costly replication
  – Remember the series of transformations that built an RDD (its lineage) to recompute lost data
96
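A minimal PySpark sketch of these ideas; it assumes pyspark is installed and a local master, and the data and names are illustrative rather than taken from the slides.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

nums = sc.parallelize(range(1, 1001), 8)       # a partitioned collection of records
squares = nums.map(lambda x: x * x)            # transformations are lazy ...
evens = squares.filter(lambda x: x % 2 == 0)   # ... and recorded as lineage

evens.cache()                                  # keep this RDD in memory across the cluster
print(evens.count())                           # the first action computes and caches it
print(evens.take(5))                           # later actions reuse the cached partitions;
                                               # lost partitions are rebuilt from lineage
sc.stop()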
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns (Scala):

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its block of the file and caches its partition of messages; results are sent back to the driver.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
97
Fault Recovery
RDDs track lineage information that can be used to efficiently reconstruct lost partitions
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))
[Lineage: HDFS File → filter(func = _.contains(...)) → Filtered RDD → map(func = _.split(...)) → Mapped RDD]
98
Fault Recovery Results
[Chart: iteration time (s) over 10 iterations. Iterations normally take about 56-59 s (119 s for the first); the iteration in which a failure happens takes 81 s while lost partitions are rebuilt from lineage, after which times return to normal.]
99
Example: Logistic Regression
Find best line separating two sets of points
random initial line
target
100
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
101
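For reference, the per-point expression inside the map closure is the gradient of the logistic loss for a label $y \in \{-1, +1\}$ (a standard derivation, not shown on the slide):

$\nabla_w \log\left(1 + e^{-y\,w^{\top}x}\right) = \left(\frac{1}{1 + e^{-y\,w^{\top}x}} - 1\right) y\,x$

Summing this over all points gives the variable gradient, and w -= gradient performs one gradient-descent step with a unit step size.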
Logistic Regression Performance
[Chart: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for further iterations.]
102
GraphX: Encoding Property Graphs as RDDs
103
GraphX: Graph System Optimizations
104
PageRank Benchmark
GraphX performs comparably to state-of-the-art graph processing systems.
105
Thanks for your attention!
106
References
 Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
 Miner, Donald, and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, Inc., 2012.
 White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
 http://hadoop.apache.org/
 http://hadoop.apache.org/pig
 http://hadoop.apache.org/hive
 http://www.cloudera.com/
107
References
 Hadoop: http://hadoop.apache.org/common
 Pig: http://hadoop.apache.org/pig
 Hive: http://hadoop.apache.org/hive
 Spark: http://spark-project.org
 Hadoop video tutorials: www.cloudera.com/hadoop-training
 Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/
108