CamCube
Rethinking the
Data Center Cluster
Paolo Costa
[email protected]
joint work with
Austin Donnelly, Greg O’Shea, Antony Rowstron (MSRC)
Hussam Abu-Libdeh (Intern, Cornell), Simon Schubert (Intern, EPFL)
A New Software Stack
• Dryad/DryadLINQ, Dremel, …
• The network is a critical component
• Focus of this talk: how to make it easy to design and deploy efficient data center applications
Building Data Center Applications is Hard!
• Abstraction: application logical topologies (tree, Dremel, Dynamo, MapReduce, Databus)
• Reality: the data center physical topology
Abstraction & Reality Mismatch
[Figure: logical application topology mapped onto a physical tree of a router and switches]
• One logical hop is mapped to multiple physical hops
• Two disjoint logical paths share some physical links
Issue #1: Oversubscription
• Bandwidth gets scarce as you move up the tree
• Locality is key to performance
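To make oversubscription concrete, here is a small sketch with illustrative numbers (the servers, NIC speeds, and uplink capacity below are assumptions, not figures from the talk):

```python
# Oversubscription ratio at a rack (ToR) switch: the aggregate bandwidth
# servers can offer downward vs. the uplink capacity toward the core.
# All numbers below are illustrative, not from the talk.

def oversubscription_ratio(servers: int, nic_gbps: float, uplink_gbps: float) -> float:
    """Ratio of aggregate server bandwidth to uplink bandwidth."""
    return (servers * nic_gbps) / uplink_gbps

# A hypothetical rack: 40 servers with 1 Gbps NICs behind a 10 Gbps uplink.
ratio = oversubscription_ratio(servers=40, nic_gbps=1.0, uplink_gbps=10.0)
print(ratio)  # 4.0 -> each server gets at most 250 Mbps across racks
```

At 4:1 oversubscription, cross-rack flows can only use a quarter of the server NIC bandwidth, which is why placing communicating tasks in the same rack matters.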
Issue #2: Path Collision
• The network allocates paths independently
• Applications cannot modify the way packets are routed
Addressing These Issues…
• Oversubscription: Fat-tree [SIGCOMM'08], VL2 [SIGCOMM'09], …
• Path collision: Hedera [NSDI'10], MPTCP [SIGCOMM'11], SPAIN [NSDI'10], …
• TCP incast: DCTCP [SIGCOMM'10], ICTCP [CoNEXT'10], FDS [OSDI'12], …
• Traffic prioritization: Orchestra [SIGCOMM'11], D2TCP [SIGCOMM'12], …
• Fair sharing: Seawall [NSDI'11], FairCloud [SIGCOMM'12], …
Applications & Network Gap
The network is a black box for applications (and vice versa).
Applications perspective:
• Applications only see IP addresses (e.g., 10.0.2.3, 10.0.1.4)
− Hard to infer locality & congestion ("Why is this server slow?")
• No control over packet routing
− Point-to-point only
• Need to reverse-engineer the network

Network perspective:
• The network only sees packets
• No insights into application behaviour ("Long vs. short flows? Are these flows related?")
• Has to infer application patterns
Internet & Data Centers
• This is due to how the Internet was designed…
− …but data centers are not mini-Internets

Internet:
• Multiple administration domains
• Heterogeneous HW and network
• Topology not known
• Malicious software
These constraints led to strict layer isolation.

Data centers:
• Single administration domain
• Homogeneous HW and network (x86 and Ethernet)
• Topology known (and can be customised)
• Trusted components (e.g., using virtualization)

How can we exploit this flexibility to improve efficiency and reduce complexity?
CamCube
How can we design a data center closer to what a distributed systems builder expects?
• Today: the network is a given and apps adapt to it
• CamCube: adapt the network to the apps' needs
Direct-Connect topology
• Servers are directly interconnected to each other via physical Ethernet cables (no switches / routers)
• A fully connected mesh topology would be ideal: all logical topologies (e.g., Dynamo) can be mapped perfectly
• However, a full mesh is not very scalable: node degree grows linearly with N (high server load and cabling complexity)
Which topology?
• Various options available
− Trees, rings, hypercubes, tori (2D torus, 3D torus), …
• CamCube uses a 3D torus:
− Scalable: node degree is constant (=6)
− Fault-tolerant: high degree of multi-path
− Easy to wire: only short links are needed
− Trade-off: increased hop count
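The constant node degree can be seen in a short sketch (a hypothetical helper, not CamCube code): each server at (x, y, z) has exactly six neighbors, one per direction along each axis, with coordinates wrapping around the torus.

```python
# Neighbors of a server in a k-ary 3D torus: +-1 along each axis,
# wrapping modulo k. Degree is always 6, independent of cluster size.
# Illustrative sketch, not the CamCube implementation.

def torus_neighbors(x: int, y: int, z: int, k: int):
    """Return the six 1-hop neighbors of (x, y, z) in a k x k x k torus."""
    deltas = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k) for dx, dy, dz in deltas]

# In a 3 x 3 x 3 cluster, server (2, 2, 0) wraps to (0, 2, 0) on the +x axis.
print(torus_neighbors(2, 2, 0, k=3))
```

The wrap-around links are what keep every cable short in practice: rows can be physically folded so that logically distant servers sit next to each other.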
Network Visibility
Traditional data center:
• Limited network visibility
− IP addresses only (e.g., 10.0.2.3, 10.0.1.4)
− Hard to infer server location and congestion

CamCube:
• Nodes have (x,y,z) coordinates, e.g., (1,2,1), (1,2,2)
− Easy to understand locality
• Servers have full visibility of the status of network links
Packet Routing
Traditional data center:
• Single routing protocol
− Point-to-point only

CamCube:
• Servers can intercept, process, and forward packets
− Multiple custom routing protocols, e.g., multicast, multipath
Packet Processing
Traditional data center:
• Application-agnostic packet processing
− Typically header-only, e.g., OpenFlow

CamCube:
• Application-specific packet processing
− Servers understand the application semantics
− E.g., caching, aggregation
CamCube Services
• Several services have been implemented
on top of CamCube, including:
• CamKey
− Key-value store
• Camdoop
− MapReduce-like system
• CamGraph
− Graph processing engine
• TCP/IP service
− Enables running unmodified TCP applications
Key-based Routing
• Packets are routed based on the key rather than the server address
• Inspired by Distributed Hash Tables (DHTs)
− The (x,y,z) coordinates define a key-space
• 160-bit keys are expressed as (x,y,z,w)
− If alive, (x,y,z) is the server responsible for the key
− Otherwise, keys are re-mapped to 1-hop neighbors based on w
• Example
− (2,2,0,27) -> (2,2,0), (2,1,0), (1,2,0), …
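A minimal sketch of this mapping, under the assumption that the w component simply orders a key's 1-hop fallback servers (the talk does not give CamCube's exact re-mapping function, so the rotation scheme and the object name below are illustrative):

```python
import hashlib

# Hash an object ID with SHA-1 and fold the 160-bit digest into
# (x, y, z, w) coordinates for a k x k x k torus. The fallback order
# derived from w is a simplified stand-in for CamCube's re-mapping.

def key_coords(obj_id: str, k: int = 3):
    digest = int(hashlib.sha1(obj_id.encode()).hexdigest(), 16)
    x, digest = digest % k, digest // k
    y, digest = digest % k, digest // k
    z, w = digest % k, digest // k
    return x, y, z, w

def responsible_servers(x: int, y: int, z: int, w: int, k: int = 3):
    """Primary server first, then its 1-hop neighbors ordered by w."""
    deltas = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    neighbors = [((x + dx) % k, (y + dy) % k, (z + dz) % k) for dx, dy, dz in deltas]
    # Rotate the neighbor list by w so different keys spread their fallbacks.
    start = w % len(neighbors)
    return [(x, y, z)] + neighbors[start:] + neighbors[:start]

x, y, z, w = key_coords("photo-42.jpg")  # hypothetical object ID
print(responsible_servers(x, y, z, w))   # primary first, then fallbacks
```

Because every fallback is a 1-hop neighbor of the primary, failover never moves a key far from where clients already route packets for it.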
CamKey
• Reliable high-performance key-value store
− Combination of BigTable + memcached
• Two components:
− Replicated store: ensures fault tolerance
− Caching service: provides high performance
Replicated Store
• Data object IDs are hashed using SHA-1 and the result is interpreted as 4D coordinates
− hash(ID) = e689eb3… = (2,2,0,27)
• The primary replica is stored at the server responsible for the key, (2,2,0)
• The first secondary replica is stored at the server that will become responsible for the key if the primary fails, (2,1,0)
− (2,2,0,27) -> (2,2,0), (2,1,0), (1,2,0), …
• The second secondary replica is stored on the next server on the list, (1,2,0), and so on
• High locality: secondary replicas are 1-hop neighbors, so disjoint paths can be used
• Client transparency: clients do not need to know the replica identity; key-based routing is used to deliver packets (route to (2,2,0,27))
Caching Service
• For each key, we generate c additional keys that represent the locations of caches
− f(2,2,0,27) -> (1,1,0,27), (3,1,0,27), (1,3,0,27), (3,3,0,27), …
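A sketch of one possible f. The talk does not define the function, but its example places caches at the four diagonal neighbors of the primary in the x-y plane, which the fixed offsets below reproduce (the offset choice is therefore an inference, not the real CamKey code):

```python
# For a key owned by (x, y, z), derive c = 4 cache locations by offsetting
# the primary's coordinates diagonally in the x-y plane; the w component
# is kept so the usual key-to-server mapping can place each cache.
# The offsets are inferred from the talk's example:
# f(2,2,0,27) -> (1,1,0,27), (3,1,0,27), (1,3,0,27), (3,3,0,27).

def cache_keys(x: int, y: int, z: int, w: int, k: int = 4):
    offsets = [(-1, -1, 0), (1, -1, 0), (-1, 1, 0), (1, 1, 0)]
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k, w) for dx, dy, dz in offsets]

print(cache_keys(2, 2, 0, 27))
# [(1, 1, 0, 27), (3, 1, 0, 27), (1, 3, 0, 27), (3, 3, 0, 27)]
```

Spreading caches symmetrically around the primary means a lookup from any direction passes near some cache before it reaches the owner.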
• These cache keys are assigned to servers using the usual mapping function
• When a server looks up a key, the path is chosen so as to pass through the closest cache
• On a cache miss, the lookup request is forwarded to the primary replica and the response is cached on the way back
• Subsequent requests for the same key are intercepted on-path and the associated value is returned
• Write operations always go to the primary replica and caches are invalidated
Evaluation
Testbed
− 27-server CamCube (3 x 3 x 3)
− Quad-core 2.27 GHz, 12 GB RAM
− Six 1 Gbps ports per server
− Runtime & services implemented in user-space (C#)
Workload: image store
− 9 external servers (up to 150 concurrent requests)
− Insert: 1.47 MB average image size
− Lookup: 3.55 KB average thumbnail size
Insert Throughput
[Chart: insert throughput (Gbps, 0-6) vs. concurrent insert requests (0-150), comparing switch, CamKey, switch (no disk), and CamKey (no disk)]
• CamKey exploits disjoint paths to create replicas
• The no-disk configurations are server-bandwidth bounded; with disks, throughput is disk-I/O bounded
Lookup Throughput
[Chart: lookup rate (reqs/s, 0-160,000) vs. concurrent lookup requests (0-150), comparing switch, CamKey (disabled cache), and CamKey]
• CamKey: caches reduce hop count; latency is 0.83 ms (median), 1.70 ms (95th percentile)
• CamKey (disabled cache): higher hop count; latency is 0.97 ms (median), 2.13 ms (95th percentile)
Failures
[Chart: CamKey insert throughput (Gbps, 0-6) vs. time (0-140 s); a random server fails every 10 s, leaving only 18 servers at the end]
MapReduce
[Figure: input file (chunks 0-2) -> map tasks -> intermediate results -> reduce tasks -> final results]
• Map
− Processes input data and generates (key, value) pairs
• Shuffle
− Distributes the intermediate pairs to the reduce tasks
• Reduce
− Aggregates all values associated with each key
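The three phases can be illustrated with a toy single-process WordCount (a standard example; hypothetical code, not from the talk):

```python
from collections import defaultdict

# Toy WordCount: map emits (word, 1) pairs, shuffle groups the pairs
# by key, and reduce sums the values for each key.

def map_task(chunk: str):
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    return key, sum(values)

chunks = ["the cat", "the dog", "the cat sat"]
intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
final = dict(reduce_task(k, vs) for k, vs in shuffle(intermediate).items())
print(final)  # {'the': 3, 'cat': 2, 'dog': 1, 'sat': 1}
```

In a real cluster the `shuffle` step is where the network cost lives: every map task's pairs must reach the reduce task that owns their key.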
Shuffle Phase
• The shuffle phase is challenging for data center networks
− All-to-all traffic pattern with O(N²) flows
• Often a bottleneck for MapReduce jobs
− Led to proposals for full-bisection bandwidth
Data Reduction
• The final results are typically much smaller than the intermediate results (e.g., WordCount)
• In most Facebook jobs the final size is 5.4% of the intermediate size
• In most Yahoo jobs the ratio is 8.2%
• How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?
Aggregation Tree
• We could use aggregation trees to perform multiple steps of aggregation and reduce inter-rack traffic
− e.g., rack-level aggregation
Mapping a tree…
… on a traditional topology
• Mismatch between logical and physical topology
− The link from the rack switch is shared by all children
… on CamCube
• 1:1 mapping between logical and physical topology
− Only one child per link
• Packets are aggregated on path (=> less traffic)
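The benefit of on-path aggregation can be sketched as follows (an illustrative recursion, not the Camdoop implementation): each server on the tree merges the partial counts received from its children with its own output and forwards one combined result upward, so each link carries the merged data rather than every descendant's data.

```python
from collections import Counter

# On-path aggregation along a tree: each node merges the partial word
# counts received from its children with its own map output and forwards
# a single combined dictionary to its parent.
# Illustrative sketch, not the Camdoop implementation.

def aggregate_up(node, children_of, local_counts):
    merged = Counter(local_counts[node])
    for child in children_of.get(node, []):
        merged += aggregate_up(child, children_of, local_counts)
    return merged

# A hypothetical 3-node chain mapped onto torus links:
# (0,0,0) <- (1,0,0) <- (2,0,0), with made-up local map outputs.
children_of = {(0, 0, 0): [(1, 0, 0)], (1, 0, 0): [(2, 0, 0)]}
local_counts = {
    (0, 0, 0): {"cat": 1},
    (1, 0, 0): {"cat": 2, "dog": 1},
    (2, 0, 0): {"dog": 3},
}
print(aggregate_up((0, 0, 0), children_of, local_counts))
# Counter({'dog': 4, 'cat': 3})
```

With a 1:1 mapping of tree edges to physical links, this merge happens at every hop, which is exactly the traffic reduction the slide describes.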
Camdoop
• Improves the performance of the shuffle phase by reducing the traffic rather than by increasing the bandwidth
Workload Parameter
• Output size / intermediate size (S)
− S=1 (no aggregation): all map outputs have a disjoint set of keys
− S=1/N ≈ 0 (full aggregation): all map outputs share the same set of keys
• We use synthetic workloads to explore different values of S
− Intermediate data size is 22.2 GB (843 MB/server)
Evaluation
[Chart: time (s, logscale) vs. output size / intermediate size (S), from full aggregation (S≈0) to no aggregation (S=1), comparing Baseline (running on the switch using TCP), Camdoop (no agg.), and Camdoop; the Facebook-reported aggregation ratio is marked on the x-axis]
• The gap between Baseline and Camdoop (no agg.) shows the impact of running on CamCube
• The gap between Camdoop (no agg.) and Camdoop shows the impact of in-network aggregation
Summary
• Data centers present both unique challenges and opportunities to network designers
• Good time to revisit previous assumptions and rethink application and protocol design
• CamCube
− Enables applications to "control" the network
− Removes the distinction between computation and network devices