Surviving Failures in Bandwidth-Constrained Datacenters
Authors:
Peter Bodik
Ishai Menache
Mosharaf Chowdhury
Pradeepkumar Mani
David A. Maltz
Ion Stoica
Presented By,
Sneha Arvind Mani
OUTLINE
• Introduction
• Motivation and Background
• Problem Statement
• Algorithmic Solutions
• Evaluation of the Algorithms
• Related Work
• Conclusion

Introduction

The main goals of this paper:
◦ Improve the fault tolerance of deployed applications.
◦ Reduce bandwidth usage in the network core.
How? By optimizing the allocation of applications to physical machines.
• Both of the above problems are NP-hard.
• The authors therefore formulate a related convex optimization problem that:
◦ Incentivizes spreading the machines of individual services across fault domains.
◦ Adds a penalty term for machine reallocations that increase bandwidth usage.
Introduction (2)
• Their algorithm achieves a 20%-50% reduction in bandwidth usage while improving worst-case survival by 40%-120%.
• Improvement in fault tolerance: it reduces the fraction of services affected by potential hardware failures by up to a factor of 14.
• The contribution of the paper is three-fold:
◦ Measurement study
◦ Algorithms
◦ Methodology
Motivation and Background
• Bing.com is a large-scale Web application running in multiple datacenters around the world.
• Some definitions used in this paper:
◦ Logical machine: the smallest logical component of a web application.
◦ Service: a set of many logical machines executing the same code.
◦ Environment: consists of many services.
◦ Physical machine: a physical server that can run a single logical machine.
◦ Fault domain: a set of physical machines that share a single point of failure.
Communication Patterns
• Tracing the communication between all pairs of servers, aggregated for each pair of services i and j, shows that the datacenter network core is highly utilized:

    Link utilization    Aggregate months above utilization
    >50%                115.7
    >60%                 47.5
    >70%                 18.3
    >80%                  6.2

• The traffic matrix is very sparse: only 2% of service pairs communicate at all.
Communication Patterns (2)
• The communication pattern is very skewed: 0.1% of the services that communicate generate 60% of all traffic, and 4.8% of service pairs generate 99% of the traffic.
• Services that do not require a lot of bandwidth can be spread out across the datacenter, improving their fault tolerance.
Communication Patterns (3)
• The majority of the traffic, 45%, stays within the same service; 23% leaves the service but stays within the same environment; and 23% crosses environments.
• The median service talks to nine other services.
• Communicating services form both small and large components.
Failure Characteristics
• Networking hardware failures cause significant outages.
• Redundancy reduces the impact of failures on lost bytes by only 40%.
• Power fault domains create non-trivial patterns.

Implications for the optimization framework: it has to consider the complex patterns of the power and networking fault domains, instead of simply spreading the services across several racks, in order to achieve good fault tolerance.
Problem Statement

Metrics:
• Bandwidth (BW): the sum of the rates on the core links; the overall measure of bandwidth usage at the core of the network.
• Fault Tolerance (FT): the average of the Worst-Case Survival (WCS) across all services.
• Number of Moves (NM): the number of servers that have to be re-imaged to get from the initial datacenter allocation to the proposed allocation.

Optimization:
    Maximize  FT − α·BW
    subject to  NM ≤ N0
where α is a tunable positive parameter and N0 is an upper limit on the number of moves (a minimal sketch follows below).
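To make the objective concrete, here is a minimal Python sketch (ours, not the authors' code) of the score being maximized and the move-budget constraint; ft, bw, num_moves, alpha, and n0 are assumed inputs corresponding to FT, BW, NM, α, and N0.

    def objective(ft, bw, alpha):
        # Score of an allocation: the optimizer maximizes FT - alpha * BW.
        return ft - alpha * bw

    def within_move_budget(num_moves, n0):
        # The move count NM <= N0 is enforced as a hard constraint.
        return num_moves <= n0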
Algorithmic Solutions

The solution roadmap is as follows:
◦ Cells: subsets of physical machines that belong to exactly the same fault domains. This allows a reduction in the size of the optimization problem (a sketch of this grouping follows the list).
◦ The Fault Tolerance Cost (FTC) is convex, hence minimizing FTC improves FT.
◦ The method to optimize BW is to perform a minimum k-way cut on the communication graph.
◦ CUT+FT+BW consists of two phases:
  - A minimum k-way cut to compute an initial assignment that minimizes bandwidth at the network core.
  - Iteratively moving machines to improve FT.
◦ FT+BW does not perform a graph cut but starts with the current allocation and improves performance by greedy moves that reduce a weighted sum of BW and FTC.
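As a rough illustration of the cell reduction described above, the following Python sketch groups physical machines by the exact set of fault domains they belong to; machines in the same cell are interchangeable for the optimizer. The input mapping fault_domains_of is a hypothetical name, not something from the paper.

    from collections import defaultdict

    def group_into_cells(machines, fault_domains_of):
        # Machines sharing exactly the same fault domains form one cell,
        # shrinking the optimization problem from machines to cells.
        cells = defaultdict(list)
        for m in machines:
            cells[frozenset(fault_domains_of[m])].append(m)
        return cells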
Formal Definitions

• I is the indicator function: I(n1, n2) = 1 if traffic from machine n1 to machine n2 traverses a core link, and I(n1, n2) = 0 otherwise.
• Bandwidth is given by:

    BW = Σ_{n1,n2} I(n1, n2) · B_{s(n1), s(n2)}

where s(n) is the service of logical machine n and B_{k1,k2} is the required BW between a pair of machines from services k1 and k2.
• To define FT, let z_{k,j} be the total number of machines allocated to service k that are affected by fault j. FT, the average worst-case survival, is given by:

    FT = (1/K) · Σ_k (1 − max_j z_{k,j} / m_k)

where K is the total number of services and m_k is the number of machines of service k.
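The two metrics can be sketched directly from these definitions. In the hypothetical Python below, alloc maps each logical machine to its physical machine, service_of gives s(n), crosses_core plays the role of the indicator I, and z holds the z_{k,j} counts; none of these names come from the paper.

    from itertools import combinations

    def bandwidth(alloc, service_of, req_bw, crosses_core):
        # BW: sum the required rate B over machine pairs whose traffic
        # traverses a core link (symmetric rates assumed in this sketch).
        total = 0.0
        for n1, n2 in combinations(alloc, 2):
            if crosses_core(alloc[n1], alloc[n2]):
                total += req_bw.get((service_of[n1], service_of[n2]), 0.0)
        return total

    def fault_tolerance(services, faults, num_machines, z):
        # FT: average worst-case survival, 1 - max_j z[k][j] / m_k.
        wcs = [1.0 - max(z[k][j] for j in faults) / num_machines[k]
               for k in services]
        return sum(wcs) / len(services)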
Formal Definitions (2)

• The Fault Tolerance Cost (FTC) is given by:

    FTC = Σ_k Σ_j b_k · w_j · z_{k,j}²

where b_k and w_j are positive weights assigned to services and faults, respectively.
• A decrease in FTC should increase FT: squaring the z_{k,j} variables incentivizes keeping their values small, which is achieved by spreading the machine assignment across multiple fault domains.
• The minimization of BW is based on a minimum k-way cut, which partitions the logical machines into a given number of clusters.
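A direct transcription of the FTC formula above into Python might look like the following sketch (variable names are ours):

    def ftc(services, faults, z, b, w):
        # Quadratic penalty: concentrating a service in one fault domain
        # makes some z[k][j] large, which the square term punishes.
        return sum(b[k] * w[j] * z[k][j] ** 2
                   for k in services for j in faults)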
Algorithms to improve both BW & FT
• CUT+FT: apply CUT in the first phase, then minimize FTC in the second phase using machine swaps.
• CUT+FT+BW: as above, but in the second phase a penalty term for bandwidth is added, i.e., the swap score becomes ΔFTC + α·ΔBW, where α is the weighting factor (a sketch of such a swap follows below).

NM-aware algorithm:
• FT+BW: start with the initial allocation and perform only the second phase of CUT+FT+BW.
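The second phase can be pictured as repeated hill-climbing swaps. This hedged sketch (not the authors' implementation) accepts a swap of two logical machines only if it lowers the combined cost FTC + α·BW, i.e., if ΔFTC + α·ΔBW < 0; cost is assumed to evaluate that combined objective for an allocation.

    def try_swap(alloc, n1, n2, cost):
        # Tentatively swap the placements of logical machines n1 and n2.
        before = cost(alloc)
        alloc[n1], alloc[n2] = alloc[n2], alloc[n1]
        if cost(alloc) < before:
            return True  # keep the improving swap
        alloc[n1], alloc[n2] = alloc[n2], alloc[n1]  # revert
        return False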
Scaling to large Datacenters
• CUT+RandLow, an algorithm that directly exploits the skewness of the communication matrix: apply the cut in the first phase; determine the subset of services whose aggregate BW is lower than the others'; then randomly permute the machine allocation of all services belonging to that subset.
• To scale to large datacenters, we sample a large number of candidate swaps and choose the one that most improves FTC (sketched below).
• Also, during the graph cut, logical machines of the same service are grouped into a smaller number of representative nodes.
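The swap sampling could look like the following sketch, which scores a random batch of candidate swaps against the combined cost and commits only the best improving one; cost and the sample size are illustrative assumptions.

    import random

    def best_sampled_swap(alloc, cost, num_samples=1000):
        # Instead of scanning all O(n^2) swaps, sample candidates and
        # keep the one that improves the cost the most.
        nodes = list(alloc)
        base = cost(alloc)
        best, best_delta = None, 0.0
        for _ in range(num_samples):
            n1, n2 = random.sample(nodes, 2)
            alloc[n1], alloc[n2] = alloc[n2], alloc[n1]
            delta = cost(alloc) - base
            alloc[n1], alloc[n2] = alloc[n2], alloc[n1]  # revert after scoring
            if delta < best_delta:
                best, best_delta = (n1, n2), delta
        if best is not None:
            n1, n2 = best
            alloc[n1], alloc[n2] = alloc[n2], alloc[n1]  # commit best swap
        return best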
Evaluation of Algorithms
• CUT+FT+BW: when ignoring server moves, it achieves a 30%-60% reduction in BW usage while at the same time improving FT by 40%-120%.
• FT+BW comes close to CUT+FT+BW: FT+BW performs only steepest-descent moves. It can be used in scenarios where the number of concurrent server moves is limited.
• The random allocation in CUT+RandLow works well because many services transfer relatively little data and can therefore be spread randomly across the datacenter.
Methodology to Evaluate
The following information is needed to perform the evaluation:
• The network topology of a cluster.
• The services running in the cluster and the list of machines required for each service.
• The list of fault domains and the machines in each fault domain.
• The traffic matrix for the services in the cluster.

The algorithms are compared on their entire achievable tradeoff boundaries between FT and BW, rather than at single operating points.
Comparing Different Algorithms

The solid circles represent the FT and BW at the starting allocation (at the origin), after BW-only optimization (bottom-left corner), and after FT-only optimization (top-right corner).
Optimizing for both BW and FT
• Artificially partitioning each service into several subgroups did not lead to satisfactory results.
• Augmenting the cut procedure with "spreading" requirements for services did not scale to large applications.
• CUT+FT: the graph is plotted by increasing the number of server swaps. By changing the number of swaps, the tradeoff between FT and BW can be controlled.
• The formulation is convex, so performing steepest descent until convergence leads to the global minimum with respect to fault tolerance.
Optimizing for both BW and FT (2)
• CUT+FT+BW: its behavior depends on α. The higher the value of α, the more weight is put on improving BW at the cost of not improving FT.
• It does not optimize over a convex function, so it is not guaranteed to reach the global optimum.
• CUT+RandLow: performs close to CUT+FT+BW, but it optimizes neither the BW of the low-talking services nor the FT of the high-talking ones.

The corresponding graphs show the tradeoff boundary between FT and BW for the different algorithms across three more datacenters.
Optimizing for BW, FT and NM
• Significant improvements are achieved by moving just 5% of the cluster. Moving 29% of the cluster achieves results similar to moving most of the machines using CUT+FT+BW.
• When running FT+BW until convergence, it achieves results close to CUT+FT+BW even without the graph cut.
• This is significant because it means FT+BW can be applied incrementally and still reach performance similar to CUT+FT+BW, which reshuffles the whole datacenter.
Improvements in FT & BW
• For α = 0.1, FT+BW reduced BW usage by 26% while improving FT by 140%; FT was reduced for only 2.7% of the services, much less than for α = 1.0.
• For α = 1.0, FT+BW reduced core BW usage by 47% and improved average FT by 121%.
Additional Scenarios
• Optimization of bandwidth across multiple layers.
• Preparing for maintenance and online recovery.
• Adapting to changes in traffic patterns.
• Hard constraints on fault tolerance and placement.
• Multiple logical machines on a server.
Related Work
• Datacenter traffic analysis
• Datacenter resource allocation
• Virtual network embedding
• High availability in distributed systems
• VPN and network testbed allocation
Conclusion
• The analysis shows that the communication volume between pairs of services has a long tail, with the majority of traffic being generated by a small fraction of service pairs.
• This allowed the optimization algorithm to spread most of the services across fault domains without significantly increasing BW usage in the core.
Thank You!