Partial-Parallel-Repair (PPR):
A Distributed Technique for Repairing
Erasure Coded Storage
Subrata Mitra (Purdue University)
Saurabh Bagchi (Purdue University)
Rajesh Panta (AT&T Labs Research)
Moo-Ryong Ra (AT&T Labs Research)
Need for storage redundancy
Data center storage is frequently affected by unavailability events
• Unplanned unavailability:
- Component failures, network congestion, software glitches, power failures
• Planned unavailability:
- Software/hardware updates, infrastructure maintenance
How does storage redundancy help?
• Prevents permanent data loss (reliability)
• Keeps the data accessible to users (availability)
Replication for storage redundancy
• Keep multiple copies of the data on different machines
• Data is divided into chunks
• Each chunk is replicated multiple times
Replication not suitable for big-data
1 Zettabyte = 10^9 TB (Ref: UNECE)
• Triple replication requires 2x extra storage for redundancy.
• The storage overhead becomes prohibitive for large data volumes.
Outline of the talk
• Erasure-coded storage as an alternative to replication
• The repair problem in erasure-coded storage
• Overview of prior work
• Our solution: Partial Parallel Repair (PPR)
• Implementation and evaluation
Erasure coded (EC) storage
Data is divided into stripes; each stripe has k data chunks and m parity
chunks, and can survive up to m chunk failures.
• Reed-Solomon (RS) is the most popular coding method
• Erasure coding has much lower storage overhead while providing the same or
better reliability.

Example for 10 TB of data:

Redundancy method     Total storage required    Reliability
Triple replication    30 TB                     2 failures
RS (k=6, m=3)         15 TB                     3 failures
RS (k=12, m=4)        13.33 TB                  4 failures
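The table above follows from simple arithmetic: an RS(k, m) stripe stores (k + m)/k times the raw data. A minimal sketch (the helper name is illustrative, not from the talk):

```python
def total_storage_tb(data_tb, k, m):
    """Raw data plus parity for an RS(k, m) code, in TB."""
    return data_tb * (k + m) / k

print(total_storage_tb(10, 6, 3))              # 15.0 TB, tolerates m = 3 failures
print(round(total_storage_tb(10, 12, 4), 2))   # 13.33 TB, tolerates 4 failures
print(10 * 3)                                  # 30 TB for triple replication
```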
The repair problem in EC storage
[Figure: a (4, 2) RS code, chunk size = 256 MB. Servers S1-S7 hold the four
data chunks and two parity chunks; one server has crashed, and its chunk is
rebuilt at a new destination server, so every surviving chunk flows over the
link into that one server.]
Network bottleneck slows down the repair process
The repair problem in EC storage (2)
Repair time in EC is much longer than in replication

Example for 10 TB of data:

Redundancy method     Total storage    Reliability    # chunks transferred during a repair
Triple replication    30 TB            2 failures     1
RS (k=6, m=3)         15 TB            3 failures     6
RS (k=12, m=4)        13.33 TB         4 failures     12

For a chunk size of 256 MB, this is 12 x 256 MB = 3 GB of data transferred
over a single link!
What triggers a repair?
• A monitoring process finds unavailable chunks
- Regular repairs
- The chunk is re-created on a new server
• A client finds missing or corrupted chunks
- Degraded reads
- The chunk is re-created at the client
- On the critical path of the user application
Existing solutions
• Keep additional parities: needs additional storage
Huang et al. (ATC 2012), Sathiamoorthy et al. (VLDB 2013)
• Mix replication and erasure coding: higher storage overhead than EC alone
Xia et al. (FAST 2015), Ma et al. (INFOCOM 2013)
• Repair-friendly codes: restricted parameters
Khan et al. (FAST 2012), Xiang et al. (SIGMETRICS 2010),
Hu et al. (FAST 2012), Rashmi et al. (SIGCOMM 2014)
• Delay repairs: depends on policy; immediate repair is still needed for
degraded reads
Silberstein et al. (SYSTOR 2014)
Our solution approach
Motivating observations:
• Over 98% of failures involve a single chunk failure in a stripe (Rashmi et
al., HotStorage '13)
• Network transfer time dominates the repair time
We introduce:
Partial Parallel Repair (PPR) - a distributed repair technique
• Focused on reducing repair time for a single chunk failure in a stripe
Key insight: partial calculations
[Figure: the encoding and repair equations of an RS stripe; a lost chunk is
recovered as a linear combination of the surviving chunks.]
• The equations are associative
• Individual terms can be calculated in parallel
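The associativity insight can be seen in a toy setting. Over GF(2), both the "+" in the repair equation and the coefficient products reduce to XOR, so partial terms computed on different branches combine to the same answer. A minimal sketch (the single-parity layout below is illustrative, not the paper's RS code):

```python
import functools
import operator

# Four one-byte data chunks and one XOR parity chunk.
data = [b"\x0b", b"\x22", b"\x35", b"\x4c"]
parity = bytes([functools.reduce(operator.xor, (d[0] for d in data))])

# Lose chunk 0. Each branch computes a partial term independently...
partial_a = data[1][0] ^ data[2][0]        # one branch of the aggregation tree
partial_b = data[3][0] ^ parity[0]         # another branch, in parallel

# ...and the destination only merges the partials.
repaired = bytes([partial_a ^ partial_b])
assert repaired == data[0]                 # associativity makes the split valid
```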
Partial Parallel Repair Technique

[Figure: in a traditional repair, one node pulls every surviving chunk over
its own link and computes a2*C2 + a3*C3 + a4*C4 + a5*C5 itself, making that
link the bottleneck. In PPR, pairs of servers first compute partial sums
(e.g. a2*C2 + a3*C3 and a4*C4 + a5*C5), which are then combined pairwise in a
tree toward the new destination.]

PPR communication patterns:

                              Traditional repair    PPR
Network transfer time         O(k)                  O(log2(k+1))
Repair traffic flow           Many-to-one           More evenly distributed
Amount of transferred data    Same                  Same
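The tree aggregation above can be simulated in a few lines. A sketch (not the QFS implementation), with XOR standing in for the associative GF arithmetic:

```python
import math

def ppr_rounds(terms):
    """Combine partial terms pairwise in each round, as in PPR's
    aggregation tree; returns (result, number of rounds)."""
    rounds = 0
    while len(terms) > 1:
        terms = [terms[i] ^ terms[i + 1] if i + 1 < len(terms) else terms[i]
                 for i in range(0, len(terms), 2)]
        rounds += 1
    return terms[0], rounds

vals = [3, 5, 9, 17, 33]                 # five partial terms, as in the figure
result, rounds = ppr_rounds(vals)
assert result == 3 ^ 5 ^ 9 ^ 17 ^ 33     # same answer as a serial combine
assert rounds == math.ceil(math.log2(len(vals)))  # logarithmic round count
```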
When is PPR most useful?
Network transfer times during repair:
• Traditional RS (k, m): (chunk size / bandwidth) * k
• PPR: (chunk size / bandwidth) * ceil(log2(k+1))

[Figure: the PPR / traditional transfer-time ratio plotted for k = 2 to 20;
the ratio falls steadily as k grows.]

PPR is useful when
• k is large
• the network is the bottleneck
• the chunk size is large
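The two formulas above give the ratio directly, since chunk size and bandwidth cancel out:

```python
import math

def transfer_ratio(k):
    """PPR network transfer time divided by traditional repair's."""
    return math.ceil(math.log2(k + 1)) / k

for k in (2, 6, 12, 20):
    print(k, round(transfer_ratio(k), 2))
# ratio shrinks as k grows: 1.0 at k=2, 0.5 at k=6, 0.33 at k=12, 0.25 at k=20
```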
Additional benefits of PPR
• The maximum data transferred to/from any node is logarithmically lower
- Implication: less repair bandwidth to reserve per node
• Computation is parallelized across multiple nodes
- Implication: lower memory footprint per node, and computation speedup
• PPR works whenever the encoding/decoding operations are associative
- Implication: compatible with a wide range of codes, including RS, LRC,
RS-Hitchhiker, Rotated-RS, etc.
Can we reduce the repair time a bit more?
• Disk I/O is the second most dominant factor in the total repair time
• Use caching to bypass the disk I/O time

[Figure: the Repair Manager keeps a table mapping each cached chunk to its
last access time and server - e.g. (C1, t1, server A) and (C2, t2, server B) -
merged from per-server caches and client reads.]
Multiple simultaneous failures
m-PPR: a scheduling mechanism for running multiple PPR-based repair jobs

[Figure: the Repair Manager receives chunk failures C1, C2, C3 and schedules a
repair for each, choosing a set of servers per chunk.]

A greedy approach that attempts to minimize resource contention.
Details in the paper.
Implementation and evaluations
• Implemented on top of the Quantcast File System (QFS)
- QFS has an architecture similar to HDFS
• The Repair Manager is implemented inside the QFS Meta Server
• Evaluated with various coding parameters and chunk sizes
• Evaluated PPR with Reed-Solomon codes and two repair-friendly codes (LRC and
Rotated-RS)
Repair time improvements

[Figure: bar chart of % reduction in repair time (0-70%) for RS codes (6, 3),
(8, 3), (10, 4), and (12, 4), with chunk sizes of 8, 16, 32, and 64 MB.]

PPR becomes more effective for higher values of "k"
Improvements for degraded reads

[Figure: degraded-read throughput (MBytes/sec) vs. available network bandwidth
(200-1024 Mbits/sec) for PPR and regular repair, each with 6+3 and 12+4
codes.]

PPR becomes more effective under constrained network bandwidth
Compatibility with existing codes

[Figure: repair time (sec) for RS, LRC, and Rotated-RS, each with and without
PPR.]

• PPR on top of LRC (Huang et al., ATC 2012) provides 19% additional savings
• PPR on top of Rotated Reed-Solomon (Khan et al., FAST 2012) provides 35%
additional savings
Summary
• Partial Parallel Repair (PPR): a technique that distributes the repair task
over multiple nodes to improve network utilization
• Theoretically, PPR reduces the network transfer time from linear to
logarithmic in k
• PPR is more attractive for higher "k" in (k, m) RS coding
• PPR is compatible with any associative code
Thank you!
Questions?
Backup
Network transfer time dominates

[Figure: stacked bars showing the % of total reconstruction time spent on
computation, disk read, and network transfer, for chunk sizes of 8, 16, 32,
and 64 MB under 6+3 and 12+4 RS codes.]

Network transfer takes up to 94% of the total repair time
The protocol

Relationship with chunk size
Multiple simultaneous failures (2)
• A weight is calculated for each server:
W_src = a1*hasCache - a2*(#reconstructions) - a3*userLoad
W_dst = -b1*(#repairDestinations) - b2*userLoad
• The weights represent the "goodness" of a server for scheduling the next
repair
• The best "k" servers are chosen as the source servers; similarly, the best
destination server is chosen
• All selections are subject to reliability constraints, e.g. chunks of the
same stripe must be in separate failure domains/update domains.
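The greedy selection can be sketched as follows. The coefficient values and server fields are illustrative stand-ins, not the paper's tuned parameters:

```python
def source_weight(s, a1=1.0, a2=0.5, a3=0.5):
    """W_src = a1*hasCache - a2*(#reconstructions) - a3*userLoad."""
    return a1 * s["has_cache"] - a2 * s["reconstructions"] - a3 * s["user_load"]

def pick_sources(servers, k):
    """Greedily pick the k highest-weight candidate source servers."""
    return sorted(servers, key=source_weight, reverse=True)[:k]

servers = [
    {"name": "A", "has_cache": 1, "reconstructions": 0, "user_load": 0.2},
    {"name": "B", "has_cache": 0, "reconstructions": 2, "user_load": 0.1},
    {"name": "C", "has_cache": 0, "reconstructions": 0, "user_load": 0.9},
]
best = pick_sources(servers, 2)
assert [s["name"] for s in best] == ["A", "C"]  # cached, lightly loaded first
```

A real scheduler would also filter candidates by the failure/update-domain constraints before ranking.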
Improvements from m-PPR

[Figure: total repair time (sec) for traditional RS repair vs. PPR, for 30,
50, 100, and 150 simultaneous failures.]

• m-PPR reduces repair time by 31%-47%
• Its effectiveness decreases with a higher number of simultaneous failures,
because the overall network transfers are already more evenly distributed
Benefits from caching