Slide 1: Partial-Parallel-Repair (PPR): A Distributed Technique for Repairing Erasure Coded Storage
Subrata Mitra, Saurabh Bagchi (Purdue University); Rajesh Panta, Moo-Ryong Ra (AT&T Labs Research)

Slide 2: Need for storage redundancy
Data center storage is frequently affected by unavailability events:
• Unplanned unavailability:
  - Component failures, network congestion, software glitches, power failures
• Planned unavailability:
  - Software/hardware updates, infrastructure maintenance
How does storage redundancy help?
• Prevents permanent data loss (reliability)
• Keeps the data accessible to the user (availability)

Slide 3: Replication for storage redundancy
• Keep multiple copies of the data on different machines
• Data is divided into chunks
• Each chunk is replicated multiple times

Slide 4: Replication not suitable for big data
• 1 zettabyte = 10^9 TB (Ref: UNECE)
• Triple replication requires 2x extra storage just for redundancy
• The storage overhead becomes prohibitive at large data volumes

Slide 5: Outline of the talk
• Erasure coded storage as an alternative to replication
• The repair problem in erasure coded storage
• Overview of prior work
• Our solution: Partial Parallel Repair
• Implementation and evaluations

Slide 6: Erasure coded (EC) storage
• A stripe consists of k data chunks and m parity chunks, and can survive up to m chunk failures
• Reed-Solomon (RS) is the most popular coding method
• Erasure coding has much lower storage overhead while providing the same or better reliability

Example for 10 TB of data:

  Redundancy method   | Total storage required | Reliability
  --------------------|------------------------|------------
  Triple replication  | 30 TB                  | 2 failures
  RS (k=6, m=3)       | 15 TB                  | 3 failures
  RS (k=12, m=4)      | 13.33 TB               | 4 failures

Slide 7: The repair problem in EC storage
[Figure: a (4, 2) RS stripe with 256 MB chunks spread over servers S1-S7, showing data chunks, parity chunks, a crashed server, and the new destination server. All surviving chunks funnel into the destination, whose ingress link becomes the bottleneck.]
• The network bottleneck slows down the repair process

Slide 8: The repair problem in EC storage (2)
Repair time in EC is much longer than in replication. Example for 10 TB of data:

  Redundancy method   | Total storage required | Reliability | # chunks transferred during a repair
  --------------------|------------------------|-------------|-------------------------------------
  Triple replication  | 30 TB                  | 2 failures  | 1
  RS (k=6, m=3)       | 15 TB                  | 3 failures  | 6
  RS (k=12, m=4)      | 13.33 TB               | 4 failures  | 12

For a chunk size of 256 MB, this means 12 x 256 MB of data transferred over a single link!
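The numbers in the two tables above follow directly from the definitions of replication and RS(k, m). A minimal sketch that reproduces them (illustrative code, not part of the talk):

```python
# Reproduces the storage and repair-traffic numbers on slides 6 and 8.

DATA_TB = 10      # total user data, as in the slides
CHUNK_MB = 256    # chunk size from the repair example

def replication_storage_tb(copies: int) -> float:
    """n-way replication stores the data n times in total."""
    return copies * DATA_TB

def rs_storage_tb(k: int, m: int) -> float:
    """RS(k, m) stores k data chunks plus m parity chunks per stripe."""
    return DATA_TB * (k + m) / k

print(replication_storage_tb(3))        # 30.0 TB, survives 2 failures
print(rs_storage_tb(6, 3))              # 15.0 TB, survives 3 failures
print(round(rs_storage_tb(12, 4), 2))   # 13.33 TB, survives 4 failures

# Repairing one lost chunk: replication copies a single surviving
# replica, while RS(k, m) must fetch k whole chunks to re-encode.
for label, chunks in [("3-replication", 1), ("RS(6,3)", 6), ("RS(12,4)", 12)]:
    print(label, chunks * CHUNK_MB, "MB transferred")   # 256 / 1536 / 3072
```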
Slide 9: What triggers a repair?
• The monitoring process finds unavailable chunks
  - Regular repairs
  - The chunk is re-created on a new server
• A client finds missing or corrupted chunks
  - Degraded reads
  - The chunk is re-created at the client
  - On the critical path of the user application

Slide 10: Existing solutions
• Keep additional parities: needs additional storage
  - Huang et al. (ATC 2012), Sathiamoorthy et al. (VLDB 2013)
• Mix replication and erasure coding: higher storage overhead than EC
  - Xia et al. (FAST 2015), Ma et al. (INFOCOM 2013)
• Repair-friendly codes: restricted parameters
  - Khan et al. (FAST 2012), Xiang et al. (SIGMETRICS 2010), Hu et al. (FAST 2012), Rashmi et al. (SIGCOMM 2014)
• Delay repairs: depends on policy; immediate repair is still needed for degraded reads
  - Silberstein et al. (SYSTOR 2014)

Slide 11: Our solution approach
Motivating observations:
• Over 98% of failure events involve a single chunk failure in a stripe (Rashmi et al., HotStorage '13)
• Network transfer time dominates the repair time
We introduce Partial Parallel Repair (PPR), a distributed repair technique focused on reducing the repair time for a single chunk failure in a stripe.

Slide 12: Key insight: partial calculations
• The encoding and repair equations are associative
• Individual terms can therefore be calculated in parallel (see the code sketch after the last slide)

Slide 13: Partial Parallel Repair technique
[Figure: traditional repair vs. PPR over servers S1-S7.]
• Traditional repair: the destination alone computes a2C2 + a3C3 + a4C4 + a5C5, so all surviving chunks cross its ingress link (the bottleneck)
• PPR: the same sum is computed as (a2C2 + a3C3) + (a4C4 + a5C5); pairs of servers compute partial sums in parallel and forward only the partial results

Slide 14: PPR communication patterns

  Metric                      | Traditional repair | PPR
  ----------------------------|--------------------|--------------------------
  Network transfer time       | O(k)               | O(log2(k+1))
  Repair traffic flow         | Many to one        | More evenly distributed
  Amount of transferred data  | Same               | Same

Slide 15: When is PPR most useful?
Network transfer times during repair:
• Traditional RS (k, m): (chunk size / bandwidth) * k
• PPR: (chunk size / bandwidth) * ceil(log2(k+1))
[Figure: the PPR/Traditional transfer-time ratio falls from about 1.0 toward 0.2 as k grows from 2 to 20.]
PPR is most useful when:
• k is large
• the network is the bottleneck
• the chunk size is large

Slide 16: Additional benefits of PPR
• The maximum data transferred to/from any node is logarithmically lower
  - Implication: less repair bandwidth reservation per node
• Computation is parallelized across multiple nodes
  - Implication: lower memory footprint per node and computation speedup
• PPR works whenever the encoding/decoding operations are associative
  - Implication: compatible with a wide range of codes, including RS, LRC, RS-Hitchhiker, Rotated-RS, etc.

Slide 17: Can we reduce the repair time a bit more?
• Disk I/O is the second dominant factor in the total repair time
• Use caching to bypass the disk I/O time
[Figure: the Repair Manager and servers A and B each track (chunkID, last access time) entries, e.g. (C1, t1) on server A and (C2, t2) on server B, so repairs can read recently accessed chunks from cache instead of disk.]

Slide 18: Multiple simultaneous failures
m-PPR: a scheduling mechanism for running multiple PPR-based repair jobs
[Figure: the Repair Manager receives chunk failures C1, C2, C3 and schedules each repair on a chosen set of servers.]
• A greedy approach that attempts to minimize resource contention
• Details in the paper

Slide 19: Implementation and evaluations
• Implemented on top of the Quantcast File System (QFS)
  - QFS has an architecture similar to HDFS
• The Repair Manager is implemented inside the Meta Server of QFS
• Evaluated with various coding parameters and chunk sizes
• Evaluated PPR with the Reed-Solomon code and two repair-friendly codes (LRC and Rotated-RS)

Slide 20: Repair time improvements
[Figure: % reduction in repair time for chunk sizes 8 MB, 16 MB, 32 MB, and 64 MB under RS codes (6, 3), (8, 3), (10, 4), and (12, 4).]
• PPR becomes more effective for higher values of k

Slide 21: Improvements for degraded reads
[Figure: degraded-read throughput (MB/s) vs. available network bandwidth (200-1024 Mbits/s) for PPR and regular repair under 6+3 and 12+4 coding.]
• PPR becomes more effective under constrained network bandwidth

Slide 22: Compatibility with existing codes
[Figure: repair time (sec) for RS, LRC, and Rotated-RS, each with and without PPR.]
• PPR on top of LRC (Huang et al., ATC 2012) provides 19% additional savings
• PPR on top of Rotated Reed-Solomon (Khan et al., FAST 2012) provides 35% additional savings

Slide 23: Summary
• Partial Parallel Repair (PPR): a technique for distributing the repair task over multiple nodes to improve network utilization
• Theoretically, PPR can reduce the network transfer time logarithmically
• PPR is more attractive for higher k in (k, m) RS coding
• PPR is compatible with any associative code

Slide 24: Thank you! Questions?
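As referenced on slide 12, here is a minimal sketch of the partial-calculation insight (slides 12-14). Real RS repair reconstructs a lost chunk as a sum of terms a_i * C_i over GF(2^8); the sketch treats each term as an opaque byte string and uses XOR, which is exactly addition in GF(2^8), as the associative operation. The tree shape and names are illustrative, not the paper's implementation.

```python
# Associativity lets the repair sum be computed as a tree of partial
# sums instead of entirely at the destination (slides 12-14).

import math
from functools import reduce

def gf_add(x: bytes, y: bytes) -> bytes:
    """Addition in GF(2^8) is byte-wise XOR."""
    return bytes(a ^ b for a, b in zip(x, y))

def traditional_repair(terms):
    """All k terms travel to one destination, which computes the whole
    sum, e.g. a2C2 + a3C3 + a4C4 + a5C5: k transfers share one link."""
    return reduce(gf_add, terms)

def ppr_repair(terms):
    """Pairs of servers combine partial sums, e.g. (a2C2 + a3C3) +
    (a4C4 + a5C5), in parallel rounds."""
    level = list(terms)
    rounds = 0
    while len(level) > 1:
        nxt = [gf_add(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd term waits for the next round
            nxt.append(level[-1])
        level = nxt
        rounds += 1                  # one round of parallel transfers
    print(f"combine rounds: {rounds}")
    return level[0]

k = 6
terms = [bytes([i]) * 8 for i in range(1, k + 1)]   # stand-ins for a_i*C_i
assert traditional_repair(terms) == ppr_repair(terms)  # same result

# Slide 15's model: transfer-time ratio PPR/Traditional = ceil(log2(k+1))/k.
# (The +1 counts the new destination node, which joins the tree as its root.)
for kk in (6, 12, 20):
    print(kk, math.ceil(math.log2(kk + 1)) / kk)    # 0.5, 0.33..., 0.25
```

For k = 6 the sketch reports 3 combine rounds, matching ceil(log2(6+1)) = 3 from the table on slide 14, while the total data transferred is the same as in traditional repair.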
Slide 25: Backup

Slide 26: Network transfer time dominates
[Figure: breakdown of total reconstruction time into computation, disk read, and network transfer for chunk sizes 8 MB-64 MB under 6+3 and 12+4 RS codes.]
• Network transfer takes up to 94% of the total repair time

Slide 27: The protocol

Slide 28: Relationship with chunk size

Slide 29: Multiple simultaneous failures (2)
• A weight is calculated for each server:
  - Wsrc = a1*hasCache - a2*(#reconstructions) - a3*userLoad
  - Wdst = -b1*(#repair destinations) - b2*userLoad
• The weights represent the "goodness" of a server for scheduling the next repair
• The best k servers are chosen as the source servers; similarly, the best destination server is chosen (a code sketch follows at the end)
• All selections are subject to reliability constraints, e.g., chunks of the same stripe must stay in separate failure domains/update domains

Slide 30: Improvements from m-PPR
[Figure: total repair time (sec) for traditional RS repair vs. PPR with 30, 50, 100, and 150 simultaneous failures.]
• m-PPR can reduce repair time by 31%-47%
• Its effectiveness diminishes with a higher number of simultaneous failures, because the overall network transfers are already more evenly distributed

Slide 31: Benefits from caching

Slide 32: Replication for storage redundancy (same content as slide 3)
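As referenced on slide 29, a minimal sketch of m-PPR's weight-based greedy selection. The formulas follow the slide; the coefficient values, the Server fields, and the assumption that reliability constraints are pre-applied when building the candidate lists are illustrative choices, not the paper's implementation.

```python
# Greedy source/destination selection using the slide-29 weights.

from dataclasses import dataclass

A1, A2, A3 = 1.0, 0.5, 0.5   # hypothetical tuning coefficients
B1, B2 = 0.5, 0.5

@dataclass
class Server:
    name: str
    has_cache: bool = False        # chunk already cached (slide 17)
    reconstructions: int = 0       # ongoing repair jobs as a source
    repair_destinations: int = 0   # ongoing repair jobs as a destination
    user_load: float = 0.0         # foreground user load

def w_src(s: Server) -> float:
    # Wsrc = a1*hasCache - a2*(#reconstructions) - a3*userLoad
    return A1 * s.has_cache - A2 * s.reconstructions - A3 * s.user_load

def w_dst(s: Server) -> float:
    # Wdst = -b1*(#repair destinations) - b2*userLoad
    return -B1 * s.repair_destinations - B2 * s.user_load

def schedule_repair(candidates, k, eligible_dsts):
    """Greedily pick the best k sources and the best destination;
    candidates are assumed already filtered by the reliability
    constraints (separate failure/update domains)."""
    sources = sorted(candidates, key=w_src, reverse=True)[:k]
    dest = max(eligible_dsts, key=w_dst)
    return sources, dest

# Example: pick 3 sources and a destination from hypothetical servers.
srcs = [Server("s1", has_cache=True), Server("s2", reconstructions=2),
        Server("s3"), Server("s4", user_load=0.8)]
dsts = [Server("d1", repair_destinations=1), Server("d2")]
sources, dest = schedule_repair(srcs, k=3, eligible_dsts=dsts)
print([s.name for s in sources], dest.name)  # ['s1', 's3', 's4'] d2
```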