Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Deduplication File System & Course Review Kai Li 12/13/13 Topics u Deduplication File System u Review 12/13/13 2 Storage Tiers of A Tradi/onal Data Center Mirrored storage $$$$ $$$ Dedicated Fibre Clients 12/13/13 Server Primary storage Kai Li 3 Replacing Tape Library using Disk Storage? Mirrored storage Dedicate d Fibre Clients Server Primary storage Costs • Purchase: $$$$ • Bandwidth: $$$$ • Power: $$$ 20 × primary (3-month retention) 20 × primary WAN Onsite Remote 12/13/13 Kai Li 4 Vision: Deduplica+on Storage Eco-‐System Mirrored storage Dedicated WAN Clients Server Primary storage Value Proposi/ons: • Purchase: ~Tape libraries • Space: 10-‐30X reduc/on • WAN BW: 10-‐50X reduc/on • Power: ~10X reduc/on 12/13/13 Onsite WAN Remote Kai Li 5 Before: 17 Tape Libraries 12/13/13 Kai Li 6 After: 3 3-U Systems! 12/13/13 Kai Li 7 DD Protected 26EB & Replaced 33M Tapes -‐ EMC Blog by C. Gordon (4/26/2013) Local vs. Global Redundancies “Local” compression Deduplication ~100k Encode a sliding window [Ziv&Lempel77] WAN ~2-3X compression ~10-30X compression Larger “windows” have more redundancies 12/13/13 Kai Li 9 Consolidated … Media server … Backup data streams Dedup Storage System for Backups … data streams Dedup Storage Controller Replication WAN Media server Meta data Unique Data segments Physical storage For deduped data Dedup Storage 12/13/13 Kai Li 10 How Does It Work? Segmented data stream Dedup Storage Controller Meta data Unique data segments Dedup Storage 12/13/13 Kai Li 11 Fixed vs. Variable Segmenta/on • Fixed size 4k 4k 4k ... Cannot handle deletes, shifts ... No problem w/ deletes, shifts [Manber93, Brin94] • Content-‐based, variable size A X C D A Y C D A B C D A B fp = 10110110 fp = 10110100 ... fp = 10110000 12/13/13 Kai Li 12 More Details Segmented data stream SegmentedDedup Storage data stream Controller Meta data Unique Data segments For each segment • Compute a strong fingerprint Dedup Storage • Use an index to lookup Controller – If unique Dedup Storage Meta data • Locally compress the segment • store segment Unique • data store meta data segments – If duplicate • store meta data Dedup Storage 12/13/13 Kai Li 13 Design Challenges • Very reliable and self-healing – Corrupting a segment may corrupt multiple files – NVRAM to store log (transactions) – Invulnerability features: • Frequent verifications • Metadata reconstruction from self-describing containers • Self-correction from RAID-6 • High-speed high-compression at low HW cost – Speed challenge: data 2X/18 months and 24 hours/day – Compression: reduce cost – Use commodity server hardware 12/13/13 14 High Deduplica/on Ra/o • High deduplica/on factor è hardware cost – Smaller segments achieve higher compression ra/os? – Smaller segments result in higher ra/o of metadata to physical segments • Data Domain’s approach – Use the sweet spots of segment sizes – Mul/ple local compression algorithms 12/13/13 Kai Li 15 Segment Sizes for Backup Data Rule of thumb: 2X segment size will u u u increase space for unique segments by 15% decrease metadata by about 50% deduce disk I/Os for writes and reads 12/13/13 Kai Li 16 Real World Example at Datacenter A 12/13/13 Kai Li 17 Real World Compression at Datacenter A 12/13/13 Kai Li 18 Real World Example at Datacenter B 12/13/13 Kai Li 19 Real World Compression at Datacenter B 12/13/13 Kai Li 20 High Throughput Is Challenging Divide data streams into segments Lookup 12/13/13 Fingerprint Index Kai Li Index size for 160TB w/ 8KB segments = (160TB/8KB) * 24B = 480GB! 21 Caching? Divide data streams into segments Miss Lookup Index Cache Fingerprint Index Problem: No locality. 12/13/13 Kai Li 22 Parallel Index Need Many Disks [Venti02] Divide data streams into segments Miss Lookup ... Index Cache Problem: Need a lot of disks. 7200RPM disk does 120 lookups/sec. 1MB/sec with 8KB segment per disk 1GB/sec needs 1,000 disks! 12/13/13 Fingerprint Index Kai Li 23 Staging Needs More Disks Data Streams Divide data streams into segments Fingerprint Index Lookup ... Very Big Disk Buffer Problem: The Buffer needs to be as large or larger than the full backup! Big delay for replication 12/13/13 Kai Li 24 Stream Informed Segment Layout Log structured layout (inspired by LFS) u Fixed size large containers to create locality u A container u Segments from the same stream l Metadata (index data) l Metadata section … Metadata section Data section Data section Stream 1 Stream 2 12/13/13 Metadata section Data section … Stream 3 25 A Combination of Techniques u Stream Informed Segment Layout u A sophisticated cache for the fingerprint index Summary data structure for new data l “locality-preserved caching” for old data l u Parallelized software systems to leverage multicore processors Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of The 6th USENIX Conference on File and Storage Technologies (FAST’08). February 2008 12/13/13 Kai Li 26 Summary Data Structure Goal: Use minimal memory to test for new data ⇒ Summarize what segments have been stored, with Bloom filter (Bloom’70) in RAM ⇒ If Summary Vector says no, it’s new segment Summary Vector 1 1 1 0 1 1 h1 Set bits fp(s i) Approximation h2 h3 Index Data Structure h1 Read bits fp(s i) h2 h3 12/13/13 1 27 Locality Preserved Caching Algorithm u Disk Index has all <fingerprint, containerID> pairs u On a miss, lookup Disk Index to find containerID u Load the container into memory u Load the metadata of a container into Index Cache, replace if needed Metadata Fingerprint Disk Index Replacement Load metadata ContainerID Data Index Cache Container 12/13/13 Kai Li 28 Putting Them Together A fingerprint Duplicate Index Cache No New Replacement Summary Vector Maybe Disk Index 12/13/13 metadata metadata metadata data metadata data data data 29 Disk I/O Reduction Results Exchange data (2.56TB) Engineering data (2.39TB) 135-daily full backups # disk I/Os No summary, No SISL/LPC % of total 100-day daily inc, weekly full # disk I/Os % of total 328,613,503 100.00% 318,236,712 100.00% Summary only 274,364,788 83.49% 259,135,171 81.43% SISL/LPC only 57,725,844 17.57% 60,358,875 18.97% Summary & SISL/LPC 3,477,129 1.06% 1,257,316 0.40% 12/13/13 Kai Li 30 Revenue ($million) 1100 4083 1000 4000 3500 900 800 3000 700 2500 600 2000 500 1500 400 300 1500 1000 200 400 100 500 200 100 80 40 0 0 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 A1 A2 B C IPO Dedup Throughput (MB/sec) Deduplica/on Throughput M/A Improved by ~100X in 6 years 12/13/13 Kai Li 31 Physical Capacity 290 1000 250 900 800 200 700 600 140 500 400 150 100 300 200 48 100 0 300 8 20 50 Physical Capacity (TB) Revenue ($million) 1100 0.875 4 0 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 A1 A2 B C IPO M/A Increased by ~330X in 6 years 12/13/13 Kai Li 32 Even More Details Segmented data stream Dedup Storage Controller Meta data Unique Data Segs Concurrent Garbage Collec/on • A segment may be shared by mul/ple files • Backups need to be deleted some/me • GC cannot interfere with backups Dedup Storage F1 A 12/13/13 Kai Li F2 B C D 33 More Details Segmented data stream Dedup Storage Controller Meta data Unique Data Segs Concurrent Garbage Collec/on • A segment may be shared by mul/ple files • Backups need to be deleted some/me • GC cannot interfere with backups Dedup Storage F1 A 12/13/13 Kai Li F2 B C D 34 Summary • Dedup storage systems – Replace tape libraries – Near-‐line and archival storage – Primary storage • More Redundancy in larger windows • Locality preserved caching – Reduce cost – Achieve high throughput 12/13/13 Kai Li 35 Review Topics • • • • • • • OS structure Process management CPU scheduling I/O devices Virtual memory Disks and file systems General concepts 36 Operating System Structure • Abstraction • Protection and security • Kernel structure – Layered – Monolithic – Micro-kernel • Virtualization – Virtual machine monitor 37 Process Management • Implementation – State, creation, context switch – Threads and processes • Synchronization – Race conditions and inconsistencies – Mutual exclusion and critical sections – Semaphores: P() and V() – Atomic operations: interrupt disable, test-and-set. – Monitors and Condition Variables – Mesa-style monitor l Deadlocks – How deadlocks occur? – How to prevent deadlocks? 38 CPU Scheduling • Allocation – Non-preemptible resources • Scheduling -- Preemptible resources – FIFO – Round-robin – STCF – Lottery 39 I/O Devices • • • • • • Latency and bandwidth Interrupts and exceptions DMA mechanisms Synchronous I/O operations Asynchronous I/O operations Message passing 12/13/13 40 Virtual Memory • Mechanisms – Paging – Segmentation – Page and segmentation – TLB and its management • Page replacement – FIFO with second chance – Working sets – WSClock 41 Disks and File Systems • Disks – Disk behavior and disk scheduling – RAID4, RAID5 and RAID6 • Flash memory – Write performance – Wear leveling – Flash translation layer • • • • • • • Directories and implementation File layout Buffer cache Transaction and journaling file system NFS and Stateless file system Snapshot Deduplication file system 42 Major Concepts • Locality – Spatial and temporal locality • Scheduling – Use the past to predict the future • Layered abstractions – Synchronization, transactions, file systems, etc • Caching – TLB, VM, buffer cache, etc 43 Operating System as Illusionist Physical reality • Single CPU • Interrupts • Limited memory • No protection • Raw storage device Abstraction • Infinite number of CPUs • Cooperating sequential threads • Unlimited virtual memory • Each address has its own machine • Organized and reliable storage system 44 Future Courses in Systems • Spring – COS 461: computer networks – COS 598C: Analytics and systems of big data • Fall: – COS 432: computer security – COS 561: Advance computer networks or (or COS 518: Advanced OS) – Some grad seminars in systems 12/13/13 45