Deduplication File System & Course Review
Kai Li
12/13/13
Topics
•  Deduplication File System
•  Review
Storage Tiers of a Traditional Data Center
[Diagram: clients connect over dedicated Fibre to a server with primary storage and mirrored storage (cost tiers marked $$$ and $$$$).]
Replacing Tape Library using Disk Storage?
[Diagram: onsite clients connect over dedicated Fibre to a server with primary and mirrored storage; backup disk storage of 20 × primary capacity (3-month retention) sits onsite and is replicated over a WAN to another 20 × primary at a remote site.]
Costs
•  Purchase: $$$$
•  Bandwidth: $$$$
•  Power: $$$
Vision: Deduplication Storage Eco-System
[Diagram: onsite clients, server, primary and mirrored storage over dedicated Fibre, with deduplication storage replicated over a WAN to a remote site.]
Value Propositions:
•  Purchase: ~tape libraries
•  Space: 10-30X reduction
•  WAN BW: 10-50X reduction
•  Power: ~10X reduction
Before: 17 Tape Libraries
After: 3 3-U Systems!
DD Protected 26EB & Replaced 33M Tapes (EMC Blog by C. Gordon, 4/26/2013)

Local vs. Global Redundancies
[Diagram: "local" compression encodes a sliding window of ~100k [Ziv & Lempel 77] and achieves ~2-3X compression; deduplication works over a much larger window (e.g., whole backups replicated across the WAN) and achieves ~10-30X compression.]
Larger "windows" have more redundancies
Dedup Storage System for Backups
[Diagram: media servers send consolidated backup data streams to the dedup storage controller, which keeps metadata and unique data segments on physical storage for deduped data and replicates over a WAN to a remote dedup storage system.]
How Does It Work?
[Diagram: a segmented data stream enters the dedup storage controller, which writes metadata and unique data segments to dedup storage.]
Fixed vs. Variable Segmentation
•  Fixed size (4k): cannot handle deletes and shifts
•  Content-based, variable size [Manber93, Brin94]: no problem with deletes and shifts
[Diagram: example data A B C D with edits (A X C D, A Y C D); content-based fingerprints (fp = 10110...) place the boundaries, so they move with the data.]
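Content-based segmentation is usually built on a rolling hash: a boundary is declared wherever the hash of the last few dozen bytes matches a fixed bit pattern, so boundaries depend only on nearby content and survive inserts and deletes. Below is a minimal Python sketch of that idea; the window size, multiplier, mask (about 8KB average segments), and min/max bounds are illustrative assumptions, not Data Domain's actual segmenter.

```python
# Minimal content-defined chunking sketch. WINDOW, PRIME, MASK, and the
# min/max segment sizes are illustrative choices, not Data Domain's parameters.
WINDOW = 48                      # bytes covered by the rolling hash
PRIME = 1000003                  # multiplier for the rolling hash
MASK = 0x1FFF                    # boundary when the low 13 bits are zero (~8KB average)
MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024
MOD = 1 << 32
POW_W = pow(PRIME, WINDOW, MOD)  # contribution of the byte leaving the window

def segments(data: bytes):
    """Yield variable-size segments whose boundaries depend only on content,
    so inserts and deletes only disturb the segments they touch."""
    start, h, window = 0, 0, []
    for i, b in enumerate(data):
        window.append(b)
        h = (h * PRIME + b) % MOD
        if len(window) > WINDOW:                 # slide the window forward
            h = (h - window.pop(0) * POW_W) % MOD
        seg_len = i - start + 1
        if (h & MASK) == 0 and seg_len >= MIN_SEG or seg_len >= MAX_SEG:
            yield data[start:i + 1]
            start, h, window = i + 1, 0, []
    if start < len(data):                        # trailing partial segment
        yield data[start:]
```

Inserting or deleting a byte only moves the boundaries near the edit; segments elsewhere keep the same content and therefore the same fingerprints.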
More Details
[Diagram: segmented data stream → dedup storage controller → metadata and unique data segments in dedup storage.]
For each segment
•  Compute a strong fingerprint
•  Use an index to lookup
–  If unique
•  Locally compress the segment
•  Store the segment
•  Store metadata
–  If duplicate
•  Store metadata
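A minimal sketch of this per-segment write path, assuming a SHA-1 fingerprint, an in-memory dict as the index, and zlib for local compression; the real controller uses the on-disk fingerprint index, containers, and NVRAM logging discussed on the surrounding slides.

```python
import hashlib
import zlib

class DedupStore:
    """Toy per-segment write path: fingerprint, index lookup, store-if-unique."""

    def __init__(self):
        self.index = {}       # fingerprint -> location of the stored segment
        self.storage = []     # locally compressed unique segments ("dedup storage")
        self.recipe = []      # per-segment metadata: how to reassemble the stream

    def write_segment(self, segment: bytes) -> int:
        fp = hashlib.sha1(segment).digest()               # strong fingerprint
        loc = self.index.get(fp)                          # index lookup
        if loc is None:                                   # unique segment
            self.storage.append(zlib.compress(segment))   # local compression
            loc = len(self.storage) - 1
            self.index[fp] = loc                          # store index metadata
        self.recipe.append((fp, loc))                     # duplicate or not: store metadata
        return loc
```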
Design Challenges
•  Very reliable and self-healing
–  Corrupting a segment may corrupt multiple files
–  NVRAM to store log (transactions)
–  Invulnerability features:
•  Frequent verifications
•  Metadata reconstruction from self-describing containers
•  Self-correction from RAID-6
•  High-speed high-compression at low HW cost
–  Speed challenge: data grows 2X every 18 months and systems run 24 hours/day
–  Compression: reduce cost
–  Use commodity server hardware
High Deduplication Ratio
•  High deduplication factor ⇒ lower hardware cost
–  Do smaller segments achieve higher compression ratios?
–  Smaller segments result in a higher ratio of metadata to physical segments
•  Data Domain's approach
–  Use the sweet spots of segment sizes
–  Multiple local compression algorithms
Segment Sizes for Backup Data
Rule of thumb: doubling the segment size will
•  increase space for unique segments by 15%
•  decrease metadata by about 50%
•  reduce disk I/Os for writes and reads
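As a back-of-the-envelope illustration of why there is a sweet spot, the sketch below applies this rule of thumb, assuming a 4KB baseline, 160TB of unique data, and 24 bytes of index metadata per segment (the per-entry figure quoted a few slides later); all numbers are assumptions for illustration.

```python
# Back-of-the-envelope tradeoff, applying the rule of thumb above.
ENTRY_BYTES = 24              # assumed index metadata per segment
BASE_UNIQUE = 160e12          # assumed bytes of unique data at the 4KB baseline

for doublings, seg_kb in enumerate([4, 8, 16]):
    unique = BASE_UNIQUE * (1.15 ** doublings)          # +15% space per doubling
    index = (unique / (seg_kb * 1024)) * ENTRY_BYTES    # one index entry per segment
    print(f"{seg_kb:2d}KB segments: unique ~{unique/1e12:5.1f}TB, "
          f"index ~{index/1e9:5.0f}GB")
```

Index metadata shrinks quickly as segments grow, while unique-segment space grows slowly, which is why a mid-size segment hits the sweet spot.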
Real World Example at Datacenter A
Real World Compression at Datacenter A
Real World Example at Datacenter B
Real World Compression at Datacenter B
High Throughput Is Challenging
[Diagram: data streams are divided into segments, and each segment's fingerprint is looked up in the fingerprint index.]
Index size for 160TB with 8KB segments
  = (160TB / 8KB) × 24B
  = 480GB!
Caching?
[Diagram: data streams are divided into segments; fingerprint lookups go through an index cache, which misses and falls through to the fingerprint index.]
Problem: no locality. Fingerprints are effectively random, so an index cache sees almost no hits.
A Parallel Index Needs Many Disks [Venti02]
[Diagram: data streams are divided into segments; on an index-cache miss, the lookup goes to a fingerprint index spread across many disks.]
Problem: need a lot of disks.
•  A 7200 RPM disk does ~120 lookups/sec
•  With 8KB segments, that is ~1 MB/sec per disk
•  1 GB/sec needs ~1,000 disks!
Staging Needs More Disks
[Diagram: data streams are divided into segments and staged in a very big disk buffer before fingerprint-index lookups.]
Problems:
•  The buffer needs to be as large or larger than the full backup!
•  Big delay for replication
Stream Informed Segment Layout
Log-structured layout (inspired by LFS)
•  Fixed-size large containers to create locality
•  A container
–  Segments from the same stream
–  Metadata (index data)
[Diagram: Stream 1, Stream 2, Stream 3, ... each fill their own containers; every container has a metadata section and a data section.]
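A minimal sketch of this container layout, assuming one open container per backup stream and an illustrative 4MB container size; the metadata format (fingerprint, offset, length) is also an assumption.

```python
CONTAINER_SIZE = 4 * 1024 * 1024     # illustrative fixed container size

class Container:
    """Fixed-size container: a metadata section plus a data section,
    holding segments from a single stream (stream-informed layout)."""

    def __init__(self, container_id: int):
        self.id = container_id
        self.meta = []            # metadata section: (fingerprint, offset, length)
        self.data = bytearray()   # data section: the segments themselves

    def has_room(self, segment: bytes) -> bool:
        return len(self.data) + len(segment) <= CONTAINER_SIZE

    def append(self, fingerprint: bytes, segment: bytes):
        self.meta.append((fingerprint, len(self.data), len(segment)))
        self.data.extend(segment)

class ContainerManager:
    """Keeps one open container per backup stream so segments written together
    stay together on disk; this creates the locality that the fingerprint
    cache exploits later."""

    def __init__(self):
        self.open = {}            # stream_id -> currently open Container
        self.sealed = []          # full containers, appended log-style
        self.next_id = 0

    def add(self, stream_id, fingerprint, segment):
        c = self.open.get(stream_id)
        if c is None or not c.has_room(segment):
            if c is not None:
                self.sealed.append(c)         # seal the full container
            c = Container(self.next_id)
            self.next_id += 1
            self.open[stream_id] = c
        c.append(fingerprint, segment)
        return c.id
```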
A Combination of Techniques
•  Stream Informed Segment Layout
•  A sophisticated cache for the fingerprint index
–  Summary data structure for new data
–  "Locality-preserved caching" for old data
•  Parallelized software systems to leverage multicore processors

Benjamin Zhu, Kai Li and Hugo Patterson. Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08), February 2008.
Summary Data Structure
Goal: use minimal memory to test for new data
⇒ Summarize what segments have been stored, with a Bloom filter (Bloom'70) in RAM
⇒ If the Summary Vector says no, it's a new segment
[Diagram: inserting a fingerprint fp(s_i) sets bits at positions h1, h2, h3 of the Summary Vector; a lookup reads the same bits. The Summary Vector is an approximation of the index data structure.]
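A minimal Bloom-filter Summary Vector sketch; the bit-array size and the SHA-1-based hash construction are illustrative assumptions, not the production parameters.

```python
import hashlib

class SummaryVector:
    """Bloom filter: a 'no' is definite (new segment); a 'yes' is only a maybe."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 3):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fp: bytes):
        # Derive k bit positions from the fingerprint (illustrative construction).
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + fp).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, fp: bytes):
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, fp: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fp))
```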
Locality Preserved Caching
Algorithm
•  The Disk Index has all <fingerprint, containerID> pairs
•  On a miss, look up the Disk Index to find the containerID
•  Load the container into memory
•  Load the metadata of the container into the Index Cache, replacing an entry if needed
[Diagram: fingerprint → Disk Index → containerID → load the container's metadata section into the Index Cache (with replacement).]
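A minimal sketch of this algorithm, reusing the Container sketch from the stream-informed layout slide; the dict-based disk index and the LRU replacement over whole containers are illustrative assumptions.

```python
from collections import OrderedDict

class LocalityPreservedCache:
    """Index cache that caches whole containers' metadata sections, so one
    disk-index miss prefetches the fingerprints of all neighboring segments."""

    def __init__(self, disk_index, containers, max_containers=1024):
        self.disk_index = disk_index      # fingerprint -> containerID (on disk)
        self.containers = containers      # containerID -> Container
        self.max_containers = max_containers
        self.cache = OrderedDict()        # containerID -> {fingerprint, ...} (LRU)
        self.fp_to_container = {}         # fingerprints currently cached

    def lookup(self, fp: bytes):
        """Return the containerID holding fp, or None if fp is not stored."""
        cid = self.fp_to_container.get(fp)
        if cid is not None:               # cache hit: duplicate segment
            self.cache.move_to_end(cid)
            return cid
        cid = self.disk_index.get(fp)     # miss: consult the on-disk index
        if cid is None:
            return None                   # genuinely new segment
        self._load_container_metadata(cid)
        return cid

    def _load_container_metadata(self, cid):
        if len(self.cache) >= self.max_containers:       # replacement
            _old_cid, old_fps = self.cache.popitem(last=False)
            for f in old_fps:
                self.fp_to_container.pop(f, None)
        fps = {meta[0] for meta in self.containers[cid].meta}
        self.cache[cid] = fps
        for f in fps:
            self.fp_to_container[f] = cid
```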
Putting Them Together
[Diagram: an incoming fingerprint is checked against the Index Cache (hit ⇒ duplicate); on a miss it is tested against the Summary Vector (no ⇒ new segment); on a "maybe" it is looked up in the Disk Index and the matching container's metadata section is loaded into the Index Cache, with replacement.]
Disk I/O Reduction Results

                          Exchange data (2.56TB)         Engineering data (2.39TB)
                          135 daily full backups         100-day daily inc, weekly full
                          # disk I/Os     % of total     # disk I/Os     % of total
No summary, no SISL/LPC   328,613,503     100.00%        318,236,712     100.00%
Summary only              274,364,788      83.49%        259,135,171      81.43%
SISL/LPC only              57,725,844      17.57%         60,358,875      18.97%
Summary & SISL/LPC          3,477,129       1.06%          1,257,316       0.40%
Deduplication Throughput
[Chart: dedup throughput (MB/sec) and revenue ($million) from 2001 to 2011, annotated with milestones A1, A2, B, C, IPO, and M/A.]
Improved by ~100X in 6 years
Physical Capacity
[Chart: physical capacity (TB) and revenue ($million) from 2001 to 2011, annotated with milestones A1, A2, B, C, IPO, and M/A.]
Increased by ~330X in 6 years
Even More Details: Concurrent Garbage Collection
[Diagram: segmented data stream → dedup storage controller → metadata and unique data segments; files F1 and F2 share stored segments A, B, C, D.]
•  A segment may be shared by multiple files
•  Backups need to be deleted sometime
•  GC cannot interfere with backups
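These constraints are commonly met with a mark-and-sweep pass over the live file recipes; the sketch below illustrates that general idea (reusing the Container sketch from earlier) and is not Data Domain's actual garbage collector, which must also run concurrently with incoming backups.

```python
def mark_and_sweep(live_files, containers):
    """Illustrative mark-and-sweep. 'live_files' maps file -> list of fingerprints
    (the file recipes); 'containers' maps containerID -> Container.
    Returns, per container, the set of fingerprints still referenced."""
    # Mark: a segment is live if any surviving file still references it,
    # so deleting one backup never drops a segment shared with another.
    live = set()
    for recipe in live_files.values():
        live.update(recipe)

    # Sweep: for each container, keep only live segments. Fully dead containers
    # can be reclaimed; partially dead ones would be copied forward by a
    # background task so reclamation does not interfere with incoming backups.
    survivors = {}
    for cid, container in containers.items():
        survivors[cid] = {fp for (fp, _off, _len) in container.meta if fp in live}
    return survivors
```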
Summary
•  Dedup storage systems
–  Replace tape libraries
–  Near-line and archival storage
–  Primary storage
•  More redundancy in larger windows
•  Locality preserved caching
–  Reduce cost
–  Achieve high throughput
Review Topics
•  OS structure
•  Process management
•  CPU scheduling
•  I/O devices
•  Virtual memory
•  Disks and file systems
•  General concepts
Operating System Structure
•  Abstraction
•  Protection and security
•  Kernel structure
–  Layered
–  Monolithic
–  Micro-kernel
•  Virtualization
–  Virtual machine monitor
Process Management
•  Implementation
–  State, creation, context switch
–  Threads and processes
•  Synchronization
–  Race conditions and inconsistencies
–  Mutual exclusion and critical sections
–  Semaphores: P() and V()
–  Atomic operations: interrupt disable, test-and-set
–  Monitors and condition variables
–  Mesa-style monitors
•  Deadlocks
–  How do deadlocks occur?
–  How to prevent deadlocks?
CPU Scheduling
•  Allocation
–  Non-preemptible resources
•  Scheduling -- Preemptible resources
–  FIFO
–  Round-robin
–  STCF
–  Lottery
I/O Devices
•  Latency and bandwidth
•  Interrupts and exceptions
•  DMA mechanisms
•  Synchronous I/O operations
•  Asynchronous I/O operations
•  Message passing
Virtual Memory
•  Mechanisms
–  Paging
–  Segmentation
–  Combined paging and segmentation
–  TLB and its management
•  Page replacement
–  FIFO with second chance
–  Working sets
–  WSClock
Disks and File Systems
•  Disks
–  Disk behavior and disk scheduling
–  RAID4, RAID5 and RAID6
•  Flash memory
–  Write performance
–  Wear leveling
–  Flash translation layer
•  Directories and implementation
•  File layout
•  Buffer cache
•  Transaction and journaling file system
•  NFS and stateless file system
•  Snapshot
•  Deduplication file system
Major Concepts
•  Locality
–  Spatial and temporal locality
•  Scheduling
–  Use the past to predict the future
•  Layered abstractions
–  Synchronization, transactions, file systems, etc
•  Caching
–  TLB, VM, buffer cache, etc
Operating System as Illusionist
Physical reality → Abstraction
•  Single CPU → infinite number of CPUs
•  Interrupts → cooperating sequential threads
•  Limited memory → unlimited virtual memory
•  No protection → each address space has its own machine
•  Raw storage device → organized and reliable storage system
Future Courses in Systems
•  Spring
–  COS 461: Computer Networks
–  COS 598C: Analytics and Systems of Big Data
•  Fall
–  COS 432: Computer Security
–  COS 561: Advanced Computer Networks (or COS 518: Advanced OS)
–  Some grad seminars in systems