Project Description
Userspace Deduplication File System using FUSE
CS 519 Operating Systems Theory
October 24, 2013
Due by Midnight, Friday, December 13, 2013
Background
As we will cover later in the course, disk-based deduplication storage [1,3] has emerged as a dominant
form of cost-efficient storage for data protection. In summary, as data is written to (or ingested by)
such storage systems, deduplication mechanisms remove redundant segments of the data to compress
the data into a highly compacted form. Data protection (i.e. disk backup) is the canonical application
for such a system, and a key requirement for this is high write throughput. As Zhu et al. [3] state,
throughput is a critical requirement for enterprise data protection since deduplication storage systems
have been used to replace more traditional tape-based data protection systems (which typically have
high streaming I/O throughput).
In this project, we will explore the core features and functionality of a disk-based deduplication storage
system by implementing one possible architecture. The remainder of this project description document
covers the phases of the project, provides some implementation guidelines and hints, and specifies the
submission requirements. As has been stated for the homeworks, please do not wait until the final
week to start working on this project. It will be time consuming and difficult to correctly implement
this system.
Project Overview
The goal of this project is to build a functioning deduplication file system. We will build our system
using the FUSE [4] file system framework within the Linux operating system environment. You will
NOT be using xv6 for this project. This will enable us to use it just like any other file system, but will
allow us to code and debug it as a userspace program. Figure 1 illustrates the high-level architecture of
the file system. It also includes some lower-level details, but there may be other details or data
structures not fully described. You should only consider this to be an informed set of suggested
guidelines, rather than a final complete description.
In the figure, there are three layers: (i) the VFS interface layer, (ii) the data/metadata handling layer,
and (iii) the log-structured file system layer (persistence layer). The VFS layer is at the top of the
software stack and represents the glue code that implements the FUSE APIs to support the required file
system calls. Some of those system calls are illustrated in the figure (e.g., read(), write(),
create()). The next layer down in the stack implements the logic to handle the data and
metadata operations. Since our system will only perform deduplication on file system data (data
operations), we can segregate the two types of operations (data and metadata) into different sub-layers.
The next layer of the stack is the persistence layer, which we will implement as a type of log-structured
file system. Since the upper layers will implement the file system interface, this layer will primarily
handle the storage and retrieval of fixed-sized objects, which we will refer to as containers. Finally,
below that is the disk device. For this project, you should just emulate a real disk device by using a
persistent flat file in the real file system of the PC. This file can be allocated, for example, the first
time the file system is mounted, or you could provide an external utility to pre-allocate the “disk”.
Your system should treat this file as a linear array of fixed-sized blocks. There is no need to do any
further disk layer emulation.
Figure 1: System Architecture Diagram
VFS Layer: This layer is the thinnest layer in the stack. It simply represents the code required to
register your system module(s) within the FUSE framework such that they will be able to handle all of
the standard file system calls. These calls are listed in the struct fuse_operations data
structure within the fuse.h header file.
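
For illustration, a minimal skeleton of this glue code might look like the following sketch (the module
and handler names, such as dedupfs_getattr, are placeholders, and this fragment only answers getattr
for the root directory):

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>
    #include <errno.h>
    #include <string.h>

    /* Placeholder getattr handler: report a root directory and nothing else. */
    static int dedupfs_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;
            st->st_nlink = 2;
            return 0;
        }
        return -ENOENT;
    }

    /* Register handlers with FUSE; add entries as you implement them. */
    static struct fuse_operations dedupfs_ops = {
        .getattr = dedupfs_getattr,
        /* .create, .read, .write, .mkdir, .unlink, ... */
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &dedupfs_ops, NULL);
    }

Building against libfuse (e.g., with pkg-config fuse) and mounting the resulting binary on an empty
directory is enough to exercise a skeleton like this one.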
Deduplication Sub-Layer: This component handles the coalescing and packing of data from the user
to be stored in the LFS layer, and the retrieval and unpacking of data from the LFS to be returned to the
user. In order to pack the data (during write operations), this layer performs a set of data manipulation
operations to reduce the data in size as much as possible. First, it creates fixed-sized chunks from the
data (e.g., 4 KB). Then, it uses a cryptographically secure hash function (e.g., SHA1) to
determine when any fixed-sized chunk is a duplicate of an existing chunk. For any unique chunks, it
compresses and packs them into containers, which are ultimately stored in the LFS layer. Of course,
this description does not cover all of the detailed steps required to safely store and retrieve the data
from the persistent store.
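
For instance, fingerprinting one fixed-sized chunk might look like the following sketch, assuming
OpenSSL's SHA-1 and a 4 KB chunk size (both are illustrative choices, not requirements):

    #include <openssl/sha.h>
    #include <stddef.h>

    #define CHUNK_SIZE 4096                      /* one segment = one 4 KB chunk */

    /* Compute the 20-byte SHA-1 fingerprint that identifies a chunk's contents. */
    void fingerprint_chunk(const unsigned char *chunk,
                           unsigned char fp[SHA_DIGEST_LENGTH])
    {
        SHA1(chunk, CHUNK_SIZE, fp);
    }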
Namespace Sub-Layer: This component handles all of the metadata operations related to supporting a
traditional hierarchical file system namespace. Therefore, it needs to implement the mechanisms that
provide the abstraction of files and directories ordered in a hierarchy headed by a root directory. It
should also implement the mechanisms to provide simple support for file system statistics (e.g., create
time, access time, modification time, and file size) and UNIX-style user permissions (i.e. rwx for
users, groups, and others). Also, if you need to implement any custom file system calls, you should do
so as a custom ioctl().
Log-Structured File System Layer: This component implements the persistence layer. The interface
provided to the upper layers should be as simple as possible but still provide the required functionality.
This layer implements container-based log-structured storage. That is, new containers are appended to
a log whenever they are written out to disk. Containers on disk are immutable. Containers (or portions of
containers as an optimization) can be read from disk. Each container can (and probably should) have a
metadata section as a header or footer that describes the container’s contents (to aid reads and
cleaning). The containers can store data or metadata, such that all file system information can be (and
should be) made persistent through the mount/umount cycle. Finally, an LFS can suffer from
fragmentation. So, it should implement a cleaning mechanism in order to coalesce containers to
aggregate free space for future container appends.
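
One possible shape for a container header is sketched below; the field names, the segment count, and
the fixed 20-byte fingerprint are all illustrative rather than prescribed:

    #include <stdint.h>

    #define FP_LEN             20                /* e.g., a SHA-1 digest */
    #define SEGS_PER_CONTAINER 64                /* e.g., 256 KB container / 4 KB segments */

    struct segment_entry {
        unsigned char fingerprint[FP_LEN];       /* identifies the segment's contents */
        uint32_t      offset;                    /* byte offset of the segment in the container */
        uint32_t      length;                    /* stored (possibly compressed) length */
        uint8_t       live;                      /* 1 = live, 0 = invalidated (aids cleaning) */
    };

    struct container_header {
        uint64_t container_id;                   /* assigned by the LFS layer on append */
        uint32_t type;                           /* data container vs. metadata container */
        uint32_t num_segments;                   /* valid entries in segments[] */
        struct segment_entry segments[SEGS_PER_CONTAINER];
    };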
Project Phases
This section of the document describes a rough phasing for the project and the requirements for each
phase. This is meant to guide you and help get you started. It is not meant to be a step-by-step HowTo
guide, though. The key to doing well on this project is not blindly trying to follow the steps outlined in
this section, but to THINK about the problems and issues. Don’t just jump into coding or you will
likely waste hours of your time. You should consider the architecture first, then design the layers and
system components logically. This includes understanding the various conditions under which the
system will operate. That is, you should consider the following questions as well as others that occur to
you. How will each type of file system call be handled? What corner cases will arise and how will the
system handle them correctly? Where are the performance bottlenecks? What optimizations can be
leveraged to alleviate them? What are the performance trade-offs? What workloads will the system
service with good performance? What workloads will experience poor performance? As part of your
project write-up, you should describe the thinking behind your system. For example, what design
choices did you consider? Why did you choose one set of design parameters over another? This step is
as important as any other step, since it is a way for you to demonstrate your thought process.
Phase 0: Working with FUSE
In this project, you will be expected to implement the entire system as a FUSE-based [4] file system
within the Linux operating system environment. So, the first phase of the project is for you to get
familiar with using FUSE and coding FUSE-based file systems.
To complete this phase you must satisfy the following requirements.
Requirement 1: Work through the following on-line tutorial:
Writing a FUSE Filesystem: a Tutorial
Please note that this phase is recommended, but optional. There will be no credit given for successfully
learning how to use and code within the FUSE framework. Therefore, if you are completely familiar
with FUSE, feel free to skip it. Otherwise, you should make sure you complete this phase.
Phase 1: Building a Log-structured File System in FUSE
As we will also discuss later in the course, log-structured file systems (LFS) [2] have been proposed as
an alternative approach for organizing disk-based storage. The key idea in this type of file system is to
treat a disk as a single large log. Data, once written to the log, is considered immutable, and new data
can only ever be appended to the log. As such, updates to existing data occur as invalidations to prior
versions and appends of new versions (potentially copying forward unmodified portions of live data,
i.e. read-modify-write).
Under LFS, disk devices are still addressed as a linear array of blocks, but to improve I/O performance
(especially for small writes) individual data updates are collected into larger units for writing. In the
context of this project, we will refer to the I/O access unit as a container. As shown in Figure 1, a
container typically consists of some multiple of a disk block (two blocks per container in the Figure 1
example, though a container may span more blocks). The benefit is that the seek time of the disk head is
amortized across a large sequential transfer, achieving close to full disk bandwidth for reads and
writes. Finally, disk space
management becomes an issue for log-structured file systems. As the log is written and the
invalidations/updates occur, the file system becomes more and more fragmented. Cleaning is an
important process that an LFS uses to compact portions of the log to free larger extents of space for
future log appends. Such cleaning can hurt file system performance if implemented in a sub-optimal
manner, though.
To complete this phase you must satisfy the following requirements.
Requirement 1: Define and implement the core on-disk data structures and mechanisms required for
file and directory creation, retrieval, and destruction (see Table I in the LFS paper [2]). You should
document these data structures and fully describe them in your writeup.
Requirement 2: Implement a simplified file system interface. Since the LFS will ultimately be the
bottom layer of the deduplication storage software stack, you will only need a thin interface between
this layer and the upper layer. You will need to define an interface such that the LFS will minimally
support the following operations: (i) writing a new container to the LFS (the LFS should return a
unique container id for this newly written container), and (ii) reading a specific container (identified by
its container id) from the LFS. In this version of an LFS, all I/O operations are done at the granularity
of whole containers. Also, you may need other interface functions. Finally, for testing purposes, you
should be able to compile and link against this simplistic LFS to test it by issuing sequences of
container writes and reads. You should document your final API as function stubs with descriptions as
part of your project writeup.
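
For example, the documented stubs might look something like the following sketch (the names and
types are suggestions only):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t container_id_t;

    /* Append a whole container to the log. Returns the id of the newly written
     * container, or (container_id_t)-1 on error. */
    container_id_t lfs_write_container(const void *buf, size_t len);

    /* Read the whole container identified by id into buf (buf holds len bytes).
     * Returns 0 on success, -1 on error. */
    int lfs_read_container(container_id_t id, void *buf, size_t len);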
Requirement 3: Implement a cleaning mechanism for your log-structured file system layer. The
cleaning mechanism should not run automatically, but instead should support manually cleaning
specific containers via a function call. This call will be needed so that the upper layer components can
implement deduplication-specific garbage collection. The goal of cleaning is to copy forward live
portions of the container into a new container, thereby removing internal container fragmentation
caused by segment invalidation due to overwrites. So, the call should minimally be given a container
id and a vector of live segments for the container. It may require other arguments. You should
document this feature and any related function calls in your writeup.
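
A possible signature for this call is sketched below; it expresses the vector of live segments as an
array of segment indices, which is only one of several reasonable encodings (container_id_t is the
same type used in the interface sketch above):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t container_id_t;

    /* Copy the listed live segments of old_id forward into a new container and
     * free old_id. Returns the id of the new container, or (container_id_t)-1. */
    container_id_t lfs_clean_container(container_id_t old_id,
                                       const uint32_t *live_segments,
                                       size_t num_live);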
Phase 2: Deduplication Layer Implementation
Once you have implemented and tested the LFS layer, it is time to start building the upper layer
components. This layer includes the FUSE file system interface and will provide support for all of the
required VFS API functions. There are generally three types of calls that can be made into this layer:
data write operations, data read operations, and metadata operations. Your design should consider and
support all three types. Also, your file system will need to provide support for a namespace. This boils
down to supporting the directory and file abstractions.
Namespace and associated metadata operations: For this simple version of a deduplicated file
system, you should maintain separate metadata-only containers for inodes and directory data. As in
LFS, you should have an inode map (stored at a fixed location on disk) that maps file inodes to data
containers and directory inodes to metadata containers. Also, you may need to keep indirect blocks of
inodes stored in metadata containers to support large files. A directory is just a special type of file that
contains a map of sub-object (subdirectory or file) names to inode numbers. So, to resolve a file name
to an inode you must scan the parent directory of the file for the name to inode number mapping. Then
you use the inode number as an index into the inode map to determine the file stats and data containers.
File stats could be stored in the inode map directly while file data will be stored in data containers.
The layout of file data for a file is described by a file recipe. Essentially, a file recipe is an ordered list
of file segments (chunks) and their locations. In a traditional file system, this would be the list of blocks
composing the file. In a deduplicating file system, this is a list of segments. Each entry in the file
recipe maps a file segment to its fingerprint. A fingerprint is a cryptographically secure hash (e.g.,
SHA1) of a data segment used to uniquely identify the data segment.
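
For illustration, a recipe entry might be as simple as the following sketch (the field names and the
20-byte SHA-1 fingerprint length are assumptions):

    #include <stdint.h>

    #define FP_LEN 20                            /* SHA-1 digest length */

    /* One recipe entry per segment, stored in file order; the recipe itself is
     * just an ordered array of these records. */
    struct recipe_entry {
        uint64_t      file_offset;               /* logical offset of the segment in the file */
        unsigned char fingerprint[FP_LEN];       /* identifies the segment's contents */
    };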
The namespace maintains an index of fingerprints to container IDs. Since the data is stored in the file
system in a deduplicated manner, there will only ever be one entry in the fingerprint index per unique
data segment in the file system. Finally, each container should have a header or footer that stores
container metadata and maps the fingerprints for segments stored in a container to the block offset in
that container where the data segment starts.
Write operations: A file system client will write data to the file system in terms of files. Therefore,
your file system must translate all file write operations ultimately into operations on the disk layer
(block writes). For illustration purposes, let's assume that all writes enter the file system as a <file_id,
offset, size, buffer> tuple. You must align the buffer on disk block boundaries (which may entail
reading some data from disk before applying the updates), and then deduplicate the file blocks. For this
assignment, let us make the simplifying assumption that the granularity of deduplication (segment size)
is the same as the granularity of disk I/O (file system block size). We can set it to some fixed power of
2 size, such as 4 KB. Note that by doing so, we are choosing to do fixed-sized deduplication.
Therefore, to perform deduplication, we must generate the fingerprint (unique hash) of each 4 KB
block and check the fingerprint index to see if it exists in the system already. If so, then we update the
file recipe to point to the existing entry and move on to the next block. If not, then we add the segment
to the next available container, update the container header for the newly added segment, add an entry
in the fingerprint index for the new unique fingerprint/segment, and update the file recipe to point to
the new fingerprint entry. There are, of course, numerous corner cases that you will have to check for
and handle correctly. Also, there may be optimizations that you can apply to improve the performance
of the system, while maintaining correctness. For example, you can apply compression (e.g., gzip or an
LZ-based compressor) to each segment prior to packing it into a container to improve data reduction of the file system even
further.
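
The per-block decision described above can be summarized in the following sketch; the helpers
fp_index_lookup, fp_index_insert, pack_into_open_container, and recipe_append are hypothetical
stand-ins for your own components:

    #include <openssl/sha.h>
    #include <stdint.h>

    #define CHUNK_SIZE 4096

    struct file_recipe;                          /* opaque here */

    /* Assumed helpers provided elsewhere in your system: */
    int      fp_index_lookup(const unsigned char *fp, uint64_t *cid);
    int      fp_index_insert(const unsigned char *fp, uint64_t cid);
    uint64_t pack_into_open_container(const unsigned char *block, const unsigned char *fp);
    void     recipe_append(struct file_recipe *recipe, const unsigned char *fp);

    /* Deduplicate one aligned 4 KB block of an incoming write. */
    void dedup_write_block(struct file_recipe *recipe, const unsigned char *block)
    {
        unsigned char fp[SHA_DIGEST_LENGTH];
        uint64_t cid;

        SHA1(block, CHUNK_SIZE, fp);             /* fingerprint the block */

        if (fp_index_lookup(fp, &cid)) {         /* duplicate: reference the existing segment */
            recipe_append(recipe, fp);
            return;
        }
        cid = pack_into_open_container(block, fp); /* unique: (compress and) pack it */
        fp_index_insert(fp, cid);                  /* record where it now lives */
        recipe_append(recipe, fp);                 /* recipe points at the new entry */
    }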
Read operations: A file system client also performs reads in terms of files. For illustration purposes,
let's assume that all reads enter the file system as a <file_id, offset, size> tuple. Again, you must align
read requests to disk block boundaries (even though you will only return the requested data based on
offset and size). For each segment (disk block) of the file that must be read, we perform the following
steps. First, we fetch the file recipe and fetch the fingerprint for the specific segment we wish to read.
Then, we query the fingerprint index to find the container id of the container that stores the segment we
are about to read. We read the container header into memory, find the offset for the segment we seek,
and read out the segment from the LFS container. Again, there are numerous corner cases that you will
have to check for and handle correctly. Also, there may be optimizations that you can apply to improve
the performance of the system, while maintaining correctness.
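
The corresponding per-segment read path can be sketched as follows, with the same caveat that
recipe_fingerprint, fp_index_lookup, and container_read_segment are hypothetical helpers:

    #include <stdint.h>

    #define FP_LEN     20
    #define CHUNK_SIZE 4096

    struct file_recipe;                          /* opaque here */

    /* Assumed helpers provided elsewhere in your system: */
    int fp_index_lookup(const unsigned char *fp, uint64_t *cid);
    int recipe_fingerprint(const struct file_recipe *recipe, uint64_t block_index,
                           unsigned char *fp_out);
    int container_read_segment(uint64_t cid, const unsigned char *fp,
                               unsigned char *out, unsigned int out_len);

    /* Fetch one aligned block of a file; returns 0 on success, -1 on error. */
    int dedup_read_block(const struct file_recipe *recipe, uint64_t block_index,
                         unsigned char *out)
    {
        unsigned char fp[FP_LEN];
        uint64_t cid;

        if (recipe_fingerprint(recipe, block_index, fp) != 0)
            return -1;                           /* no such segment in the recipe */
        if (!fp_index_lookup(fp, &cid))
            return -1;                           /* recipe and index disagree */
        /* container_read_segment() would consult the container header to find the
         * segment's offset and copy (decompressing if needed) its bytes into out. */
        return container_read_segment(cid, fp, out, CHUNK_SIZE);
    }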
To complete this phase you must satisfy the following requirements.
Requirement 1: Implement the required set of VFS functions within the FUSE framework (see fuse.h
for API details). Your implementation of the VFS API should comprehensively cover the complete
(reasonable) set of functions, such that it can be used as a typical file system. If you choose to leave
any functions unimplemented, please justify this decision in your writeup. Also, if you extend the API
(via custom ioctl()'s, for example) also explain this in the writeup.
Requirement 2: Implement a fixed-sized deduplication mechanism using simple file recipes. It
should support in-line deduplication. This means that as data is written into the system, it should be
deduplicated prior to being written out to disk. File recipes should also be implemented as part of the
deduplication mechanism. These can be stored contiguously within metadata containers, and do not
need to be deduplicated. I have sketched a rough workflow in the description above that meets these
criteria, but you are free to consider other design possibilities. The choice of hash function for
deduplication is an important one as it impacts both performance and correctness. If the hash is too
weak, then you run the risk of having a high number of collisions in the fingerprint namespace (i.e. data
corruption). If the hash is too strong then the performance of the system will suffer due to high CPU
and memory requirements. As part of your writeup, please justify your design decision for this part.
Also, please describe any optimizations you have included to improve performance while not
sacrificing correctness.
Requirement 3: Implement a fingerprint index as a means to map cryptographically secure hash
strings to container IDs. The choice of data structure should be considered carefully, since the system
will query this index frequently in the critical performance path of reads and writes. As part of your
writeup, describe the structure of the fingerprint index and justify your choice of this structure within
the context of the performance and correctness of the system. Also, describe any other data structures
you considered (if any), the trade-offs in the decision between them and why you ultimately did not
choose them.
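
As one concrete (but by no means mandatory) starting point, a chained in-memory hash table keyed on
the first bytes of the fingerprint works because SHA-1 output is already uniformly distributed. The
sketch below uses illustrative sizes and says nothing about persistence or memory footprint at scale,
which are exactly the trade-offs your writeup should address:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define FP_LEN      20
    #define NUM_BUCKETS (1u << 20)               /* ~1M buckets; tune to expected index size */

    struct fp_entry {
        unsigned char    fingerprint[FP_LEN];
        uint64_t         container_id;
        struct fp_entry *next;
    };

    static struct fp_entry *buckets[NUM_BUCKETS];

    static uint32_t fp_bucket(const unsigned char *fp)
    {
        uint32_t h;
        memcpy(&h, fp, sizeof(h));               /* SHA-1 bytes are already uniform */
        return h & (NUM_BUCKETS - 1);
    }

    /* Returns 1 and fills *cid if the fingerprint is already known, else 0. */
    int fp_index_lookup(const unsigned char *fp, uint64_t *cid)
    {
        struct fp_entry *e;
        for (e = buckets[fp_bucket(fp)]; e != NULL; e = e->next) {
            if (memcmp(e->fingerprint, fp, FP_LEN) == 0) {
                *cid = e->container_id;
                return 1;
            }
        }
        return 0;
    }

    /* Returns 0 on success, -1 on allocation failure. */
    int fp_index_insert(const unsigned char *fp, uint64_t cid)
    {
        struct fp_entry *e = malloc(sizeof(*e));
        if (e == NULL)
            return -1;
        memcpy(e->fingerprint, fp, FP_LEN);
        e->container_id = cid;
        e->next = buckets[fp_bucket(fp)];
        buckets[fp_bucket(fp)] = e;
        return 0;
    }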
Phase 3: Pairing Deduplication with the Log-structured File System
By this point, we have a simplified log-structured file system to store and fetch containers,
deduplication logic to perform data reduction during write operations and to reassemble files during
read operations, a mechanism to pack unique segments into containers, and a set of namespace handlers
to implement the hierarchical namespace abstractions. By this point, you should have tested the
correctness of all of these components separately. Now, it is time to put them together. In this phase,
the system components should be connected together to support end-to-end handling of persistent
writes and reads of file data. Also, the namespace and any other required metadata should be
persistently stored within the LFS layer. The end result is a file system that can handle the VFS file
system operations between successive mounts and umounts.
To complete this phase you must satisfy the following requirement.
Requirement 1: A completed and tested deduplication file system module that works within the Linux
FUSE framework. As part of your writeup, please describe the issues encountered in combining the
software layers. Also, you should describe the tests you performed on the system to prove correctness
(including the various boundary conditions that can occur), as well as the tests you performed to
measure performance. This should include a simple graph or two that reports the performance of your
system under sequential file reads and writes, random file reads and writes, and metadata operations.
Phase 4: Project Write-up
Along with your code, you should prepare and submit a project write-up. Throughout this project
description document, I have pointed out things that should be included in your write-up. It is expected
that each team member will contribute to the writeup for the project. The structure of the write-up
should be as follows:
Authors: Please list the authors of the writeup at the top (these are all of the project team members, of
course).
Introduction Section: Summarizes your project write-up.
System Architecture and Design Section: Describe the architecture and design details of your
specific system implementation. Point out the key or interesting features and trade-offs in the design.
Implementation Section: Describe any relevant code issues/complexities. Also, this section should
delineate which layers/components/sections of the system were written by which members of the
team. It is expected that each member contribute substantively to both the coding and documentation
aspects of the project.
Evaluation Section: Describe how you evaluated your system and why you chose to do so in that
way. Also, present the results of your evaluation. Finally, if you can draw any conclusions regarding
your system, present those in this section along with the results. Be sure to support your claims with
the results, though.
Conclusion Section: Conclude the writeup by summarizing the key aspects of the project, results, and
major conclusions.
Extra Credit Extensions
The following is a list of possible ways to extend the base system. They are not required for
completion of the assignment, but provide a way for interested students to extend the project in
interesting ways. They are intentionally left somewhat open-ended to allow students to decide
exactly how far they wish to explore the optional topic. Extra credit given for any specific topic will
be based upon the depth and quality of coverage for that topic. Teams may choose to do more than one
topic, as well. Please note that you should first complete the base project requirements prior to
completing any of these extensions. Although you will receive credit for all work completed, it is more
efficient (in terms of credit) to have a completed system than to have an incomplete, yet extended
system.
Phase E1: Extra Credit – Garbage Collection
As mentioned earlier in the document, as well as in the LFS [2] paper, log-structured file systems can
suffer from fragmentation over long periods of use. The typical way to address this is through
cleaning. The idea behind cleaning is to select candidate containers, copy the live blocks forward into a
new container, and then free the selected container that has been cleaned. This will open a new
available container slot in the file system for a future container write.
Up to this point, we have simply ensured that there will always be an available free container slot in the
file system by over-provisioning the file backing the file system. Although we have implemented a
cleaning mechanism, we have not fully implemented garbage collection (GC). We shall do so now.
To complete this phase you must satisfy the following requirements.
Requirement 1: Candidate selection is an important part of the cleaning procedure. If the system
selects candidates that have too much live data, then copying forward will take more time. Also, since
cleaning might be required to free space in a timely manner in order to write out a container that is ready,
spending too much time selecting candidates will affect the performance of the system. To satisfy this
requirement, you should consider this trade-off and implement a candidate selection algorithm that
logically makes sense for the system. You should include a discussion of this decision in your project
writeup.
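
For example, a simple greedy policy that always cleans the container with the least live data is one
reasonable starting point; in the sketch below, live_bytes() is a hypothetical helper backed by your
per-container metadata:

    #include <stddef.h>
    #include <stdint.h>

    uint64_t live_bytes(uint64_t container_id);  /* assumed helper */

    /* Pick the candidate with the least live data from a set of container ids. */
    uint64_t select_cleaning_candidate(const uint64_t *ids, size_t n)
    {
        uint64_t best;
        size_t i;

        if (n == 0)
            return (uint64_t)-1;                 /* nothing to clean */
        best = ids[0];
        for (i = 1; i < n; i++)
            if (live_bytes(ids[i]) < live_bytes(best))
                best = ids[i];
        return best;
    }

A greedy policy minimizes copy-forward cost for the next clean, but scanning every container on each
call may be too slow; keeping containers in a structure sorted by liveness is one way to bound selection time.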
Requirement 2: Choosing when to run garbage collection is another important design decision.
Should it be run constantly? Should it only run when the system is idle? Should it be run whenever a
free container slot is needed but unavailable? Or, should it be based upon the level of fragmentation of
the file system? To satisfy this requirement, you should consider the timing choices and choose an
algorithm that best fits the goals of your system. You should also include a discussion of this decision
in your project writeup.
Requirement 3: Implement an interface that can be called from the upper layer to control garbage
collection. This includes integrating GC with the upper layers of the software stack and evaluating the
impact of GC on the performance of your system. You should include a discussion of the GC
integration, as well as results from an evaluation of the performance impact on the system introduced
by GC. This might include, for example, a study of the common case performance vs. worst case
performance.
Phase E2: Extra Credit – Variable-sized Chunking
One of the early, simplifying, design decisions we made was to choose fixed-sized chunking (chunks
are the same as segments) instead of variable-sized (content defined) chunking. It has been shown [3]
that variable-sized chunking can increase the deduplication ratio (better compression) substantially. In
fact, in some cases fixed-sized chunking fails to deduplicate very well at all.
The goal of this system extension is to implement variable-sized chunking alongside fixed-sized
chunking in your system. This would allow you to evaluate the impact of content-defined chunking vs.
fixed-sized chunking.
To complete this phase you must satisfy the following requirements.
Requirement 1: It is not obvious how to correctly define chunk boundaries when you move away
from fixed-sized chunking. The typical way is to use a form of Rabin fingerprinting [5] (rolling
hashing) to deterministically identify chunk boundaries based on the data content. You should also try
to set the average chunk size by specifying a minimum and maximum size for each chunk. Of course, there
will be numerous corner cases that must be addressed, for example, end of file handling, small file
handling, zero-filled chunks, etc. To satisfy this requirement, you should consider the trade-offs in
rolling fingerprint choices (which fingerprinting function to use) and choose one that best matches your
system goals. As part of your write-up, you should discuss and justify your choice of chunking
algorithm.
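
A minimal sketch of the idea follows, using a toy Rabin-Karp-style rolling hash over a small window;
the window size, multiplier, boundary test, and size limits are all illustrative, and a true Rabin
fingerprint [5] (or a table-driven rolling hash) would take the toy hash's place in practice:

    #include <stddef.h>
    #include <stdint.h>

    #define WINDOW    48                         /* bytes covered by the rolling hash */
    #define BASE      2654435761u                /* odd multiplier for the toy hash */
    #define AVG_BITS  13                         /* boundary roughly every 8 KB on average */
    #define MIN_CHUNK 2048
    #define MAX_CHUNK 16384

    /* Return the length of the next chunk starting at data (len bytes remain). */
    size_t next_chunk_len(const unsigned char *data, size_t len)
    {
        uint32_t h = 0, pow = 1;
        size_t i;

        for (i = 0; i < WINDOW; i++)
            pow *= BASE;                         /* BASE^WINDOW, for removing old bytes */

        if (len <= MIN_CHUNK)
            return len;                          /* short tail: emit as one chunk */

        for (i = 0; i < len && i < MAX_CHUNK; i++) {
            h = h * BASE + data[i];              /* byte enters the window */
            if (i >= WINDOW)
                h -= pow * data[i - WINDOW];     /* byte leaves the window */
            if (i + 1 >= MIN_CHUNK && (h >> (32 - AVG_BITS)) == 0)
                return i + 1;                    /* content-defined boundary found */
        }
        return i;                                /* forced boundary at MAX_CHUNK (or EOF) */
    }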
Requirement 2: Although you could just replace your existing fixed-sized chunking algorithm with
the new variable-sized chunking algorithm, you would not be able to compare them side-by-side in a
systematic fashion. So, you should implement the new algorithm as a feature of the system in such a
way that a user could choose either fixed-sized chunking or variable-sized chunking during file system
initialization. This will allow us to compare one against the other. To satisfy this requirement, you
must implement the new algorithm and build in a way to optionally select one or the other. As part of
your write-up, you should discuss the implementation details of your variable-sized chunking
algorithm, and evaluate how well it works in practice. How well does it handle shifts in data? When
does it perform better (in terms of deduplication) than fixed-sized chunking? When does it do worse?
How does variable chunking affect the throughput performance of the system?
Phase E3: Extra Credit – Deduplicated File Recipes
A second simplifying design decision was to treat file recipes specially and to store them in an
uncompressed, contiguous format in containers. For a system that has a small amount of data
or a small number of files, this may be fine. However, as the number of files grows large and/or the
amount of data stored in the system scales, file recipes will consume a non-trivial portion of the
available storage. Therefore, one interesting extension is to store them in an efficient manner. Since
we already have a deduplication storage system available to us, it seems that we should be able to
utilize that same system to store our file recipes. As you will likely discover, this can complicate
things. In fact, the common way to handle this has been to utilize a data structure called a Merkle Tree
[6]. Now, instead of having an independent list of hashes for each file, we can have independent root
nodes for each file, and have duplicates share portions/nodes/subtrees of their Merkle Trees.
To complete this phase you must satisfy the following requirements.
Requirement 1: Start by implementing and testing a Merkle Tree data structure. You should be able
to create, update, and destroy these trees. This code will be utilized by the system to implement the file
recipes. As part of your write-up you should include a description of your Merkle Tree implementation
and interface. How will the system use the tree?
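
As a starting point, a node of a binary Merkle Tree over SHA-1 hashes might be sketched as follows
(binary fan-out and the field names are assumptions; practical systems often use a wider fan-out tied
to the number of fingerprints that fit in a block):

    #include <openssl/sha.h>
    #include <string.h>

    struct merkle_node {
        unsigned char       hash[SHA_DIGEST_LENGTH]; /* leaf: segment fingerprint;
                                                        internal: hash of the children */
        struct merkle_node *left;
        struct merkle_node *right;                   /* both NULL for a leaf */
    };

    /* Recompute an internal node's hash from its two children's hashes. */
    void merkle_rehash(struct merkle_node *n)
    {
        unsigned char buf[2 * SHA_DIGEST_LENGTH];

        if (n->left == NULL || n->right == NULL)
            return;                                  /* leaves keep their fingerprint */
        memcpy(buf, n->left->hash, SHA_DIGEST_LENGTH);
        memcpy(buf + SHA_DIGEST_LENGTH, n->right->hash, SHA_DIGEST_LENGTH);
        SHA1(buf, sizeof(buf), n->hash);
    }

Updates to a file then propagate hash changes along a root-to-leaf path, and identical subtrees of
different files hash to the same value and can be shared.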
Requirement 2: Now that you have a working Merkle Tree, implement file recipes using it. Once
again, the system should provide an option to choose between Merkle Trees and simple file recipes
during file system initialization. As part of your write-up, you should discuss the implementation
details of your Merkle Tree based file recipes, and evaluate how well they work in practice. How much
reduction in storage used (better deduplication) do Merkle Tree based file recipes provide? How do
Merkle Tree based file recipes affect the throughput performance of the system?
Phase E4: Extra Credit – Independent Proposals
You may have thought of some interesting extensions on your own. If so, please propose them to me
so that we can discuss. Your proposal should include a brief description of the feature you wish to
implement and the reasons why this feature would be good to add to the system. Your proposal should
also include some notion of how feasible it would be to implement in the system given the time until
the project is due. Finally, your proposal should include how, if possible, you will attempt to evaluate
the effectiveness of your proposed extension. If you are unsure about your idea and would like to get
preliminary feedback from me, please feel free to email me about it or talk to me after class.
Project Submission Guidelines
Minimally, the following items should be submitted as part of this project:
1) Project writeup document in PDF format.
2) Source code for your deduplication file system. This code should be compilable on a typical recent
x64-based Linux system. The code should include a Makefile and a README file. The Makefile
should build the whole system. The README file should describe how to build the system, what
external libraries (if any) are needed to build it, how to initialize the file system, and how to run it.
References
[1] Quinlan, S. and Dorward, S., Venti: A New Approach to Archival Data Storage. In Proceedings of
the 1st USENIX Conference on File and Storage Technologies (FAST '02). Berkeley, CA, 2002.
[2] Rosenblum, M. and Ousterhout, J. K., The Design and Implementation of a Log-Structured File
System. ACM Trans. Comput. Syst. 10, 1 (February 1992), 26-52.
[3] Zhu, B., Li, K., and Patterson, H., Avoiding the Disk Bottleneck in the Data Domain Deduplication
File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies
(FAST'08). Berkeley, CA, 2008.
[4] FUSE, Filesystem in Userspace, http://fuse.sourceforge.net/, 2013.
[5] Rabin Fingerprint, Wikipedia, http://en.wikipedia.org/wiki/Rabin_fingerprint, 2013.
[6] Merkle Tree, Wikipedia, http://en.wikipedia.org/wiki/Merkle_tree, 2013.