Project Description
Userspace Deduplication File System using FUSE
CS 519 Operating Systems Theory
October 24, 2013
Due by Midnight, Friday, December 13, 2013

Background

As we will cover later in the course, disk-based deduplication storage [1,3] has emerged as a dominant form of cost-efficient storage for data protection. In summary, as data is written to (or ingested by) such storage systems, deduplication mechanisms remove redundant segments of the data to compress the data into a highly compacted form. Data protection (i.e. disk backup) is the canonical application for such a system, and a key requirement for this is high write throughput. As Zhu et al. [3] state, throughput is a critical requirement for enterprise data protection, since deduplication storage systems have been used to replace more traditional tape-based data protection systems (which typically have high streaming I/O throughput).

In this project, we will explore the core features and functionality of a disk-based deduplication storage system by implementing one possible architecture. The remainder of this project description covers the phases of the project, provides some implementation guidelines and hints, and specifies the submission requirements. As has been stated for the homeworks, please do not wait until the final week to start working on this project. It will be time consuming and difficult to correctly implement this system.

Project Overview

The goal of this project is to build a functioning deduplication file system. We will build our system using the FUSE [4] file system framework within the Linux operating system environment. You will NOT be using xv6 for this project. This will enable us to use it just like any other file system, but will allow us to code and debug it as a userspace program. Figure 1 illustrates the high-level architecture of the file system. It also includes some lower-level details, but there may be other details or data structures not fully described.
You should only consider this to be an informed set of suggested guidelines, rather than a final, complete description. In the figure, there are three layers: (i) the VFS interface layer, (ii) the data/metadata handling layer, and (iii) the log-structured file system layer (persistence layer). The VFS layer is at the top of the software stack and represents the glue code that implements the FUSE APIs to support the required file system calls. Some of those system calls are illustrated in the figure (i.e. read(), write(), create(), etc.). The next layer down in the stack implements the logic to handle the data and metadata operations. Since our system will only perform deduplication on file system data (data operations), we can segregate the two types of operations (data and metadata) into different sub-layers. The next layer of the stack is the persistence layer, which we will implement as a type of log-structured file system. Since the upper layers will implement the file system interface, this layer will primarily handle the storage and retrieval of fixed-sized objects, which we will refer to as containers. Finally, below that is the disk device. For this project, you should just emulate a real disk device by using a persistent flat file in the real file system of the PC. This file can be allocated, for example, the first time the file system is mounted, or you could provide an external utility to pre-allocate the "disk". Your system should treat this file as a linear array of fixed-sized blocks. There is no need to do any further disk layer emulation.

Figure 1: System Architecture Diagram

VFS Layer: This layer is the thinnest layer in the stack. It simply represents the code required to register your system module(s) within the FUSE framework such that they will be able to handle all of the standard file system calls. These calls are listed in the struct fuse_operations data structure within the fuse.h header file.
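The flat-file disk emulation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the names (disk_open, disk_read_block, disk_write_block), the 4 KB block size, and the 1024-block disk size are all illustrative choices, not part of any required API.

```c
/* Sketch: emulating the disk as a flat file treated as a linear
 * array of fixed-size blocks. All names and sizes are illustrative.
 * Assumes POSIX open/ftruncate/pread/pwrite. */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096   /* one "disk" block, 4 KB */
#define NUM_BLOCKS 1024   /* pre-allocated disk size */

/* Open (or create) the backing file and size it to the full disk. */
static int disk_open(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)BLOCK_SIZE * NUM_BLOCKS) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read/write one whole block; block_no indexes the linear array. */
static ssize_t disk_read_block(int fd, unsigned block_no, void *buf) {
    return pread(fd, buf, BLOCK_SIZE, (off_t)block_no * BLOCK_SIZE);
}

static ssize_t disk_write_block(int fd, unsigned block_no, const void *buf) {
    return pwrite(fd, buf, BLOCK_SIZE, (off_t)block_no * BLOCK_SIZE);
}
```

Using pread/pwrite keeps block I/O stateless (no seek bookkeeping), which simplifies the persistence layer built on top of it.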
Deduplication Sub-Layer: This component handles the coalescing and packing of data from the user to be stored in the LFS layer, and the retrieval and unpacking of data from the LFS to be returned to the user. In order to pack the data (during write operations), this layer performs a set of data manipulation operations to reduce the data in size as much as possible. First, it creates fixed-sized chunks from the data (e.g., 4 KB). Then, it uses some form of cryptographically secure hashing (e.g., SHA1) to determine when any fixed-sized chunk is a duplicate of an existing chunk. For any unique chunks, it compresses and packs them into containers, which are ultimately stored in the LFS layer. Of course, this description does not cover all of the detailed steps required to safely store and retrieve the data from the persistent store.

Namespace Sub-Layer: This component handles all of the metadata operations related to supporting a traditional hierarchical file system namespace. Therefore, it needs to implement the mechanisms that provide the abstraction of files and directories ordered in a hierarchy headed by a root directory. It should also implement the mechanisms to provide simple support for file system statistics (i.e. create time, access time, modification time, file size, etc.) and UNIX-style user permissions (i.e. rwx for users, groups, and others). Also, if you need to implement any custom file system calls, you should do so as a custom ioctl().

Log-Structured File System Layer: This component implements the persistence layer. The interface provided to the upper layers should be as simple as possible but still provide the required functionality. This layer implements container-based log-structured storage. That is, new containers are appended to a log whenever written out to disk. Containers on disk are immutable. Containers (or portions of containers as an optimization) can be read from disk.
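The dedup sub-layer's first two steps above (fixed-size chunking, then fingerprinting each chunk) can be sketched as follows. Note the hedge: FNV-1a is used here only so the example is self-contained; it is NOT collision-resistant, and the real system should use a cryptographic hash such as SHA1.

```c
/* Sketch of the chunk-and-fingerprint step of the dedup sub-layer.
 * FNV-1a is a stand-in for a real cryptographic hash (e.g., SHA1);
 * all names and sizes are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 4096

/* FNV-1a 64-bit: NOT collision-resistant; illustration only. */
static uint64_t fingerprint(const unsigned char *data, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Split a buffer into fixed-size chunks and fingerprint each one.
 * Returns the number of fingerprints written into fps[]. */
static size_t chunk_and_fingerprint(const unsigned char *buf, size_t len,
                                    uint64_t *fps, size_t max_fps) {
    size_t n = 0;
    for (size_t off = 0; off < len && n < max_fps; off += CHUNK_SIZE) {
        size_t clen = (len - off < CHUNK_SIZE) ? len - off : CHUNK_SIZE;
        fps[n++] = fingerprint(buf + off, clen);
    }
    return n;
}
```

Two chunks with identical bytes produce identical fingerprints, which is exactly the property the duplicate-detection step relies on.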
Each container can (and probably should) have a metadata section as a header or footer that describes the container's contents (to aid reads and cleaning). The containers can store data or metadata, such that all file system information can be (and should be) made persistent through the mount/umount cycle. Finally, an LFS can suffer from fragmentation, so it should implement a cleaning mechanism in order to coalesce containers and aggregate free space for future container appends.

Project Phases

This section of the document describes a rough phasing for the project and the requirements for each phase. This is meant to guide you and help get you started. It is not meant to be a step-by-step how-to guide, though. The key to doing well on this project is not blindly trying to follow the steps outlined in this section, but to THINK about the problems and issues. Don't just jump into coding or you will likely waste hours of your time. You should consider the architecture first, then design the layers and system components logically. This includes understanding the various conditions under which the system will operate. That is, you should consider the following questions, as well as others that occur to you. How will each type of file system call be handled? What corner cases will arise, and how will the system handle them correctly? Where are the performance bottlenecks? What optimizations can be leveraged to alleviate them? What are the performance trade-offs? Which workloads will the system service with good performance? Which workloads will experience poor performance?

As part of your project write-up, you should describe the thinking behind your system. For example, what design choices did you consider? Why did you choose one set of design parameters over another? This step is as important as any other step, since it is a way for you to demonstrate your thought process.
Phase 0: Working with FUSE

In this project, you will be expected to implement the entire system as a FUSE-based [4] file system within the Linux operating system environment. So, the first phase of the project is for you to get familiar with using FUSE and coding FUSE-based file systems. To complete this phase you must satisfy the following requirements.

Requirement 1: Work through the following on-line tutorial: Writing a FUSE Filesystem: a Tutorial

Please note that this phase is recommended, but optional. There will be no credit given for successfully learning how to use and code within the FUSE framework. Therefore, if you are completely familiar with FUSE, feel free to skip this. Otherwise, you should make sure you complete this phase.

Phase 1: Building a Log-structured File System in FUSE

As we will also discuss later in the course, log-structured file systems (LFS) [2] have been proposed as an alternative approach for organizing disk-based storage. The key idea in this type of file system is to treat a disk as a single large log. Data, once written to the log, is considered immutable, and new data can only ever be appended to the log. As such, updates to existing data occur as invalidations of prior versions and appends of new versions (potentially copying forward unmodified portions of live data, i.e. read-modify-write). Under LFS, disk devices are still addressed as a linear array of blocks, but to improve I/O performance (especially for small writes) individual data updates are collected into larger units for writing. In the context of this project, we will refer to the I/O access unit as a container. As shown in Figure 1, a container typically consists of some multiple of a disk block (2 blocks per container in the Figure 1 example; it can be more blocks per container, though). The benefit of this is to amortize the seek time of the disk head, achieving close to full disk bandwidth for reads and writes.
Finally, disk space management becomes an issue for log-structured file systems. As the log is written and the invalidations/updates occur, the file system becomes more and more fragmented. Cleaning is an important process that an LFS uses to compact portions of the log to free larger extents of space for future log appends. Such cleaning can hurt file system performance if implemented in a sub-optimal manner, though. To complete this phase you must satisfy the following requirements.

Requirement 1: Define and implement the core on-disk data structures and mechanisms required for file and directory creation, retrieval, and destruction (see Table I in the LFS paper [2]). You should document these data structures and fully describe them in your writeup.

Requirement 2: Implement a simplified file system interface. Since the LFS will ultimately be the bottom layer of the deduplication storage software stack, you will only need a thin interface between this layer and the upper layer. You will need to define an interface such that the LFS will minimally support the following operations: (i) writing a new container to the LFS (the LFS should return a unique container id for this newly written container), and (ii) reading a specific container (identified by its container id) from the LFS. In this version of an LFS, all I/O operations are done at the granularity of whole containers. Also, you may need other interface functions. Finally, for testing purposes, you should be able to compile and link against this simplistic LFS to test it by issuing sequences of container writes and reads. You should document your final API as function stubs and descriptions as part of your project writeup.

Requirement 3: Implement a cleaning mechanism for your log-structured file system layer. The cleaning mechanism should not run automatically, but instead should support manually cleaning specific containers via a function call.
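One possible shape for this thin LFS interface (the container write/read of Requirement 2, plus the manual cleaning call of Requirement 3) is sketched below. An in-memory log stands in for the on-disk flat file so the example is self-contained; every name, size, and signature here is an illustrative assumption, not a required API.

```c
/* Sketch of the thin LFS interface: append-only container writes
 * that return a container id, reads by id, and a manual cleaning
 * call that copies live segments forward. In-memory stand-in for
 * the on-disk log; all names and sizes are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 4096
#define SEGS_PER_CONTAINER 2 /* 2 blocks per container, as in Fig. 1 */
#define CONTAINER_SIZE (SEG_SIZE * SEGS_PER_CONTAINER)
#define MAX_CONTAINERS 64

static unsigned char log_area[MAX_CONTAINERS][CONTAINER_SIZE];
static uint32_t next_cid = 0; /* head of the append-only log */

/* Append a new, immutable container; returns its id, or -1 if full. */
static int lfs_write_container(const void *data) {
    if (next_cid >= MAX_CONTAINERS)
        return -1;
    memcpy(log_area[next_cid], data, CONTAINER_SIZE);
    return (int)next_cid++;
}

/* Read back a whole container by id; returns 0 on success. */
static int lfs_read_container(uint32_t cid, void *out) {
    if (cid >= next_cid)
        return -1;
    memcpy(out, log_area[cid], CONTAINER_SIZE);
    return 0;
}

/* Clean: copy the given live segments of container `cid` forward
 * into a fresh container; returns the new container's id. */
static int lfs_clean_container(uint32_t cid, const int *live, size_t nlive) {
    unsigned char fresh[CONTAINER_SIZE];
    if (cid >= next_cid)
        return -1;
    memset(fresh, 0, sizeof fresh);
    for (size_t i = 0; i < nlive; i++)
        memcpy(fresh + i * SEG_SIZE,
               log_area[cid] + live[i] * SEG_SIZE, SEG_SIZE);
    return lfs_write_container(fresh);
}
```

A real implementation would persist the log to the flat-file disk and reclaim the cleaned container's slot, but the interface shape (ids in, ids out, whole-container granularity) is the point here.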
This call will be needed so that the upper layer components can implement deduplication-specific garbage collection. The goal of cleaning is to copy forward live portions of the container into a new container, thereby removing internal container fragmentation caused by segment invalidation due to overwrites. So, the call should minimally be given a container id and a vector of live segments for the container. It may require other arguments. You should document this feature and any related function calls in your writeup.

Phase 2: Deduplication Layer Implementation

Once you have implemented and tested the LFS layer, it is time to start building the upper layer components. This layer includes the FUSE file system interface and will provide support for all of the required VFS API functions. There are generally three types of calls that can be made into this layer: data write operations, data read operations, and metadata operations. Your design should consider and support all three types. Also, your file system will need to provide support for a namespace. This boils down to supporting the directory and file abstractions.

Namespace and associated metadata operations: For this simple version of a deduplicated file system, you should maintain separate metadata-only containers for inodes and directory data. As in LFS, you should have an inode map (stored at a fixed location on disk) that maps file inodes to data containers and directory inodes to metadata containers. Also, you may need to keep indirect blocks of inodes stored in metadata containers to support large files. A directory is just a special type of file that contains a map of sub-object (subdirectory or file) names to inode numbers. So, to resolve a file name to an inode, you must scan the parent directory of the file for the name-to-inode-number mapping. Then you use the inode number as an index into the inode map to determine the file stats and data containers.
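The directory-scan step of that resolution can be sketched as follows. The entry layout and all names are assumptions for illustration; your on-disk format will differ.

```c
/* Sketch of name resolution in the namespace sub-layer: a directory
 * is a file holding name -> inode-number mappings, scanned linearly.
 * Entry layout and names are illustrative assumptions. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_MAX_LEN 255

struct dirent_rec {
    char name[NAME_MAX_LEN + 1]; /* sub-file or sub-directory name */
    uint32_t ino;                /* index into the inode map */
};

/* Scan a directory's entries for `name`; returns the inode number,
 * or 0 (reserved here to mean "not found" in this sketch). */
static uint32_t dir_lookup(const struct dirent_rec *ents, size_t n,
                           const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(ents[i].name, name) == 0)
            return ents[i].ino;
    return 0;
}
```

Resolving a full path is then a loop: look each component up in the current directory, use the inode map to fetch that directory's contents, and repeat.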
File stats could be stored in the inode map directly, while file data will be stored in data containers. The layout of file data for a file is described by a file recipe. Essentially, a file recipe is an ordered list of file segments (chunks) and their locations. In a traditional file system this would be the list of blocks composing the file. In a deduplicating file system, this is a list of segments. Each entry in the file recipe maps a file segment to its fingerprint. A fingerprint is a cryptographically secure hash (e.g., SHA1) of a data segment used to uniquely identify the data segment. The namespace maintains an index of fingerprints to container ids. Since the data is stored in the file system in a deduplicated manner, there will only ever be one entry in the fingerprint index per unique data segment in the file system. Finally, each container should have a header or footer that stores container metadata and maps the fingerprints for segments stored in a container to the block offset in that container where the data segment starts.

Write operations: A file system client will write data to the file system in terms of files. Therefore, your file system must translate all file write operations ultimately into operations on the disk layer (block writes). For illustration purposes, let's assume that all writes enter the file system as a <file_id, offset, size, buffer> tuple. You must align the buffer on disk block boundaries (which may entail reading some data from disk before applying the updates), then deduplicate the file blocks. For this assignment, let us make the simplifying assumption that the granularity of deduplication (segment size) is the same as the granularity of disk I/O (file system block size). We can set it to some fixed power-of-2 size, such as 4 KB. Note that by doing so, we are choosing to do fixed-sized deduplication.
Therefore, to perform deduplication, we must generate the fingerprint (unique hash) of each 4 KB block and check the fingerprint index to see if it exists in the system already. If so, then we update the file recipe to point to the existing entry and move on to the next block. If not, then we add the segment to the next available container, update the container header for the newly added segment, add an entry in the fingerprint index for the new unique fingerprint/segment, and update the file recipe to point to the new fingerprint entry. There are, of course, numerous corner cases that you will have to check for and handle correctly. Also, there may be optimizations that you can apply to improve the performance of the system, while maintaining correctness. For example, you can apply compression (e.g., gz, lz, etc.) to each segment prior to packing it in a container to improve the data reduction of the file system even further.

Read operations: A file system client also performs reads in terms of files. For illustration purposes, let's assume that all reads enter the file system as a <file_id, offset, size> tuple. Again, you must align read requests to disk block boundaries (even though you will only return the requested data based on offset and size). For each segment (disk block) of the file that must be read, we perform the following steps. First, we fetch the file recipe and fetch the fingerprint for the specific segment we wish to read. Then, we query the fingerprint index to find the container id of the container that stores the segment we are about to read. We read the container header into memory, find the offset for the segment we seek, and read out the segment from the LFS container. Again, there are numerous corner cases that you will have to check for and handle correctly. Also, there may be optimizations that you can apply to improve the performance of the system, while maintaining correctness.
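The write-path loop described above (fingerprint each segment, consult the index, store only unique segments, record every segment in the recipe) can be sketched as follows. The flat-array index and the FNV-1a hash are stand-ins so the example runs alone; a real system would use a cryptographic hash and persist the index and recipes.

```c
/* Sketch of the in-line dedup write path: fingerprint each segment,
 * look it up in the fingerprint index, store only unique segments,
 * and record one recipe entry per segment. All structures and names
 * are illustrative stand-ins. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 4096
#define MAX_SEGS 128

/* FNV-1a stand-in for a cryptographic fingerprint (e.g., SHA1). */
static uint64_t fp64(const unsigned char *d, size_t n) {
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *d++; h *= 1099511628211ULL; }
    return h;
}

/* fingerprint index: fingerprint -> storage slot (stands in for a
 * container id + offset in the real system) */
static uint64_t idx_fp[MAX_SEGS];
static size_t idx_n = 0;

static long idx_lookup(uint64_t fp) {
    for (size_t i = 0; i < idx_n; i++)
        if (idx_fp[i] == fp) return (long)i;
    return -1;
}

/* Dedup-write `nsegs` aligned segments; fills recipe[] with one slot
 * per segment. Returns how many NEW unique segments were stored. */
static size_t dedup_write(const unsigned char *buf, size_t nsegs,
                          long *recipe) {
    size_t new_segs = 0;
    for (size_t s = 0; s < nsegs; s++) {
        uint64_t fp = fp64(buf + s * SEG_SIZE, SEG_SIZE);
        long slot = idx_lookup(fp);
        if (slot < 0) {            /* unique: store it and index it */
            idx_fp[idx_n] = fp;
            slot = (long)idx_n++;
            new_segs++;
        }
        recipe[s] = slot;          /* recipe points at the one copy */
    }
    return new_segs;
}
```

Note that the recipe always gains one entry per segment written, while storage only grows for unique segments; that gap is the deduplication ratio.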
To complete this phase you must satisfy the following requirements.

Requirement 1: Implement the required set of VFS functions within the FUSE framework (see fuse.h for API details). Your implementation of the VFS API should comprehensively cover the complete (reasonable) set of functions, such that it can be used as a typical file system. If you choose to leave any functions unimplemented, please justify this decision in your writeup. Also, if you extend the API (via custom ioctl()s, for example), explain this in the writeup as well.

Requirement 2: Implement a fixed-sized deduplication mechanism using simple file recipes. It should support in-line deduplication. This means that as data is written into the system, it should be deduplicated prior to being written out to disk. File recipes should also be implemented as part of the deduplication mechanism. These can be stored contiguously within metadata containers, and do not need to be deduplicated. I have sketched a rough workflow in the description above that meets these criteria, but you are free to consider other design possibilities. The choice of hash function for deduplication is an important one, as it impacts both performance and correctness. If the hash is too weak, then you run the risk of having a high number of collisions in the fingerprint namespace (i.e. data corruption). If the hash is too strong, then the performance of the system will suffer due to high CPU and memory requirements. As part of your writeup, please justify your design decision for this part. Also, please describe any optimizations you have included to improve performance while not sacrificing correctness.

Requirement 3: Implement a fingerprint index as a means to map cryptographically secure hash strings to container ids. The choice of data structure should be considered carefully, since the system will query this index frequently in the critical performance path of reads and writes.
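As one candidate structure for that index, a fixed-size open-addressing hash table gives O(1) average lookups, which matters on the read/write critical path. This is a sketch only: the table size, linear probing, and 64-bit fingerprint keys are illustrative assumptions (real fingerprints are full hash digests, and a real index must handle growth and persistence).

```c
/* Sketch of one possible fingerprint-index structure: a fixed-size
 * open-addressing hash table mapping a fingerprint to a container
 * id. Sizes and probing scheme are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INDEX_BUCKETS 1024 /* power of two for cheap masking */

struct fp_entry { uint64_t fp; uint32_t cid; int used; };
static struct fp_entry table[INDEX_BUCKETS];

/* Insert fingerprint -> container id (linear probing); 0 on success. */
static int fpindex_put(uint64_t fp, uint32_t cid) {
    for (size_t i = 0; i < INDEX_BUCKETS; i++) {
        size_t b = (size_t)((fp + i) & (INDEX_BUCKETS - 1));
        if (!table[b].used || table[b].fp == fp) {
            table[b].fp = fp;
            table[b].cid = cid;
            table[b].used = 1;
            return 0;
        }
    }
    return -1; /* table full */
}

/* Look up a fingerprint; returns the container id, or -1 if absent. */
static long fpindex_get(uint64_t fp) {
    for (size_t i = 0; i < INDEX_BUCKETS; i++) {
        size_t b = (size_t)((fp + i) & (INDEX_BUCKETS - 1));
        if (!table[b].used)
            return -1;
        if (table[b].fp == fp)
            return (long)table[b].cid;
    }
    return -1;
}
```

An on-disk B-tree or a sorted log with an in-memory cache are the obvious alternatives to weigh in the writeup, trading lookup latency against memory footprint.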
As part of your writeup, describe the structure of the fingerprint index and justify your choice of this structure within the context of the performance and correctness of the system. Also, describe any other data structures you considered (if any), the trade-offs in the decision between them, and why you ultimately did not choose them.

Phase 3: Pairing Deduplication with the Log-structured File System

By this point, we have a simplified log-structured file system to store and fetch containers, deduplication logic to perform data reduction during write operations and to reassemble files during read operations, a mechanism to pack unique segments into containers, and a set of namespace handlers to implement the hierarchical namespace abstractions. You should have tested the correctness of all of these components separately. Now, it is time to put them together. In this phase, the system components should be connected together to support end-to-end handling of persistent writes and reads of file data. Also, the namespace and any other required metadata should be persistently stored within the LFS layer. The end result is a file system that can handle the VFS file system operations between successive mounts and umounts. To complete this phase you must satisfy the following requirement.

Requirement 1: A completed and tested deduplication file system module that works within the Linux FUSE framework. As part of your writeup, please describe the issues encountered in combining the software layers. Also, you should describe the tests you performed on the system to prove correctness (including the various boundary conditions that can occur), as well as the tests you performed to measure performance. This should include a simple graph or two that reports the performance of your system under sequential file reads and writes, random file reads and writes, and metadata operations.
Phase 4: Project Write-up

Along with your code, you should prepare and submit a project write-up. Throughout this project description document, I have pointed out things that should be included in your write-up. It is expected that each team member will contribute to the writeup for the project. The structure of the write-up should be as follows:

Authors: Please list the authors of the writeup at the top (these are all of the project team members, of course).

Introduction Section: Summarizes your project write-up.

System Architecture and Design Section: Describe the architecture and design details of your specific system implementation. Point out the key or interesting features and trade-offs in the design.

Implementation Section: Describe any relevant code issues/complexities. Also, this section should delineate which layers/components/sections of the system were written by which members of the team. It is expected that each member contribute substantively to both the coding and documentation aspects of the project.

Evaluation Section: Describe how you evaluated your system and why you chose to do so in that way. Also, present the results of your evaluation. Finally, if you can draw any conclusions regarding your system, present those in this section along with the results. Be sure to support your claims with the results, though.

Conclusion Section: Conclude the writeup by summarizing the key aspects of the project, results, and major conclusions.

Extra Credit Extensions

The following is a list of possible ways to extend the base system. They are not required for completion of the assignment, but provide a way for interested students to extend the project in interesting ways. They are intentionally left somewhat open-ended to allow students to decide exactly how far they wish to explore the optional topic. Extra credit given, for any specific topic, will be based upon the depth and quality of coverage for that topic.
Teams may choose to do more than one topic, as well. Please note that you should first complete the base project requirements prior to completing any of these extensions. Although you will receive credit for all work completed, it is more efficient (in terms of credit) to have a completed system than to have an incomplete, yet extended, system.

Phase E1: Extra Credit – Garbage Collection

As mentioned earlier in the document, as well as in the LFS [2] paper, log-structured file systems can suffer from fragmentation over long periods of use. The typical way to address this is through cleaning. The idea behind cleaning is to select candidate containers, copy the live blocks forward into a new container, and then free the selected container that has been cleaned. This will open a new available container slot in the file system for a future container write. Up to this point, we have just assumed that there will always be an available free container slot in the file system by over-provisioning the file backing the file system. Although we have implemented a cleaning mechanism, we have not fully implemented garbage collection (GC). We shall do so now. To complete this phase you must satisfy the following requirements.

Requirement 1: Candidate selection is an important part of the cleaning procedure. If the system selects candidates that have too much live data, then copying forward will take more time. Also, since cleaning might be required to free space in a timely manner in order to write a ready container, spending too much time selecting candidates will affect the performance of the system. To satisfy this requirement you should consider this trade-off and implement a candidate selection algorithm that logically makes sense for the system. You should include a discussion of this decision in your project writeup.

Requirement 2: Choosing when to run garbage collection is another important design decision. Should it be run constantly?
Should it only run when the system is idle? Should it be run whenever a free container slot is needed but unavailable? Or should it be based upon the level of fragmentation of the file system? To satisfy this requirement, you should consider the timing choices and choose an algorithm that best fits the goals of your system. You should also include a discussion of this decision in your project writeup.

Requirement 3: Implement an interface that can be called from the upper layer to control garbage collection. This includes integrating GC with the upper layers of the software stack and evaluating the impact of GC on the performance of your system. You should include a discussion of the GC integration, as well as results from an evaluation of the performance impact on the system introduced by GC. This might include, for example, a study of the common-case performance vs. worst-case performance.

Phase E2: Extra Credit – Variable-sized Chunking

One of the early, simplifying design decisions we made was to choose fixed-sized chunking (chunks are the same as segments) instead of variable-sized (content-defined) chunking. It has been shown [3] that variable-sized chunking can increase the deduplication ratio (better compression) substantially. In fact, in some cases fixed-sized chunking fails to deduplicate very well at all. The goal of this system extension is to implement variable-sized chunking alongside fixed-sized chunking in your system. This will allow you to evaluate the impact of content-defined chunking vs. fixed-sized chunking. To complete this phase you must satisfy the following requirements.

Requirement 1: It is not obvious how to correctly define chunk boundaries when you move away from fixed-sized chunking. The typical way is to use a form of Rabin fingerprinting [5] (rolling hashing) to deterministically identify chunk boundaries based on the data content. You should also try to set the average chunk size by specifying a minimum and maximum size for each chunk.
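The boundary-detection idea with min/max limits can be sketched as below. Important hedge: the simple byte-wise running hash here is only a stand-in for true Rabin fingerprinting (which uses a fixed sliding window), and the window-free hash, mask, and size parameters are all illustrative.

```c
/* Sketch of content-defined chunk boundaries with min/max limits.
 * A simple running hash stands in for Rabin fingerprinting; all
 * parameters (mask, sizes) are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_CHUNK 2048
#define MAX_CHUNK 16384
#define BOUNDARY_MASK 0x0FFF /* boundary ~1 in 4096 bytes on average */

/* Return the length of the next chunk starting at buf[0..len).
 * A position is a boundary when the hash's low bits are all zero,
 * subject to the min/max chunk-size limits. */
static size_t next_chunk_len(const unsigned char *buf, size_t len) {
    uint32_t h = 0;
    if (len <= MIN_CHUNK)
        return len;                  /* short tail becomes one chunk */
    for (size_t i = 0; i < len && i < MAX_CHUNK; i++) {
        h = h * 31 + buf[i];         /* stand-in for a rolling hash */
        if (i >= MIN_CHUNK && (h & BOUNDARY_MASK) == 0)
            return i + 1;            /* content-defined boundary */
    }
    return len < MAX_CHUNK ? len : MAX_CHUNK; /* forced cut at max */
}
```

Because boundaries depend on content rather than position, inserting a byte near the start of a file shifts at most a chunk or two, instead of invalidating every fixed-size segment after the insertion point.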
Of course, there will be numerous corner cases that must be addressed, for example, end-of-file handling, small-file handling, zero-filled chunks, etc. To satisfy this requirement, you should consider the trade-offs among rolling fingerprint choices (which fingerprinting function to use) and choose one that best matches your system goals. As part of your write-up, you should discuss and justify your choice of chunking algorithm.

Requirement 2: Although you could just replace your existing fixed-sized chunking algorithm with the new variable-sized chunking algorithm, you would not be able to compare them side by side in a systematic fashion. So, you should implement the new algorithm as a feature of the system in such a way that a user could choose either fixed-sized chunking or variable-sized chunking during file system initialization. This will allow us to compare one against the other. To satisfy this requirement, you must implement the new algorithm and build in a way to optionally select one or the other. As part of your write-up, you should discuss the implementation details of your variable-sized chunking algorithm, and evaluate how well it works in practice. How well does it handle shifts in data? When does it perform better (in terms of deduplication) than fixed-sized chunking? When does it do worse? How does variable chunking affect the throughput performance of the system?

Phase E3: Extra Credit – Deduplicated File Recipes

A second simplifying design decision was to treat file recipes specially and to store them in an uncompressed, contiguous format in containers. For a system that has a small amount of data or a small number of files, this may be fine. But as the number of files grows large, and/or the amount of data stored in the system scales, file recipes will consume a non-trivial portion of the available storage. Therefore, one interesting extension is to store them in an efficient manner.
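One efficient representation, which the requirements below develop, organizes a recipe's segment fingerprints into a hash tree so that identical files (and shared subtrees) collapse to shared nodes. The core root computation can be sketched as follows; the 64-bit FNV-style combine step is a stand-in for a real cryptographic hash, and the in-place array layout is an illustrative simplification.

```c
/* Sketch: compute a hash-tree root over a file's segment
 * fingerprints by pairing adjacent hashes and re-hashing until one
 * root remains. The FNV-1a combine is a stand-in for a real
 * cryptographic hash; identical fingerprint lists yield identical
 * roots, enabling recipe sharing. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Combine two child hashes into a parent hash (FNV-1a over bytes). */
static uint64_t combine(uint64_t a, uint64_t b) {
    unsigned char buf[16];
    memcpy(buf, &a, 8);
    memcpy(buf + 8, &b, 8);
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < 16; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Compute the tree root of n leaf fingerprints (n >= 1), working
 * in place level by level; an odd node is promoted unchanged. */
static uint64_t merkle_root(uint64_t *leaves, size_t n) {
    while (n > 1) {
        size_t out = 0;
        for (size_t i = 0; i + 1 < n; i += 2)
            leaves[out++] = combine(leaves[i], leaves[i + 1]);
        if (n & 1)
            leaves[out++] = leaves[n - 1]; /* promote odd leaf */
        n = out;
    }
    return leaves[0];
}
```

Two files with the same segments get the same root, and files sharing a prefix of segments share the corresponding subtree nodes, which is where the recipe-storage savings come from.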
Since we already have a deduplication storage system available to us, it seems that we should be able to utilize that same system to store our file recipes. As you will likely discover, this can complicate things. In fact, the common way to handle this has been to utilize a data structure called a Merkle Tree [6]. Now, instead of having an independent list of hashes for each file, we can have independent root nodes for each file, and have duplicates share portions/nodes/subtrees of their Merkle Trees. To complete this phase you must satisfy the following requirements.

Requirement 1: Start by implementing and testing a Merkle Tree data structure. You should be able to create, update, and destroy these trees. This code will be utilized by the system to implement the file recipes. As part of your write-up you should include a description of your Merkle Tree implementation and interface. How will the system use the tree?

Requirement 2: Now that you have a working Merkle Tree, implement file recipes using it. Once again, the system should provide an option to choose between Merkle Trees and simple file recipes during file system initialization. As part of your write-up, you should discuss the implementation details of your Merkle Tree based file recipes, and evaluate how well they work in practice. How much reduction in storage used (better deduplication) do Merkle Tree based file recipes provide? How do Merkle Tree based file recipes affect the throughput performance of the system?

Phase E4: Extra Credit – Independent Proposals

You may have thought of some interesting extensions on your own. If so, please propose them to me so that we can discuss them. Your proposal should include a brief description of the feature you wish to implement and the reasons why this feature would be good to add to the system. Your proposal should also include some notion of how feasible it would be to implement in the system given the time until the project is due.
Finally, your proposal should include how, if possible, you will attempt to evaluate the effectiveness of your proposed extension. If you are unsure about your idea and would like to get preliminary feedback from me, please feel free to email me about it or talk to me after class.

Project Submission Guidelines

Minimally, the following items should be submitted as part of this project:

1) Project writeup document in PDF format.

2) Source code for your deduplication file system. This code should be compilable on a typical recent x64-based Linux system. The code should include a Makefile and a README file. The Makefile should build the whole system. The README file should describe how to build the system, what external libraries (if any) are needed to build it, how to initialize the file system, and how to run it.

References

[1] Quinlan, S. and Dorward, S., Venti: A New Approach to Archival Data Storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). Berkeley, CA, 2002.
[2] Rosenblum, M. and Ousterhout, J. K., The Design and Implementation of a Log-Structured File System. ACM Trans. Comput. Syst. 10, 1 (February 1992), 26-52.
[3] Zhu, B., Li, K., and Patterson, H., Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST '08). Berkeley, CA, 2008.
[4] FUSE, Filesystem in Userspace, http://fuse.sourceforge.net/, 2013.
[5] Rabin Fingerprint, Wikipedia, http://en.wikipedia.org/wiki/Rabin_fingerprint, 2013.
[6] Merkle Tree, Wikipedia, http://en.wikipedia.org/wiki/Merkle_tree, 2013.