NFS meets Data Bases
Alan Halverson
[email protected]
Babis Samios
[email protected]
Computer Sciences Department
University of Wisconsin, Madison
Abstract
This work develops a UNIX file system API that
talks to a relational database in the backend. All
file system data and meta-data are stored in the
database. The API is used as the intermediate
layer between the database and a user-level NFS
server. As a result, the database file system is
accessible to any NFS client in a manner
transparent to the client. Due to the client-server
model employed by the DBMS, the system
provides a flexible 3-tier architecture. Our results
show that a reasonable implementation of such a
system is possible. The most crucial factor leading
to viable performance is the use of server-side
caching between the database and the NFS server.
Also important are the choice of the database
schema and the combination of the chunk size used
for storing the actual data in the database and the
message size used by the NFS client to transmit
the data.
1. Introduction
Most file systems guarantee only limited
consistency of user data and do not provide solid
data semantics. On the other hand, database
research has focused on both these issues during
the past decades.
Major DBMSs provide
mechanisms to ensure data consistency and
efficient crash recovery, such as transactional
semantics and write-ahead logging. They also
offer a very wide range of both simple and
elaborate data primitives, allowing the user to
conveniently describe, store and query almost any
kind of data-set.
In this work we take advantage of available
database technology in order to build a file system
that is more reliable than conventional file systems.
This idea is not new. The Inversion File System
\cite{inv93} introduced by Olson in 1993, was
built on top of the Postgres database, providing the
above mentioned features, along with a few others
like fine grained time travel and user-defined types
and functions for files. The main drawback of that
system was its performance relative to a UNIX file
system (ULTRIX 4.2 NFS). Ten years after the
Inversion File System was introduced, we decided
to revisit some of the performance issues raised in
the 1993 paper, in the presence of both modern
software and hardware.
One of our primary design goals was to provide
transparent access to the database file system. We
decided that a highly portable way to do that would
be to hide our system behind an NFS server. This
ensures almost global availability since the NFS
client is a built-in component of most UNIX-like
operating systems.
A final design goal of the system was to provide
maximum flexibility in terms of the physical
location at which the various system components
reside. By using NFS we ensure that the end user
can connect remotely to the NFS server. In order
to provide flexibility on the other end of our
system as well, i.e. the connection between the
NFS server and the database, we chose to use a
client-server DBMS.
Our initial results indicated that the system was
more than an order of magnitude slower than a
system where the same NFS server was used on
top of the native UNIX file system. We decided to
include a number of additional mechanisms in our
system to improve performance. The mechanism
that proved most effective was a caching layer
located between the NFS server and the database.
More specifically, the caching of file meta-data
and naming entries at that level proved extremely
beneficial. In the presence of server-side caching,
the I/O performance of our system was of the same
order of magnitude as that of NFS on top of the
native file system, proving our system viable with
respect to performance.
2. Related Work
To our knowledge, the only attempt to implement a
file system on top of a database is that of the
Inversion file system. Inversion is built on top of
Postgres. It provides transactions and fine-grained
time travel as well as fast recovery. Other features
of the system are typing of user files and powerful
query support on system data and meta-data.
Inversion provides a set of interface routines to
create, open, close, read, write and seek on files.
The tables in the database schema used are as
follows:
- The system naming information that describes
  the structure of the directory hierarchy is
  stored in a table called naming.
- File meta-data are stored in the fileatt table.
- For every file in the system there is a separate
  table, called inv<inumber>, where inumber is
  the (unique) i-node number of the file. It is in
  those tables that the actual data are stored,
  chopped into tuples of 8KB.
What Inversion does not provide is transparency.
It requires programmers to link a special library in
order to access Inversion file data. Our main goal
was to be able to provide the users with a fully
functional file system on top of a database in a
totally transparent way. This meant that all the file
system API routines had to be re-implemented and
not only a small subset of them as was done for the
Inversion file system.
3. Infrastructure
There are three main components in the system:
the Data Base Management System (DBMS), the
NFS server, and the layer that bridges the first
two. The latter is the main part of our
implementation and we will refer to it in the
remainder of the paper as the Database File
System (DBFS). We will now examine each one of
those components in more detail.
3.1 DBMS
In choosing the DBMS to use we had a wide
variety of options. Since one of the main goals of
this work was to revisit the Inversion file system
results, we decided to use Postgres \cite{Postgres}.
Postgres is a free, open-source database system. It
provides many useful features, but we will only
refer to those important in the context of our work.
Postgres employs the client-server model, which
allows the realization of the 3-tier architecture.
In brief, the way the client/server model is
implemented in Postgres is the following: A server
process that manages the database files accepts
connections from database clients and performs
database operations on behalf of the clients. The
communication between the server and the client
goes through a socket so that both local and remote
(over the network) connections can be established.
When the client and the server are in different
hosts they communicate over a TCP/IP network
connection.
The server can accept multiple
connections from clients and uses the process per
connection model to do so.
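To make the client/server interaction concrete, the following is a minimal sketch (our illustration, not taken from the DBFS sources) of how a client such as our file-system layer might open a connection to the Postgres server using the libpq client library. The connection-string values are placeholders.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int main(void)
{
    /* Hypothetical connection parameters; the server may be local or
     * remote, in which case libpq talks to it over a TCP/IP socket. */
    PGconn *conn = PQconnectdb("host=dbhost port=5432 dbname=dbfs user=dbfs");

    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return EXIT_FAILURE;
    }

    /* Each such connection is served by a dedicated backend process
     * (process-per-connection model). */
    PGresult *res = PQexec(conn, "SELECT count(*) FROM fileatt;");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("files in the system: %s\n", PQgetvalue(res, 0, 0));
    PQclear(res);

    PQfinish(conn);
    return EXIT_SUCCESS;
}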
One characteristic of Postgres is that performance
is expected to deteriorate gradually under write
intensive workloads (i.e. workloads that include
many inserts, deletes and updates). The reason for
the performance degradation is the fragmentation
resulting from tuples that were made obsolete after
either a delete or an update. Under a typical file
system workload, where creation, deletion and
update of files are very common and frequent, this
could result in a significant performance penalty
over time.
A way to alleviate the impact of deletes and
updates in Postgres is to use the VACUUM SQL
command. This operation removes expired rows,
reclaims unused space and moves tuples across
blocks in an attempt to compact tables to the
minimum number of disk blocks. We will further
investigate the effect of VACUUM in section
\ref{sec:vacuum}.
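For illustration, issuing the vacuum from the DBFS layer could look like the following hedged sketch (assuming a PGconn *conn opened as in the previous sketch; "VACUUM;" processes every table in the database, and a single table name may be supplied instead).

#include <stdio.h>
#include <libpq-fe.h>

/* Reclaim the space left behind by deleted and updated tuples. */
static void vacuum_database(PGconn *conn)
{
    PGresult *res = PQexec(conn, "VACUUM;");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "VACUUM failed: %s", PQerrorMessage(conn));
    PQclear(res);
}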
In order to store the data for an arbitrary file in
Postgres we had to use the binary type that
Postgres provides for the table attributes.
Although Postgres is able to store binary data in a
database table, it does not accept binary data as
part of a query. Instead, it provides an escaping
routine to turn the data into ASCII before including
it in the query. Escaping the data has a
considerable overhead, since it can potentially
increase the size of the data transferred from the
NFS server to the database by a factor of four, thus
hurting performance significantly.
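As an illustration of that cost, the sketch below builds an INSERT statement for one chunk using libpq's bytea escaping routine. This is our own example, not DBFS code; the exact escaping and memory-release routines (PQescapeBytea, PQfreemem) may differ slightly across Postgres versions, and the allfiles table used here is described later in section 3.3.1. In the worst case, every input byte expands into a four-character escape such as \000.

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

/* Build an INSERT statement for one chunk of raw file data. The binary
 * chunk must first be escaped into an ASCII form that is legal inside an
 * SQL string literal; in the worst case (e.g. a chunk full of zero bytes)
 * each byte expands to a 4-character escape. */
static char *make_insert_query(const unsigned char *chunk, size_t len,
                               int inode, int chunk_id)
{
    size_t esc_len;
    unsigned char *escaped = PQescapeBytea(chunk, len, &esc_len);

    char *query = malloc(esc_len + 128);
    sprintf(query,
            "INSERT INTO allfiles (inode, chunk_id, chunk_data) "
            "VALUES (%d, %d, '%s');",
            inode, chunk_id, escaped);

    PQfreemem(escaped);   /* release the buffer allocated by libpq */
    return query;         /* caller frees after passing it to PQexec */
}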
3.2 NFS Server
The main issue in choosing which NFS server to
use was the choice between a kernel or a user
space implementation. The kernel option had the
advantage that such an implementation is included
in the Linux kernel and supports NFS v3. The
drawback was that, as with any kernel code, it is
difficult to debug. The main reason we did not
choose such an implementation, though, was that
the user library provided by Postgres could only be
linked from user space.
Consequently we chose to use a user space NFS
server, and specifically UNFSD \cite{UNFSD}.
The downside of that was that UNFSD only
supports NFS v2. A feature of UNFSD that proved
extremely valuable to us was that it provided an
easy and straightforward way to re-link the
application with a different implementation of the
file system API. This was done just by changing a
header file.
3.3 DBFS
The choice of a client/server DBMS, and the use of
NFS as the bridge between the end user of the file
system and the actual file system stored in the
database, allows the 3-tier architecture shown in
Figure 1.
[Figure: NFS Clients -> UNFSD Server / DBFS / Postgres Client -> Postgres Servers]
Figure 1: 3-tier Architecture
The main part of our work focused on the DBFS
layer. We first had to choose a convenient database
schema for our purposes. Given the schema of our
choice, we then had to implement a version of the
file system API routines that communicates with
the database and issues the required queries to
accomplish the desired functions. After the
routines were implemented, the system was fully
functional. The final step was to add a number of
optimizations to enhance performance. In the
following we address all of the above issues in
detail.
3.3.1 Database Schema
The first schema we used was very similar to the
one used in the Inversion file system, as described
in section \ref{sec:Inversion}. We call this the
inversion schema. The exact format of the naming
and fileatt tables, representing the directory
hierarchy and the file meta-data respectively, is the
following:
fileatt
  inode         int        (primary key)
  uid           int
  gid           int
  mode          int
  nlinks        int
  size          bigint
  ctime         timestamp
  mtime         timestamp
  atime         timestamp

naming
  filename      varchar    (primary key, with parent_inode)
  parent_inode  int        (primary key, with filename)
  inode         int

In the above tables the attributes marked form the
primary keys. The inode and parent_inode
fields in the naming table are both foreign keys,
referencing the inode field of the fileatt table.
The fileatt table entries are identical to those of a
stat entry in a UNIX file system.
The actual file data are stored in a different table
for each file. The name of the table for a file with
inode number inumber will be inv<inumber>,
exactly as in the Inversion file system. The format
of such a table is the following:

inv<inumber>
  chunk_id    int
  chunk_data  bytea
The file data are chopped up into chunks, and for
each new chunk a new tuple is added to the table.
We made the chunk size tunable in order to get a
sense of the effect it has on performance. The type
of the chunk_data field is bytea, which
indicates binary data.
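For concreteness, the inversion schema could be created with DDL along the following lines. This is a hedged reconstruction from the table layouts above, not the exact statements used by DBFS; the primary-key and foreign-key declarations reflect our reading of the text, and the statements would be issued through PQexec.

/* A hedged reconstruction of the inversion schema DDL, derived from the
 * table layouts described above. */
static const char *inversion_schema_ddl[] = {
    "CREATE TABLE fileatt ("
    "  inode   int PRIMARY KEY,"
    "  uid     int,"
    "  gid     int,"
    "  mode    int,"
    "  nlinks  int,"
    "  size    bigint,"
    "  ctime   timestamp,"
    "  mtime   timestamp,"
    "  atime   timestamp);",

    "CREATE TABLE naming ("
    "  filename      varchar,"
    "  parent_inode  int REFERENCES fileatt(inode),"
    "  inode         int REFERENCES fileatt(inode),"
    "  PRIMARY KEY (filename, parent_inode));",

    /* One data table per file, created when the file is created;
     * <inumber> is replaced by the file's inode number at run time. */
    "CREATE TABLE inv<inumber> ("
    "  chunk_id    int PRIMARY KEY,"
    "  chunk_data  bytea);"
};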
One of our concerns with regard to the inversion
schema was that of the table creation and deletion
overheads. Previous studies \cite{Baker?} have
shown that the average life time of a file in a
typical file system workload is very small. This
implies that under the inversion schema tables
would be created and dropped very frequently.
Since these operations have a high cost,
performance could be degraded due to the design
choice of having one table per file.
We performed experiments to validate the above
argument and found that, as we expected, the
penalty for the table creations and deletions was
indeed high (see section \ref{sec:schema
comparison} for the exact numbers). As a result
we decided to try a different approach concerning
the part of the schema relevant to actual file data.
Specifically, we chose to experiment with the other
end of the spectrum regarding the number of tables
used for file data, that being a single common table
for all the files in the system.
The second schema, which we will refer to as the
single table schema, uses the naming and fileatt
tables in exactly the same way as the inversion
schema. The actual file data are stored in a single
table for all files. We call this table allfiles and its
format is shown below:

allfiles
  inode       int
  chunk_id    int
  chunk_data  bytea
An extra field is needed, relative to the table
format used in the inversion schema for file data, to
indicate which chunks belong to which file. Using
this kind of schema there are no table creations and
deletions, so the overheads of these operations are
avoided. On the other hand, writing to and reading
from that table could prove to be slow when it
grows in size. We address that issue in section
\ref{sec:schema comparison}.
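A hedged sketch of the corresponding DDL, again our reconstruction rather than the exact statement used:

/* Single table schema: all chunks of all files live in one table,
 * keyed by (inode, chunk_id). */
static const char *allfiles_ddl =
    "CREATE TABLE allfiles ("
    "  inode       int REFERENCES fileatt(inode),"
    "  chunk_id    int,"
    "  chunk_data  bytea,"
    "  PRIMARY KEY (inode, chunk_id));";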
3.3.2 New File System API
Once the schema was set up in the database, we
had to implement an appropriate API that would
translate the requests received by the NFS server to
the corresponding database operations. All the
implementation was done in C. The total code
written for all the routines was a little more than
2000 lines of code. Two versions of that code
were developed for the two different database
schemas used, but the differences between them
were small.
Our implementation is independent from NFS in
the most part. Since it is a full file system API, it
could potentially be used by any application
assuming the standard UNIX file system API.
There are some points, though, where the presence
of the NFS server affected our implementation and
allowed us to make certain simplifications. For
example, we were not required to keep file offsets
across multiple read or write requests. The reason
is that, due to the statelessness of the NFS server,
at every new read or write request issued by the
end user the NFS client first opens the file, setting
the offset to the start of the file, then does an lseek
to the appropriate offset, and then does the actual
read or write operations.
As an illustrative example of our implementation
we outline the implementation of the write routine
in figure 2.
ssize_t write(int fd, void *buf, size_t n)
{
    start_chunk = fd->offset / CHUNKSIZE
    end_chunk   = (fd->offset + n - 1) / CHUNKSIZE
    produce and escape chunk_data
    BEGIN XACT
    for (chunk = start_chunk to end_chunk)
    {
        SELECT the chunk from <file_table>
        if (found) UPDATE in <file_table>
        else       INSERT in <file_table>
    }
    UPDATE file metadata (size, ctime, mtime) in fileatt
    UPDATE parent dir metadata (atime) in fileatt
    COMMIT XACT
    return number of bytes written
}
Figure 2: write pseudo-code
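The per-chunk SELECT followed by UPDATE or INSERT at the heart of the loop might look roughly as follows with libpq and the single table schema. This is a sketch under our assumptions, not the DBFS source; the function name is ours, the chunk data is assumed to be already escaped, and partial-chunk merging plus the surrounding BEGIN/COMMIT (see Figure 2) are left to the caller.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/* Write one chunk: check whether it already exists, then UPDATE or INSERT. */
static int write_chunk(PGconn *conn, int inode, int chunk_id,
                       const char *escaped)
{
    char *query = malloc(strlen(escaped) + 256);
    PGresult *res;
    int ok;

    /* Round trip 1: does the chunk already exist? */
    sprintf(query, "SELECT chunk_id FROM allfiles "
                   "WHERE inode = %d AND chunk_id = %d;", inode, chunk_id);
    res = PQexec(conn, query);
    ok = (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0);
    PQclear(res);

    /* Round trip 2: UPDATE the existing tuple or INSERT a new one. */
    if (ok)
        sprintf(query, "UPDATE allfiles SET chunk_data = '%s' "
                       "WHERE inode = %d AND chunk_id = %d;",
                escaped, inode, chunk_id);
    else
        sprintf(query, "INSERT INTO allfiles (inode, chunk_id, chunk_data) "
                       "VALUES (%d, %d, '%s');", inode, chunk_id, escaped);

    res = PQexec(conn, query);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    free(query);
    return ok ? 0 : -1;
}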
There are several interesting points regarding the
write routine. As already mentioned, Postgres does
not accept binary data as part of an SQL query.
This is why actual file data had to be escaped to
ASCII before being sent over to the database. The
escaping was done using the escaping method that
Postgres provides exactly for that purpose.
Another point is that we cannot simply write a
chunk to the database without checking whether it
already exists or not. There are multiple reasons for
this. The first reason is that if the chunk already
exists we have to issue an UPDATE command to
the appropriate file table, whereas if it doesn't exist
the command that should be issued is INSERT. We
could overcome this problem without fetching the
actual chunk data, but just by polling the database
to determine whether the chunk exists or not.
There are many cases, however, where fetching the
actual chunk data, if found in the database, is
necessary. These are the cases where only a part of
a certain chunk needs to be written. In those cases,
the old chunk data is retrieved, the part that has to
be changed is substituted with the new data, and
the updated chunk is sent back to the database.
In addition to the actual data reads and writes, two
more database updates must be issued to keep the
file system meta-data up to date. Specifically, the
size of the file might need to be modified if the
write caused the file to grow, and the ctime and
mtime of the appropriate tuple in the fileatt table
must be set to the time that the write occurred.
Finally, the atime field of the tuple corresponding
to the parent directory of the file that was written
must also be set to the time the write happened.
A final point to make is that all the modifications
to the database (and, effectively, to the file system)
which are required for the write request are issued
inside a transaction block. This guarantees all-or-
nothing semantics for all the actions described
above. This is different from a conventional file
system where if, for example, a crash were to occur
after the writing of the actual data but before the
update of the system meta-data, the file system
would end up in an inconsistent state.
3.3.3 Optimizations for Performance
With all file system API routines in place we had a
fully functional file system accessible via NFS.
Initial comparisons with the native UNIX file
system, also accessed via NFS, indicated very poor
performance for our system. As an example,
reading and writing were slower by an order of
magnitude. This did not satisfy our initial
performance goal, which was to stay within the
same order of magnitude as NFS over the native
file system for basic file system operations.
Consequently, we applied several optimizations to
our system to make it faster.
There were two main directions in which we
worked to improve system performance. The most
important was to avoid as many round-trips to the
database as possible, since that was clearly the
bottleneck of the system. The second was to find
a more efficient way to deal with the ASCII data
requirement of Postgres than the very costly native
Postgres escaping method.
Towards the first goal, the main optimization we
employed was the use of server-side caching. The
client side buffer cache was already known to
improve performance so we decided to follow a
similar approach on the DBFS layer, i.e. between
the NFS server and the database. We observed that
a very high percentage of the requests issued to the
database were directed to the file system metadata.
These requests were mainly generated when the
server received requests with full path inputs that
had to be resolved. A breakdown of the number
of calls for each routine in our system indicated
that the most heavily used routine was the one that
performed path resolution. We call this routine
inode_of_rightmost and outline it in figure 3.
int inode_of_rightmost(const char *path)
{
    pinode = root_inode
    while (more elements in path)
    {
        cur = name of next element
        SELECT mode FROM fileatt WHERE
            inode = pinode
        if (mode != dir) return -1
        SELECT inode FROM naming WHERE
            filename = cur AND
            parent_inode = pinode
        pinode = inode
    }
    return pinode
}
Figure 3: inode_of_rightmost pseudo-code
The above routine is called by most of the file
system API routines, including stat. It receives a
full path as an input and returns the inode of the
rightmost element in the path. It breaks the path
into its components and traverses the directory
hierarchy represented in the naming table to
retrieve the inode number of one element at a time.
For each element but the rightmost it queries the
fileatt table to make sure the entry is a directory.
The number of round-trips to the database is twice
the number of elements in the input path.
Given the heavy usage of the fileatt and naming
tables resulting mainly from the numerous calls to
inode_of_rightmost, it was evident that the
presence of memory caches to hold entries from
those two tables was bound to significantly cut
down on database round-trips. The two caches
were implemented as infinitely growing
hashtables. No bounds on size were applied due to
the very small size of such entries (approximately
40 bytes each). In the rest of the paper we will
refer to those two caches as the stat cache and the
naming cache.
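A minimal sketch of what such a cache might look like (our illustration; the actual DBFS structures are not shown in the paper): a chained hash table mapping an inode number to its cached fileatt row, consulted before any query is sent to the database.

#include <stdlib.h>
#include <time.h>

/* One cached fileatt row (roughly the fields of a stat entry). */
struct stat_entry {
    int    inode, uid, gid, mode, nlinks;
    long   size;
    time_t ctime, mtime, atime;
    struct stat_entry *next;           /* hash chain */
};

#define STAT_BUCKETS 4096
static struct stat_entry *stat_cache[STAT_BUCKETS];

static struct stat_entry *stat_cache_lookup(int inode)
{
    struct stat_entry *e;
    for (e = stat_cache[inode % STAT_BUCKETS]; e != NULL; e = e->next)
        if (e->inode == inode)
            return e;                   /* hit: no database round trip */
    return NULL;                        /* miss: caller queries fileatt */
}

static void stat_cache_insert(const struct stat_entry *row)
{
    struct stat_entry *e = malloc(sizeof *e);
    *e = *row;
    e->next = stat_cache[row->inode % STAT_BUCKETS];
    stat_cache[row->inode % STAT_BUCKETS] = e;
}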
Next, we decided to employ a third cache to hold
chunks of actual data, aiming to cut down on
round-trips to the database resulting from read and
write operations. This cache was implemented as a
pair of linked lists, one to keep the buffered chunks
and the other to keep track of the free buffers. In
contrast to the other two caches, the size of the
buffer cache had to be restricted due to the large
size of its entries (the typical buffer size is 8KB).
We made the size tunable and the replacement
policy used is LRU. The effect of all three caches
on performance is investigated in section
\ref{sec:caches}.
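For illustration, the buffer cache could be organized along these lines; this is a sketch of the two-list design described above, not the actual code, and CHUNKSIZE and the helper names are assumptions.

#include <stddef.h>

#define CHUNKSIZE 8192                  /* typical buffer size */

/* One buffered chunk of file data. */
struct buf_entry {
    int  inode, chunk_id;
    char data[CHUNKSIZE];
    struct buf_entry *next;
};

static struct buf_entry *buffers;       /* in-use chunks, most recent first */
static struct buf_entry *free_list;     /* preallocated, currently unused   */

/* Look a chunk up; on a hit, move it to the head so that the list tail
 * always holds the least recently used entry (LRU replacement). */
static struct buf_entry *buf_lookup(int inode, int chunk_id)
{
    struct buf_entry **pp, *e;
    for (pp = &buffers; (e = *pp) != NULL; pp = &e->next) {
        if (e->inode == inode && e->chunk_id == chunk_id) {
            *pp = e->next;              /* unlink ...        */
            e->next = buffers;          /* ... and re-insert */
            buffers = e;
            return e;
        }
    }
    return NULL;                        /* miss: fetch chunk from database,
                                           filling a buffer from free_list */
}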
Our final concern was to alleviate the degradation
in performance resulting from the inflation of the
data transferred to the database on write operations
due to the necessary conversion of binary data to
ASCII. In order to reduce the overhead, instead of
using the native Postgres escaping function, we
employed a base64 encoding mechanism. Using
this type of encoding the size of data increases
only by a factor of 4/3 instead of a worst case of 4
that can occur with the escape mechanism. We
further investigate the impact of the encoding on
performance in section \ref{sec:base64}.
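A minimal base64 encoder sketch (ours, not the DBFS routine) makes the 4/3 expansion visible: every 3 input bytes become exactly 4 output characters, with '=' padding at the end.

#include <stdlib.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode n bytes of binary data; output length is 4 * ceil(n / 3),
 * i.e. roughly a 4/3 expansion instead of the up-to-4x of escaping. */
static char *base64_encode(const unsigned char *in, size_t n)
{
    size_t i, j = 0;
    char *out = malloc(4 * ((n + 2) / 3) + 1);

    for (i = 0; i + 2 < n; i += 3) {
        out[j++] = b64[in[i] >> 2];
        out[j++] = b64[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
        out[j++] = b64[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)];
        out[j++] = b64[in[i + 2] & 0x3f];
    }
    if (i < n) {                        /* 1 or 2 trailing bytes: pad with '=' */
        out[j++] = b64[in[i] >> 2];
        if (i + 1 < n) {
            out[j++] = b64[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)];
            out[j++] = b64[(in[i + 1] & 0x0f) << 2];
        } else {
            out[j++] = b64[(in[i] & 0x03) << 4];
            out[j++] = '=';
        }
        out[j++] = '=';
    }
    out[j] = '\0';
    return out;                         /* caller frees */
}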
4. Experiments
This section presents experimental results which
validate the viability of our approach.
All experiments were executed on a dual 1 GHz
PIII PC running RedHat Linux 8.0. We booted the
machine using the uniprocessor 2.4.18-14 Linux
kernel to remove any side effects of the two
processors. The machine has 1GB of SDRAM,
and we utilized a single 30GB 7200 RPM IDE
hard drive.
All tests were run with local NFS mounts.
Although this causes some CPU contention
between the NFS client and server, it does serve to
remove the physical network between two
machines as a variable.
For our first experiment, we desired to understand
how the choice of the database schema would
affect performance. For each schema, we wrote 50
and 500 files. For each quantity of files, we varied
the size of each file, using 8KB, 80KB, and
800KB. For each combination of file quantity and
size, we measured the total time to write the files,
read a random block from each file, and delete all
of the files. Figure 1 shows how the total
performance of the file system varies based on the
database schema.
Of course, this is not a complete analysis of the
effects the schema has on performance, but we can
draw a couple of conclusions here. The first is
relatively obvious from the graph. As the size of
files increases, it becomes more and more costly to
use a single table to store the data for all files. You
see this trend for both 50 and 500 files here,
and clearly the penalty is large once each of the
files is 800KB in size. However, the vast majority
of files in a typical filesystem are still small. The
cost of table creation is quite high relative to the
cost to write the data of a small file – see the data
for the 50 files/8KB each case in Figure 1 as an
example.
We could get the best of both worlds by adopting a
hybrid approach. We could use a single table to
store all files less than a certain size, but for any
file which exceeds that threshold we create a
dedicated table for it. We would incur some cost
when a write() operation causes a file to cross this
threshold, but on the whole this would give us the
best performance.
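A sketch of how such a hybrid policy might be expressed (purely illustrative; this was not implemented, and the threshold value and helper name are assumptions):

#include <stdio.h>

/* Hypothetical size threshold above which a file gets its own table. */
#define HYBRID_THRESHOLD (64 * 1024)

/* Return the name of the table holding the data chunks of a file.
 * Small files share the allfiles table; large files get inv<inumber>. */
static const char *chunk_table_for(long file_size, int inode,
                                   char *buf, size_t buflen)
{
    if (file_size <= HYBRID_THRESHOLD)
        return "allfiles";
    snprintf(buf, buflen, "inv%d", inode);
    return buf;
}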
For the remainder of our experiments, we have
chosen the one table schema. The largest file
written by any of the experiments is 3.6 MB, but
when this occurs only one such file exists, keeping
the total number of chunks under 500.
Figure 1 - The effects of database schema on
performance. When the size of each file is small, the
best performance is achieved by using a single table
to store all chunks. As the size of each file increases,
the cost of writing chunks begins to dominate the one
table schema, and the many table schema’s
performance is preferred.
As mentioned in an earlier section, we noticed
performance problems related to large numbers of
file creations and deletions. To quantify the extent
of this problem, we present the results in Figure 2.
For this experiment, we measured the wall time
necessary to copy a 3.6MB file into our filesystem,
20 times in a sequence.
After each copy
completes, the file is deleted; the time to delete the
file is not included in the measurement, however.
We performed this test twice, once issuing the
PostgreSQL command “VACUUM ALL;”
between each copy, and again without the vacuum
command.
Figure 2 - Effectiveness of the PostgreSQL
VACUUM ALL command. Regular usage of the
VACUUM command allows consistent performance
results, while neglecting to use the command results
in steadily decreasing performance results.
Our measurements clearly indicate that utilizing
the VACUUM command frequently provides
steady performance. Without using VACUUM,
however, we see that the time it takes to copy a file
into the system has almost doubled by the 20th
copy event in the sequence. We consider this
result to be an artifact of PostgreSQL's
no-overwrite storage policy. As data is
written to and deleted from the database,
subsequent write performance is hindered by its
maintenance of disk blocks which must be
reclaimed later. Due to this result, all subsequent
tests utilize the VACUUM ALL command on the
database just prior to beginning the timing of the
event.
Now that we understand some of the factors that
affect the performance of our system, we can
present some overall performance results. We
have chosen a standard NFS benchmark suite
called Connectathon for this purpose. The basic
tests in Connectathon perform the following
operations:
- File/dir create - create 155 files and 62
  directories, 5 levels deep
- File/dir remove - remove 155 files and 62
  directories, 5 levels deep
- Lookups - 500 getcwd and stat calls
- Get/set attr - 1000 chmods and stats on 10 files
- Read/write - write/read a 1048576 byte file 10
  times
- readdir - read 20500 entries, 200 files
- link/rename - 200 renames and links on 10 files
- symlink/readlink - 400 symlinks and readlinks
  on 10 files
Figure 3 - Connectathon basic tests performance,
normalized to the performance of the native file
system. We show results for each test, and within
each test for each of zero or more caches enabled.
The caches are: Stat Cache (SC), Naming Cache
(NC), and Buffer Cache (BC). For the Lookups test,
the missing bars for SC+NC and BC+SC+NC signify
that our performance was equal to that of the native
file system.
We show results for our system running these tests
in Figure 3. Our performance on the read/write
test is reasonable – just slightly more than 5 times
worse than the native file system for the cases with
SC+NC turned on. On the Lookups test, we
achieve equal performance with the native file
system when the stat cache and naming cache are
enabled. For this test, there is no reason for us to
communicate with the database.
You will notice that there is almost no benefit to
enabling the buffer cache. This is in large part due
to good NFS client caching – most of the time,
clients don't even call the NFS server read method.
We do see some benefit in the symlink/readlink
test. Our implementation of readlink uses the
buffer cache to hold the symlink target path, and
the NFS client does not cache symlinks, so we see
buffer cache hits.
Another factor for read and write performance of
large files is the coincidence of the block size we
use for storing the file (“chunk size”) and the
buffer size utilized by the NFS client for each read
or write request (“write size”). Choosing these
sizes is a tradeoff between the typical file size and
an efficient transfer size. We decided to test the
effects of various combinations of these two
variables. Version 2 of NFS supports a maximum
read and write size of 8KB, and so we limit our
investigation to chunk sizes and write sizes at this
amount or smaller. The results of this experiment
can be found in Figure 4.
Figure 4 - Coincidence of Chunk Size and Write
Size, and their effect on the performance of copying
the 3.6 MB file into our filesystem.
The results of this experiment are interesting in a
few ways. As you might expect, when both the
chunk size and the write size are the full 8KB, we
see the best performance. An extension of this
fact, though, is that for any given write size, the
best performance is achieved when the chunk size
is the same. For example, if the NFS write size is
2KB, we see the best performance for a 2KB
chunk size. Also of note is that the 8KB chunk
size is subpar for all write sizes except the 8KB
write size. All other experiments use the 8KB
chunk size with an 8KB write size.
One shortcoming of the communication protocol
provided by PostgreSQL is a lack of support for
binary updatable cursors. The implication is that
we must send base64 text encoded equivalents of
the binary chunks of data given to us via the NFS
write method. This carries with it two
performance penalties. First, we must spend time
on each write and read encoding and decoding the
binary data to and from ASCII. Secondly, these
encodings require more space than their binary
equivalent, and so we must send more data through
the socket. Further, the data stored in PostgreSQL
is of an equivalently larger size than the binary
data would be.

Table 1 - Overhead of ASCII-only communication
with PostgreSQL

                                               Time (ms)
  Base time to write 8KB of data to PostgreSQL     3.07
  Time to encode 8KB binary to base64            + 0.19
  Time to write additional 8/3 KB of data        + 0.63
  Total time to write base64 encoded data          3.89
In Table 1, we show the measured costs of these
two overheads. The price paid for this method of
writing is an additional 26.7% of the base 8KB
write time. So, for our best case result in figure 4
of writing a 3.6 MB file in 4.334 seconds, 0.914
seconds of this is directly attributable to this
overhead.
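For reference, the 26.7% and 0.914 second figures follow directly from the numbers in Table 1 and Figure 4; the quick check below reproduces them under that reading of the data.

#include <stdio.h>

int main(void)
{
    double base = 3.07, encode = 0.19, extra = 0.63;   /* ms, from Table 1 */
    double overhead = encode + extra;                  /* 0.82 ms per 8KB  */
    double copy_time = 4.334;                          /* s, from Figure 4 */

    /* Overhead relative to the base 8KB write time: ~26.7% */
    printf("overhead fraction of base write: %.1f%%\n",
           100 * overhead / base);

    /* Share of the 3.6 MB copy attributable to the encoding overhead:
     * ~0.914 s out of 4.334 s. */
    printf("attributable time: %.3f s\n",
           copy_time * overhead / (base + overhead));
    return 0;
}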