ECE 569 Database System Engineering
Fall 2004
Yanyong Zhang
www.ece.rutgers.edu/~yyzhang
Course URL: www.ece.rutgers.edu/~yyzhang/fall04
ECE569 Lecture 04-2.1 Fall 2004

Associative access
• The system is not asked to retrieve tuples based on information about their storage location; rather, it has to find all tuples whose attribute values fulfill certain conditions. This is associative access.
• Associative access can be realized by sequential scanning, which happens for complicated queries:
    select R.x, S.y from R, S where R.k = S.f and R.b < 12;
• But for simple selection predicates, sequential scanning is very slow (even for an in-memory database).

Access Path
• The class of algorithms and data structures designed for translating attribute values into TIDs (tuple identifiers), or into other types of internal addresses of tuples having those attribute values, is called access paths.
• Depending on what kind of selection predicate is to be supported, the techniques for associative access vary greatly.

Content addressability techniques
• Primary key access: a tuple of a relation must be retrieved efficiently via the value of its primary (unique) key(s), e.g., key-sequenced files and hashed files.
  - Point query vs. range query
• Secondary key access: a set of tuples is produced.
• Multi-table access: tuple access is often based on relationships between different tuples, e.g., all orders placed by a given customer have to be found.

Associative access path techniques
• Hashing (key transformation): using the primary key value as a parameter to a function, which returns the storage location of the tuple.
• Key comparison: maintaining a dynamic search structure on the set of values in the key attribute.
  These values can be organized into tables, lists, trees, and so on, e.g., a B+ tree.

Operations on files (heap files)
• Assumptions
  - n = number of records in the file
  - R = number of records that fit in one block
• Lookup: given a key, find the corresponding record.
  - On average, n / (2R) block accesses.
• Insertion: add a record to the file (duplicates allowed).
  - Read the last block; a new block may need to be allocated.
  - Requires approximately 2 accesses.
• Deletion: delete a record.
  - Look up the record: n / (2R) accesses.
  - Write the block back to disk: 1 access.
  - Reorganize (unpinned): move a tuple from the last page to utilize the freed space (2 disk accesses).

Hashed Files
• The file is divided into B buckets.
• Hash function h maps elements of the key space to the range [0, B).
• The key space is large and unevenly distributed.
  - Example: SSNs as character strings
  - Each character takes on at most 10 of the possible 256 values.
• Hash function h must map key values evenly among a relatively small number of values.

Hash-based associative access
[Figure: the range of potential key values (shaded areas denote used key values) is mapped by FOLDING onto the range of positive integers, which HASHING maps onto the tuple address space.]

Folding
• Convert arbitrary data types to a positive integer that h can be applied to.
• Reduce the number of bits so that arithmetic is efficient.
• Example: the key is “Keefe” and 16803.
  - The key value is the concatenation of the byte representations of the individual fields:
      0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3
  - Partition the result into words and combine them using XOR:
          0x4b 0x65 0x65 0x66
      XOR 0x65 0x0  0x0  0x0
      XOR 0x41 0xa3 0x0  0x0
        = 0x6f 0xc6 0x65 0x66 = 1875273062

Hashing
• Goal of hashing
• How to choose the hash function if all the key values are uniformly distributed?
• The critical issue is to produce a 1:1 mapping.
• Collision: different inputs are mapped to the same output.
• The criterion for a good hash function is to keep the number of collisions as small as possible.
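
The folding example for “Keefe” and 16803 can be reproduced with a short Python sketch. The field layout is an assumption read off the byte string on the slide: the name occupies an 8-byte field padded with zero bytes, and the number is a 2-byte big-endian integer. The function name `fold` and its parameters are hypothetical, chosen for illustration.

```python
def fold(key_bytes: bytes, word_size: int = 4) -> int:
    """XOR-fold a byte string into a single word of word_size bytes."""
    # Pad with zero bytes so the length is a multiple of the word size.
    padding = (-len(key_bytes)) % word_size
    padded = key_bytes + b"\x00" * padding
    folded = 0
    # Interpret each word as a big-endian integer and XOR the words together.
    for i in range(0, len(padded), word_size):
        folded ^= int.from_bytes(padded[i:i + word_size], "big")
    return folded


# Assumed layout: 8-byte name field, then a 2-byte big-endian integer.
name_field = b"Keefe".ljust(8, b"\x00")
number_field = (16803).to_bytes(2, "big")
key = name_field + number_field

print(hex(fold(key)), fold(key))  # 0x6fc66566 1875273062
```

The 10 key bytes are padded to 12 bytes, giving three 4-byte words whose XOR matches the value on the slide.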
Static Hashing
• Input: folded key values
• Output: bytes (relative to the beginning of the file) or blocks?
  - Bytes are not good because of the varying tuple size.
  - A block/page is called a bucket.
• H: {0, …, 2^32 - 1} -> {0, …, B - 1}
• Contiguous allocation
• Fixed size: B pages are allocated at file creation time.
• Insert
  - Determine the bucket.
  - Check the bucket (a collision may happen).

How to find a good hash function
• Division / remainder (congruential hashing)
  - H(k_b) = k_b mod B, where k_b is the folded key value and B is the number of buckets.
• Nth power
  - Compute k_b^N, and from the resulting bit string (N x 31 bits) take log2(B) bits from the middle.
• Base transformation
• Polynomial division
• Numerical analysis
• Encryption

Performance
• Assumption: perfect hash function (tuples are uniformly distributed over B buckets)

  Operation   Cost                   Case
  Lookup      1/2 * (n/R) * (1/B)    stopping at the first match
  Lookup      (n/R) * (1/B)          if the tuple does not exist
  Insertion   (n/R) * (1/B) + 1      if testing for duplicates
  Insertion   1                      otherwise
  Deletion    1/2 * (n/R) * (1/B)    deleting the first match

Collision
• Two keys collide if they hash to the same value.
• A bucket with room for R tuples can accommodate R - 1 collisions before it overflows.
• Internal resolution: place the overflow in another bucket.
  - (h(K) + 1) mod B: linear probing
  - h2(h1(K)): multiple hashing

Collision (continued)
• External resolution: allocate an overflow block and link it into the bucket's overflow chain.
[Figure: buckets with chains of overflow pages.]

Discussion
• What are the disadvantages of static hashing?
• How do you limit the number of pages accessed when retrieving a tuple, for both external and internal resolution?
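
To make static hashing with external collision resolution concrete, here is a minimal in-memory sketch in Python. It assumes division/remainder hashing on already folded integer keys and models each bucket as a chain of pages holding at most R tuples. The class name `StaticHashFile` and its methods are hypothetical, and real pages would live on disk rather than in Python lists.

```python
class StaticHashFile:
    """Static hashing with overflow chains (external resolution)."""

    def __init__(self, num_buckets: int, page_capacity: int):
        self.B = num_buckets      # fixed at file creation time
        self.R = page_capacity    # tuples per page
        # Each bucket is a chain of pages; the first page is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _bucket(self, folded_key: int) -> int:
        # Division / remainder hashing: H(k_b) = k_b mod B.
        return folded_key % self.B

    def insert(self, folded_key: int) -> None:
        chain = self.buckets[self._bucket(folded_key)]
        if len(chain[-1]) == self.R:
            chain.append([])      # allocate and link a new overflow page
        chain[-1].append(folded_key)

    def lookup(self, folded_key: int):
        """Return (found, pages_read) for a lookup that stops at the first match."""
        chain = self.buckets[self._bucket(folded_key)]
        for pages_read, page in enumerate(chain, start=1):
            if folded_key in page:
                return True, pages_read
        return False, len(chain)


f = StaticHashFile(num_buckets=4, page_capacity=2)
for k in [0, 4, 8, 12, 1]:
    f.insert(k)

print(f.lookup(8))   # key sits in the overflow page of bucket 0
print(f.lookup(1))   # key sits in the primary page of bucket 1
print(f.lookup(16))  # miss: the whole chain of bucket 0 is read
```

Because B is fixed, a skewed or growing key set only lengthens the overflow chains, which is one way to see the disadvantage raised in the discussion questions.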
Trie
• The buckets will dynamically grow/shrink/balance.
• Fundamental: trie
[Figure: panels (a), (b), and (c) show tries with 0/1 branches over buckets A, B, C, and D, along with the 3-bit addresses 000 through 111.]

Dynamic hashing function
• We need a hash function whose range of values can change dynamically.
• One such hash function can be constructed using a series of functions h_i(k), i = 0, 1, …, such that for any k, either h_i(k) = h_(i-1)(k) or h_i(k) = h_(i-1)(k) + 2^(i-1).
• Choose H(k), which maps the key space into random bit patterns of length m, for m sufficiently large. Then h_i(k) may be defined as the integer formed by the last i bits of H(k).

Extendible Hashing
• The number of buckets can grow/shrink.
• An intermediate data structure translates the hash results into page addresses.
• This data structure needs to be as compact as possible.
• Hashes into an array of pointers to buckets (the directory).
• The array is small enough to be kept in memory.

Directory Growth
• To adapt to the dynamically varying size of the hash file, modify the directory size.
• Assume a hash function h(k_b) that produces a bit string s.
• The directory is of size 2^d. d is called the global depth and is initially 0.
• Use the least significant d bits of s to determine which bucket to access.
• Each bucket has a corresponding local depth in the range [0, d], which indicates how many least significant bits of s all the records in this bucket have in common.

Example
• Insert 0x13, 0x10, 0x07, 0x00, 0x1f.
• Each page can contain no more than 2 tuples.
[Figure: directory and buckets as the global depth grows from d = 0 to d = 1 to d = 2; the buckets shown include {0x13, 0x10}, {0x10, 0x00}, and {0x13, 0x07}, each labeled with its local depth.]

Example: insert 0x1f
[Figure: directory at d = 3 with entries 000 through 111; buckets {0x10, 0x00} (local depth 1), a bucket with local depth 2, {0x13} (local depth 3), and {0x07, 0x1f} (local depth 3).]
• local depth = global depth - log2(number of arrows pointing to this bucket)

Performance
• 2 steps for retrieving a tuple
• If we can keep the directory in memory, each retrieval is one page access.
• Assuming 4 bytes per entry, 4 KB pages, and 1 GB hash files, what is the minimum buffer size needed to keep the entire directory in memory?

Discussion
• How easy is it to keep the directory in memory?
• How do we reduce the structure when the file shrinks?
• How do you make the directory small and increase space utilization? (deferred splitting)

Linear hashing
[Figure: linear hashing. Starting from pages 00, 01, 10, 11 at d = 2, pages are split one at a time in address order (000/100, then 001/101, then 010/110, then 011/111) until d = 3. Labels a, b, c, d, w, x, and y appear in the panels.]

Addressing in linear hashing
• Which hash function should be used?
• d is the degree; p is the address of the next page to split.
• The algorithm is as follows:
    begin
      if (h_d(k) >= p) then page := h_d(k)
      else page := h_(d+1)(k);
      if necessary, chase the overflow chain
    end

Discuss
• Where to store the overflow?
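
The addressing rule for linear hashing can be sketched in Python as follows. The sketch assumes h_i(k) is the integer formed by the last i bits of H(k), as defined on the dynamic hashing function slide, and it takes the already hashed bit pattern as input so that no particular H has to be assumed. The function names `h` and `page_address` are hypothetical, and chasing the overflow chain is left out.

```python
def h(i: int, hashed: int) -> int:
    """h_i: the integer formed by the last i bits of the hashed value H(k)."""
    return hashed & ((1 << i) - 1)


def page_address(hashed: int, d: int, p: int) -> int:
    """Primary page for a key, given degree d and next page to split p."""
    page = h(d, hashed)
    if page < p:
        # Pages below p have already been split, so one more bit is needed.
        page = h(d + 1, hashed)
    return page


# d = 2 and p = 1: page 00 has been split into 000 and 100.
for hashed in (0b000, 0b100, 0b101, 0b111):
    print(format(hashed, "03b"), "->", page_address(hashed, d=2, p=1))
```

With d = 2 and p = 1, only keys whose last two bits are 00 are addressed with three bits; all other keys still use h_2.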