
Query Processing
CENG 352 Database Management Systems

The Role of Relational Algebra in a DBMS

Relational Operations
• We will consider how to implement:
  – Selection (σ)
  – Projection (π)
  – Join (⋈)

Computing Selection (attr op value)
• No index on attr:
  – If rows are not sorted on attr:
    • Scan all data pages to find rows satisfying the selection condition
    • Cost = F (the number of pages in the file)
  – If rows are sorted on attr and op is =, >, or <:
    • Use binary search (about log2 F page reads) to locate the first data page containing a row in which (attr = value)
    • Scan further to get all rows satisfying (attr op value)
    • Cost = log2 F + (cost of scan)

Computing Selection (attr op value)
• Primary B+ tree index on attr (for “=” or range search):
  – Locate the first index entry corresponding to a row in which (attr = value). Cost = depth of tree
  – Rows satisfying the condition are packed in sequence in successive data pages; scan those pages. Cost: number of pages occupied by qualifying rows
[Figure: B+ tree; index entries (containing rows) that satisfy the condition]

Computing Selection (attr op value)
• Secondary B+ tree index on attr (for “=” or range search):
  – Locate the first index entry corresponding to a row in which (attr = value). Cost = depth of tree
  – Index entries with pointers to rows satisfying the condition are packed in sequence in successive index pages
    • Scan the entries and sort the record IDs to identify the table data pages with qualifying rows. Any page that has at least one such row must be fetched once.
    • Cost: at most the number of rows that satisfy the selection condition

Computing Selection (attr = value)
• Hash index on attr (for “=” search only):
  – Hash on value.
    Cost ≈ 1.2
    • 1.2 is the typical average cost of hashing (> 1 due to possible overflow chains)
    • Finds the (unique) bucket containing all index entries satisfying the selection condition
    • Clustered index: all qualifying rows are packed in the bucket (a few pages). Cost: number of pages occupied by the bucket
    • Unclustered index: sort the row IDs in the index entries to identify data pages with qualifying rows. Each page containing at least one such row must be fetched once. Cost: at most the number of rows in the bucket

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Similar to old schema; rname added for variations.
• Reserves:
  – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
  – Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Simple Selections
SELECT * FROM Reserves R WHERE R.rname = ‘Joe’
• With no index, unsorted: Must essentially scan the whole relation; cost is 1000 I/Os (#pages in R).
• With no index, sorted data: Utilize the sort order on rname by doing a binary search to locate the first Joe. Cost is log2 1000 ≈ 10 I/Os.
• With a B+ tree index on the selection attribute: Use the index to find qualifying data entries, then retrieve the corresponding data records. Cost of finding the starting page is 2 or 3 I/Os; for a clustered index add one more I/O; for an unclustered index add one page per qualifying tuple.
• Hash index: 1 or 2 I/Os to retrieve the index pages. If there are 100 reservations by Joe, an additional 1 to 100 disk accesses are needed, depending on how these records are distributed.

Using an Index for Selections
SELECT * FROM Reserves R WHERE R.rname < ‘C%’
• Cost depends on #qualifying tuples and on clustering.
• Assume we estimate that roughly 10% of Reserves tuples will be in the result (= 10,000 tuples, or 100 pages).
  – With a primary B+ tree index: cost is 100 I/Os plus 1 or 2 disk accesses for the index.
  – With a secondary B+ tree index: cost could be as high as 10,000 I/Os in the worst case (it might be cheaper to simply scan the entire relation).

The Projection Operation
• To implement projection we have to do the following:
  1. Remove unwanted attributes.
  2. Eliminate any duplicate tuples produced.
• The expensive part is removing duplicates.
  – SQL systems don’t remove duplicates unless the keyword DISTINCT is specified in a query.
• There are two basic algorithms:
  1. Sorting Approach.
  2. Hashing Approach.

Approach based on sorting
• Modify Pass 1 of external sort to eliminate unwanted fields. If B buffer pages are available, runs of about 2B pages can be produced, but tuples in runs are smaller than input tuples. (Size ratio depends on # and size of fields that are dropped.)
• Modify merging passes to eliminate duplicates. Thus, the number of result tuples is smaller than the input. (Difference depends on # of duplicates.)

Example
SELECT DISTINCT R.sid, R.bid FROM Reserves R
Cost:
• In Pass 1, read the original relation (1000 pages) and write out the same number of smaller tuples.
  – Assume we have 20 buffer pages.
  – Assume that each smaller tuple is 10 bytes long.
  – Thus the projected tuples occupy 250 pages, written as 7 runs of about 40 pages each.
• In merging passes, fewer tuples are written out in each pass.
  – The temporary relation can be merged in 1 pass, since we have 20 buffer pages.
  – We read the runs at a cost of 250 I/Os and merge them.
• The total cost is 1000 + 250 + 250 = 1500 I/Os.

Computing Joins: R ⋈ (A=B) S
• The cost of joining two relations makes the choice of a join algorithm crucial.
• Assume: M pages in R, pR tuples per page, N pages in S, pS tuples per page.
  – In our examples, R is Reserves and S is Sailors.
• Cost metric: # of I/Os. We will ignore output costs.
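Returning to the sort-based projection example (SELECT DISTINCT R.sid, R.bid), the 1500 I/O total can be checked with a short Python sketch. The function name and its parameters are illustrative and not part of the slides. The sketch assumes Pass 1 produces runs of about 2B pages, each merge pass combines up to B-1 runs, intermediate passes do not shrink the data, and the final output write is not counted.

```python
import math

def sort_projection_cost(input_pages, in_tuple_bytes, out_tuple_bytes, buffer_pages):
    """Rough I/O estimate for sort-based projection with duplicate elimination.

    Hypothetical helper for illustration; see the assumptions in the comments.
    """
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    # Size of the relation after unwanted fields are dropped in Pass 1.
    projected_pages = math.ceil(input_pages * out_tuple_bytes / in_tuple_bytes)
    # Assumption from the slides: Pass 1 yields runs of about 2B pages.
    runs = math.ceil(projected_pages / (2 * buffer_pages))
    # Assumption: each merge pass combines up to B-1 runs, and at least one
    # pass over the runs is needed to eliminate duplicates.
    merge_passes, remaining = 0, runs
    while remaining > 1:
        remaining = math.ceil(remaining / (buffer_pages - 1))
        merge_passes += 1
    merge_passes = max(merge_passes, 1)
    # Pass 1 reads the input and writes the projected tuples.
    pass1_io = input_pages + projected_pages
    # Intermediate merge passes read and write; the last pass only reads.
    merge_io = projected_pages * (2 * (merge_passes - 1) + 1)
    return {
        "projected_pages": projected_pages,
        "runs": runs,
        "merge_passes": merge_passes,
        "total_io": pass1_io + merge_io,
    }

# Reserves: 1000 pages of 40-byte tuples, projected to 10 bytes, 20 buffers.
print(sort_projection_cost(1000, 40, 10, 20))
```

With the slide's numbers this gives 250 projected pages, 7 runs, one merge pass, and 1500 I/Os in total.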
Simple Nested Loops Join
  foreach tuple r in R do
    foreach tuple s in S do
      if r_i == s_j then add <r, s> to result
• For each tuple in the outer relation R, we scan the entire inner relation S.
  – Cost: M + pR * M * N = 1000 + 100*1000*500 I/Os.
• Page-oriented Nested Loops join: For each page of R, get each page of S, and write out matching pairs of tuples <r, s>, where r is in R-page and s is in S-page.
  – Cost: M + M*N = 1000 + 1000*500
  – If the smaller relation (S) is the outer, cost = 500 + 500*1000
• Choose the smaller relation for the outer loop.

Block Nested Loops Join
• Use one page as an input buffer for scanning the inner S, one page as the output buffer, and use all remaining pages to hold a “block” of outer R.
  – For each matching tuple r in R-block, s in S-page, add <r, s> to result. Then read the next R-block, scan S, etc.
• Cost can be reduced to M + ⌈M/(B-2)⌉ * N (by using B buffer pages instead of 1).
[Figure: R and S on disk; hash table for block of R (B-2 pages); input buffer for S; output buffer; join result]

Examples of Block Nested Loops
• Cost: Scan of outer + #outer blocks * scan of inner
  – #outer blocks = ⌈# of pages of outer / blocksize⌉
  – i.e. Cost = M + N * ⌈M/(B-2)⌉
• With Reserves (R) as outer, and 100-page blocks of R:
  – Cost of scanning R is 1000 I/Os; a total of 10 blocks.
  – Per block of R, we scan Sailors (S); 10*500 I/Os.
  – If there is space for just 90 pages of R, we would scan S 12 times.
• With 100-page blocks of Sailors as outer:
  – Cost of scanning S is 500 I/Os; a total of 5 blocks.
  – Per block of S, we scan Reserves; 5*1000 I/Os.

Index Nested Loops Join
  foreach tuple r in R do
    foreach tuple s in S where r_i == s_j do
      add <r, s> to result
• If there is an index on the join column of one relation (say S), we can make it the inner and exploit the index.
  – Cost: M + ((M*pR) * cost of finding matching S tuples)
• For each R tuple, the cost of probing the S index is about 1.2 for a hash index, 2-4 for a B+ tree. The cost of then finding S tuples depends on clustering.
  – Clustered index: 1 I/O (typical); unclustered: up to 1 I/O per matching S tuple.
• Effective if the number of rows of S that match tuples in R is small and the index is clustered.

Examples of Index Nested Loops
• Hash-index (unclustered) on sid of Sailors (as inner):
  – Scan Reserves: 1000 page I/Os, 100*1000 tuples.
  – For each Reserves tuple: 1.2 I/Os to get the data entry in the index, plus 1 I/O to get (the exactly one) matching Sailors tuple. That is 2.2 * 100,000 = 220,000 I/Os for the lookups, or 221,000 I/Os including the scan.
• Hash-index (unclustered) on sid of Reserves (as inner):
  – Scan Sailors: 500 page I/Os, 80*500 tuples.
  – For each Sailors tuple: 1.2 I/Os to find the index page with data entries (1.2 * 40,000 = 48,000 I/Os; 48,500 with the scan), plus the cost of retrieving matching Reserves tuples. Assuming a uniform distribution, there are 2.5 reservations per sailor (100,000 / 40,000), so the cost of retrieving them is 2.5 * 40,000 = 100,000 I/Os. Total cost is 48,500 + 100,000 = 148,500 I/Os (still better than simple nested loops).
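The join cost formulas from the preceding slides can be collected into a small Python sketch for comparing the alternatives on the Reserves/Sailors numbers. The function names and argument names are illustrative and not from the slides. As in the slides, only I/Os are counted and output cost is ignored; the index variant rounds the per-tuple cost total to a whole number of I/Os.

```python
import math

def simple_nested_loops(M, p_R, N):
    # Scan the outer once, then scan the inner once per outer tuple.
    return M + p_R * M * N

def page_nested_loops(M, N):
    # Scan the outer once, then scan the inner once per outer page.
    return M + M * N

def block_nested_loops(M, N, block_pages):
    # block_pages plays the role of B-2: pages available for the outer block.
    return M + math.ceil(M / block_pages) * N

def index_nested_loops(M, p_R, probe_cost, fetch_cost):
    # probe_cost: I/Os to find the data entry (about 1.2 for a hash index).
    # fetch_cost: I/Os to fetch the matching inner tuples per outer tuple.
    return M + round(M * p_R * (probe_cost + fetch_cost))

# Reserves: 1000 pages, 100 tuples/page. Sailors: 500 pages, 80 tuples/page.
R_PAGES, R_TPP = 1000, 100
S_PAGES, S_TPP = 500, 80

print("simple NL, R outer:", simple_nested_loops(R_PAGES, R_TPP, S_PAGES))
print("page NL, R outer:  ", page_nested_loops(R_PAGES, S_PAGES))
print("page NL, S outer:  ", page_nested_loops(S_PAGES, R_PAGES))
print("block NL, R outer, 100-page blocks:", block_nested_loops(R_PAGES, S_PAGES, 100))
print("block NL, R outer, 90-page blocks: ", block_nested_loops(R_PAGES, S_PAGES, 90))
print("block NL, S outer, 100-page blocks:", block_nested_loops(S_PAGES, R_PAGES, 100))
print("index NL, Sailors inner: ", index_nested_loops(R_PAGES, R_TPP, 1.2, 1.0))
print("index NL, Reserves inner:", index_nested_loops(S_PAGES, S_TPP, 1.2, 2.5))
```

Under these assumptions block nested loops with a 100-page block is far cheaper than either index nested loops plan, because the unclustered hash index costs at least one I/O per outer tuple.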
