Download Slides 01 - University of California, Irvine

ICS 224: Database Management Systems
Spring 2011
Professor Sharad Mehrotra
Information and Computer Science Department
University of California, Irvine

Course General Info
• URL: http://www.ics.uci.edu/~cs224/
  – All course info will be posted online
• Lecture times: Tue-Thurs 5 – 6.30
• Instructor: Sharad Mehrotra, BH 2082, [email protected]
• Office Hours: on request
ICS214A Notes 01

Prerequisites
• Basic Data Management Concepts:
  – DB design, relational model, SQL, database programming
    · CS 122 or equivalent
  – Database system implementation
    · Indexing, query optimization, query processing, storage management, etc.
    · ICS 222 or equivalent
• Basic Computer Science Concepts:
  – Depth-first search, directed/undirected graphs, “big O” notation, computational complexity, NP completeness …

Course Requirements
• Class Participation: 50%
  – Attendance, presentations, comments, interaction, enthusiasm, etc.
• Class Projects: 50%
  – Implementation Oriented:
    · Take an idea/topic, identify a project, get it okayed by the instructor, develop a demonstration
  – Survey of an area
    · In-depth survey in the style of Computing Surveys articles. Provide your own perspective on a subarea.
  – MUST commit to a project by the end of the 2nd week.

Class Structure
• Each week we will
  – Pick a topic
  – Identify 1 paper per student/group of 2 students
  – 2 papers as lead papers for presentation (one for each class), others presented as short presentations
• Each week
  – Start with overview
  – Lead paper presentation
  – Short presentation of other papers (main idea)
  – Discussions

This course …
Most important ideas in data management (instructor’s pick)
But with an eye towards an end application …
Sentient spaces

Sentient Spaces …
• Spaces in which sensors are used to capture the dynamic, evolving state, which is then analyzed for implementing adaptations.
• Numerous examples …
  – intelligent transportation systems
  – reconnaissance
  – surveillance systems
  – smart buildings
  – smart grid ...

Example: Smart Video Surveillance
[Figure: components shown are Query, Query Analysis, Event Database, Semantic Extraction, Surveillance Video Database, and Video collection (CS Building in UC Irvine).]

Implications of Sentient Space focus ..
• Class focuses on topics which you might need to know if you wanted to explore applications in sentient spaces …
• Projects should target something about sentient spaces …
  – E.g., data cleaning of sentient data, data model to represent sentient spaces, …

Data Models (2 weeks)
  – Representing time: TSQL2
  – Representing space
  – Querying streaming data: CQL, ASQL
  – Semi-structured data: OEM, Lore

New Ideas in Storage & Indexing (2 weeks)
• New storage models
  – Key-Value store
  – Bigtable
  – Column Stores
• New database system architecture
  – Data outsourcing
  – Multitenant databases
• New indexing techniques
  – Correlation maps

Data Quality (2 weeks)
• Data quality issues
  – Inaccuracy, incompleteness, ambiguity, errors, …
• Two aspects:
  – Techniques to improve quality
    · Exploiting contextual knowledge, issues of efficiency
  – Techniques to tolerate poor quality of data in applications.

New Computing Architecture (2 weeks)
• Map Reduce framework
• Hive
• Pig Latin
• Join processing
• HadoopDB
• Hyrax?

Data Privacy (2 weeks)
• Use cases
  – Data publishing, queries, sharing, data outsourcing.
• Diverse criteria
  – Differential privacy, anonymity, l-diversity, ..
• Mechanisms to implement

A walk down the history of data models …
Two papers (MUST READ)
• Inclusion of New Types in Relational Data Base Systems, Stonebraker
• The POSTGRES Next-Generation Database Management System, Stonebraker.

The Paleolithic Period …
• There were no general purpose tools for managing large volumes of data…
  – OS provided resource management
  – Data was stored in files
  – Applications performed data management functionalities
    · Fault-tolerance
    · Concurrency control
    · Reliability
    · Optimizations
    · …
  – Such functionalities had to be re-implemented for each application

The Neolithic Period…
• Early file systems evolve into general-purpose data management tools.
• DBMS Goals:
  – Efficiency and scalability (faster than files)
  – Management of large heterogeneous types of structured data
  – High reliability
  – Information sharing (multiple users)
• DBMS Users:
  – E-commerce companies, banks, airlines, transportation companies, corporate databases, government agencies, …
  – Anyone you can think of!

The Dark Ages ….
• Network & hierarchical data models
  – Resulted in data spaghetti
  – Applications needed to chase pointers
  – There was little data abstraction or separation of concerns
    · little difference between physical data representation and logical data representation
  – Optimization was entirely left to application writers
  – There were no clean data management languages
    · Unless you are a Cobol fan!

The Relational Era..
• Relational model proposed by Codd
  – Everything is a relation
  – A query consists of an algebraic composition of a few powerful operators
  – Equivalent to a first-order relational calculus
• Primary features
  – Simple, clean data representation
    · solid mathematical basis
  – Data abstraction
    · Users did not need to be concerned about how data is stored physically
  – Simple declarative query language
    · Users specify what to compute, not how to do it.
  – Optimization by the system

Data Wars (1)
• Codasyl versus relational debates began…
  – Heated arguments during early SIGMOD conferences
  – Codasyl: relational model is too simple, applications built using it will never scale in performance.
  – Relational: network/hierarchical models have no formal basis, are too complex, and become unmanageable as application complexity increases.
• Relational model found many supporters
  – Especially at universities
  – Its simplicity was enticing

Data Wars (2)
• Many projects started off trying to implement a relational DBMS
  – System R @ IBM Almaden
  – Ingres @ Berkeley
  – These early systems led to the technologies that drive modern data management
• Early prototypes became products
  – DB2 & Ingres
• Principal designers from both the System R and Ingres teams left to start companies
  – Oracle, Sybase
• Early relational companies went door to door converting industry to the relational model
  – Industry got hooked on the simplicity of writing complex applications in the relational model
  – Boeing among the first converts

Pointers Strike Back…
[Figure: application data structures reach an RDBMS through copy and translation into a relational representation, and an ODBMS through transparent data transfer.]
• Complex objects in emerging DBMS applications cannot be effectively represented as records in the relational model.
• Representing information in RDBMSs requires complex and inefficient conversion between the relational model and the application programming language.
• ODBMSs provide a direct representation of objects to DBMSs, overcoming the impedance mismatch problem.

Object Model
• Object:
  – observable entity in the world being modeled
  – similar to the concept of an entity in the E/R model
• An object consists of:
  – attributes: properties built from primitive types
  – relationships: properties whose type is a reference to some other object or a collection of references
  – methods: functions that may be applied to the object.
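
The three kinds of properties in the object model above can be illustrated with a minimal Python sketch. The class names `Department` and `Employee`, their fields, and the helper `assign` are hypothetical and chosen only for illustration; they do not come from the slides or from any particular ODBMS.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity-based comparison, which avoids recursive
# equality checks between objects that reference each other.
@dataclass(eq=False)
class Department:
    name: str                                   # attribute: primitive type
    members: List["Employee"] = field(default_factory=list)  # relationship: collection of references


@dataclass(eq=False)
class Employee:
    name: str                                   # attribute
    salary: float                               # attribute
    dept: Optional[Department] = None           # relationship: reference to another object

    def raise_salary(self, pct: float) -> None:
        """Method: a function applied to the object itself."""
        self.salary *= 1 + pct / 100


def assign(emp: Employee, dept: Department) -> None:
    """Hypothetical helper that keeps both sides of the relationship in sync."""
    emp.dept = dept
    dept.members.append(emp)


sales = Department("Sales")
joe = Employee("Joe", 1000.0)
assign(joe, sales)
joe.raise_salary(10)
```

Here `dept` and `members` hold references rather than copied key values, which is the distinction between a relationship and an attribute in the list above.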

Object Oriented Databases
• Evolved as persistent object-oriented programming languages:
• Start with an OO language (e.g., C++, Java, Smalltalk) which has a rich type system
• Add persistence to the objects in the programming language, with persistent objects stored in databases

Persistent Programming Languages
• Single programming language for application and data management
[Slide example, garbled in extraction: assignment statements over an ordinary variable i, array elements a[j] and a[j+1], and Employee → Spouse → benefit_level.]
• An update to a persistent variable results in an automatic update to the database.
• Persistent data can be of types such as sets, lists, and arrays.
• Application can follow pointers (OIDs) to navigate through data.

Persistence
• Objects created may have different lifetimes:
  – transient: allocated memory managed by the programming language run-time system.
    · E.g., local variables in procedures have the lifetime of a procedure execution
    · global variables have the lifetime of a program execution
  – persistent: allocated memory and storage managed by the ODBMS run-time system.
• Classes are declared to be persistence-capable or transient.
• Different languages have different mechanisms to make objects persistent:
  – creation time: object declared persistent at creation time (e.g., in the C++ binding) (class must be persistence-capable)
  – persistence by reachability: object is persistent if it can be reached from a persistent object (e.g., in the Java binding) (class must be persistence-capable).

Persistent Object-Oriented Programming Languages
• Persistent objects are stored in the database and accessed from the programming language.
• Single programming language for applications as well as data management.
  – Avoids having to translate data between the application programming language and the DBMS
    · efficient implementation
    · less code
  – Programmer does not need to write explicit code to move data to and from the database
    · persistent objects look exactly the same to the programmer as transient objects.
    · System automatically moves objects between memory and the storage device (pointer swizzling).

Approaches To Persistent Programming
• Persistent Virtual Memory
  – Disk representation and memory representation of data are identical.
  – No cost to translate data from one representation to another -- efficient!
  – DB size limited to the address space
    · 32-bit processor → 2^32 bytes of addressability (4 GBytes)
  – Differentiating persistent objects and non-persistent objects is difficult.
  – Difficult to optimize disk layout and locality of access.
  – Example system using this approach: ObjectStore.

Approaches To Persistent Programming Languages
• Store persistent objects in files
  – Objects brought to memory on demand.
  – Implementation of OIDs is complex since pointers do not suffice in general.
    · If the object is in memory, a pointer can be used as the OID
    · If the object is on disk, a disk address is still not good as an OID since storage can be reorganized. A separate mechanism is needed.
    · Pointer swizzling for efficiency.

Challenges In Building Persistent Languages
• Efficient caching of objects in the client address space.
  – Cache coherence.
• In an OODB, data migrates to the clients, unlike relational client-server systems where the query migrates to the server.
• Given a large number of clients, each with its own cache of objects, ensuring consistency of an object across multiple clients is a challenge.

Disadvantages of ODBMS Approach
• Low protection
  – since persistent objects are manipulated directly from applications, there are more chances that errors in applications can violate data integrity.
• Non-declarative interface:
  – difficult to optimize queries
  – difficult to express queries
• But …..
  – Most ODBMSs offer a declarative query language, OQL, to overcome the problem.
  – OQL is very similar to SQL and can be optimized effectively.
  – OQL can be invoked from inside the ODBMS programming language.
  – Objects can be manipulated both within OQL and the programming language without explicitly transferring values between the two languages.
  – OQL embedding maintains the simplicity of the ODBMS programming language interface and yet provides declarative access.

The Return of the Relations … POSTGRES
• Relational model evolved into ORDBMSs that include the “best of” object-oriented concepts
• Among the first ORDBMS prototypes, built @ Berkeley: POSTGRES → commercialized as Illustra → bought by Informix → IUS
• Has had a major impact on the major commercial DBMSs, which have all migrated to the ORDBMS model.
• SQL3, supported by modern databases, adopted many of the concepts developed in Postgres

POSTGRES -- Combinations
• Introduced object orientation into relational DBMSs.
• Fundamental concepts.
  – Each record has an OID.
  – Access to data through:
    · query language POSTQUEL.
    · navigation through OIDs.
  – Classes:
  – Inheritance:
  – Types: rich set of types available for columns.
  – Functions: can be called within POSTQUEL.

Classes And Inheritance
• Class analogous to relation
• User can create a new class
    create Emp (name = c12, salary = float, age = int)
• Classes can inherit from others
    create Salesman (quota = float) inherits Emp
• Multiple inheritance permitted. If a new class causes ambiguity, it is not created.
• Classes:
  – real: base classes or relations
  – derived: views
  – version: maintained differentially compared to parent class

Types In POSTGRES
• Standard base types
  – float, int, character strings, etc.
  – Abstract data type (ADT) facility to create new base types, e.g.:
    create type point (x = int, y = int)
    create type polygon
• ADTs can be used in class definitions.
    create Dept (
      dname = c10,
      mgr = c12,
      floorspace = polygon,
      mailstop = point
    )

Functions In POSTGRES
• Three types: (1) C functions (2) Operators (3) POSTQUEL functions
• C functions
  – any C function over base types or composite types
    retrieve (Dept.name) where area (Dept.floorspace) > 500
    retrieve (Emp.name) where overpaid (Emp)
  – overpaid is a function over a class (a method)

Operators
• Arbitrary C functions are not optimized by query optimizers.
  – Special functions, called operators, can utilize indexes for their evaluation.
• Operator: function with 1 or 2 operands, e.g., AGT (Area Greater Than)
    retrieve (Dept.name) where Dept.floorspace AGT “(0,0), (1,1), (0,2)”
• A properly defined index (e.g., B-tree) can be used to speed up evaluation of operators such as AGT.

Other Features Of POSTGRES
• Allowed creation of new indices by the user.
• To an extent pioneered the approach of extensible database technology, which is prevalent with vendors today
• Supported transitive closure in queries.
    retrieve* into answer (parent.older) from a in answer
    where parent.younger = “John” or parent.younger = a.older
• Supported rules

POSTQUEL Functions
• Any collection of commands in POSTQUEL.
  – query = POSTQUEL function.
    define function high-pay returns Emp as
    retrieve (Emp.all) where Emp.salary > 50k
• POSTQUEL function with parameters.
    define function Sal-lookup (c12) returns float as
    retrieve (Emp.salary) where Emp.name = $1
• Usage of a POSTQUEL function
    retrieve (Emp.name) where Emp.salary = Sal-lookup (“Joe”)

Composite Types In POSTGRES
• POSTQUEL:
  – Composite types are accessed via path expressions, using nested dot notation.
    retrieve (Emp.mgr.age) where (Emp.name = ‘joe’)
• Prevents having to specify a join.
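
To see why a path expression saves a join, the sketch below contrasts the two styles in Python. It is only an illustration under stated assumptions: the `Emp` class, the flat `rows` layout with a `mgr_name` column, and both function names are hypothetical and are not POSTGRES code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(eq=False)
class Emp:
    name: str
    age: int
    mgr: Optional["Emp"] = None   # composite-typed attribute: a reference to another Emp


def mgr_age_by_path(emps: List[Emp], name: str) -> List[int]:
    """Path-expression style: follow the mgr reference directly."""
    return [e.mgr.age for e in emps if e.name == name and e.mgr is not None]


def mgr_age_by_join(rows: List[Dict], name: str) -> List[int]:
    """Flat relational style: self-join the rows on mgr_name = name."""
    return [m["age"]
            for e in rows
            for m in rows
            if e["name"] == name and e["mgr_name"] == m["name"]]


ann = Emp("ann", 50)
joe = Emp("joe", 30, mgr=ann)

rows = [
    {"name": "ann", "age": 50, "mgr_name": None},
    {"name": "joe", "age": 30, "mgr_name": "ann"},
]

print(mgr_age_by_path([ann, joe], "joe"))
print(mgr_age_by_join(rows, "joe"))
```

Both calls return the manager's age for joe; the first navigates a stored reference, while the second has to match employee rows against manager rows explicitly.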

Composite Types In POSTGRES
• Attributes can have a class name as a type, resulting in complex objects with structure.
    create Emp (
      name = c12,
      salary = float [c12],
      age = int,
      mgr = Emp,
      coworker = Emp
    )
  (Slide annotation on the Emp-typed attributes: refers to 0 or more references of the Emp class.)
• A set type that can hold elements of any class.
    add to Emp (hobbies = set)
  (Slide annotation: could be elements of any class.)

Types In POSTGRES
• Array type (constructor)
    create Emp (
      name = c12,
      salary = float [12],
      age = int
    )
  (Slide annotation: salary for each month.)
• POSTQUEL query (array usage in a query)
    retrieve (Emp.name) where (Emp.salary [4] = 1000)

Database Technology Matrix
                      Simple data types    Complex data types
  Query support: YES  RDBMSs               ORDBMSs
  Query support: NO   File System          OODBMSs

XML & RDF - the new revolution
• Just when the relational model had driven out object-oriented database technology, the WWW led to the proliferation of semi-structured data.
• 2 approaches to supporting XML/RDF
  – Extend relational technology to support XML/RDF
  – Native XML databases

Summary of Evolution of Data Model
• The Dark Ages: network & hierarchical models
• Victory of simplicity and beauty over data spaghetti: the relational DBMS
• The pointers strike back -- object orientation, OODBMSs
• The return of the relations -- ORDBMS -- took the best of the OO concepts and incorporated them in the relational model.
• The current and near future -- support for XML & RDF
• The final frontier -- anyone’s guess!

Key Data Management Technologies (quick review)…

Key Database Technologies
• File Management
  – provides a file abstraction as a collection of records stored on disk
• Index Management and Access Methods
  – implements techniques for associative access to data
• Query Optimization and Processing
  – given a query and data storage structures, determines an efficient strategy to evaluate the query.
• Transaction Management
  – ensures consistency of the database in the presence of concurrent transactions and various types of failures
• Catalog Management
  – maintains database schema information
• Authorization and Integrity Management
  – tests for integrity constraints and user authorization

Database Management System Architecture
[Figure: components shown are Application, Queries, Schema changes, compilers, Metadata and data dictionary, optimizer, evaluator, Query processor, Buffer manager, Transaction Manager, File system, Storage manager, and Database and Indices.]

Storage Media and their Properties
• Main Memory
  – costs $100/Mbyte -- reduces every year
  – ‘volatile’ -- does not survive system failures
  – random I/O very fast
  – data can be processed by CPU directly
  – capacity limited to orders of magnitude lower than what the database needs.
• Magnetic Disk
  – costs $0.50/Mbyte -- reduces each year
  – non-volatile (except when the disk crashes)
  – random I/O not as fast
  – CPU cannot directly process data. It needs to be transferred to main memory
• Tape
  – Cheaper but slower than disks. Sequential I/O devices. Handy for backups, sometimes for archival.

Databases and Storage Devices
• Due to capacity, cost, and volatility factors, databases are traditionally stored on disks.
• Data is brought from disks to main memory for processing
• There are many ways to interface memory with disk-resident data
• E.g., virtual memory:
  – VM size limited to the max address generated by the CPU
  – Existing VM does not support durability
• A file system provides a more powerful mapping between memory and disk storage
• A bunch of tricks are used to ensure that the high latency of secondary storage does not impact application response time and system throughput
  – access disks asynchronously with active applications
  – prefetch data before the application needs it
  – intelligent caching techniques

Functional Abstraction of a Simplistic DBMS
[Figure: layered view. Transactions (beginT SQL SQL endT) submit SQL statements to the query processor, which includes the access plan optimizer; the query processor reads and writes records and scans relations through the record-oriented file system; that layer gets pages containing tuples from the buffer manager; the buffer manager reads/writes file pages through the basic file system, which sits on the hardware.]

Basic File System
• Provides the abstraction of a file, where a file is an array of fixed-size blocks
• Hides the disk geometry (cylinders, tracks, sectors, slots) and other functional components like arms, heads, etc., so that programs do not need to deal with these complexities
• Operations supported:
  – create a file
  – delete a file
  – open a file
  – close a file
  – extend a file
  – read (set of) file blocks into buffers in memory
  – write (set of) file blocks

Basic File System Design Issues
• File allocation: how to allocate blocks on disk to a file.
  – Contiguous allocation: file stored in contiguous disk blocks. Blocks for storing the file are found using one of the best-fit, worst-fit, or first-fit policies.
    · +ve: provides fast sequential scan of the file
    · -ve: fragmentation, difficult to enlarge files
  – Linked allocation: file is a linked list of disk blocks
    · +ve: prevents fragmentation, easy to enlarge files
    · -ve: slow for both sequential and random access
  – Index allocation: file implemented using fixed-size blocks pointed to by an index (e.g., B-tree).
    Popularized by Unix
    · +ve: good random access, easy enlargement, no fragmentation.
    · -ve: poor sequential access performance
  – Extent-based allocation: file is a collection of clusters of consecutive disk blocks (extents), where the collection is maintained using linked lists or an index
    · Most popular approach with vendors.
• Free space management: information about which blocks are free

Buffer Management
• Makes file pages addressable in memory and coordinates writing of pages to disk with other components to guarantee transactional properties
• Acts as a mediator between the basic file system and the record-oriented file system
• Buffer frames are maintained in main memory. When a request for file page access comes, check if the page is in the buffer. Else get a free frame and load the file page into the buffer
• Operations supported:
  – bufferfix
  – bufferunfix
  – get block
  – flush

Database Buffer Management Design Issues
• DBMS buffer manager returns a pointer to the frame containing the data instead of returning a copy of the requested page to the caller.
  – Efficiency: prevents unnecessary copying of data
  – Allows sharing of data at a finer granularity than a page
    · 2 transactions T1 and T2.
    · T1 and T2 update records r1 and r2 on the same page
    · if the buffer manager allowed applications to copy data to their address space and rewrite updated versions, updates might be lost
• Database buffer manager participates in protocols to implement transactions (WAL, FL@C, pinning buffer slots)
• Novel page replacement strategies:
  – The traditional LRU strategy used in operating systems works well only under the assumption of locality of reference, which may not hold for DBMSs
  – Since DBMS query languages are declarative, the system has much more information about reference patterns, which it can exploit to improve the caching performance of the buffer manager

Record-Oriented File System
• Provides the abstraction of a file as a collection of records.
• Records can be:
  – fixed size or variable length
  – short, long, or very long
  – attributes can be fixed length or variable length
  – simple or complex (e.g., containing set-valued attributes)
• Operations supported:
  – create, delete, open, close, alter, drop
  – read, insert, update, delete record
  – scan all records in a file
• Issues involved:
  – mapping records to pages
  – file organization: organization of records in a file.
    · where to insert new records
    · what mechanism can be used to retrieve records

Index Management and Associative Access
• Associative access: accessing records based on their attribute values.
• Index Files
  – an index file declared over a (set of) attribute(s) of the data file provides associative access to records in the data file.
  – The index file contains pointers to the disk blocks where the records corresponding to a value appear.
• Types of index (let the indexing attribute be A):
  – primary: A is a key and the data file is stored sorted on A
  – clustered: A is not a key but the data file is stored sorted on A
  – secondary (key): A is a key but the data file is not sorted on A
  – secondary (non-key): A is not a key and the data file is not sorted on A.

Organization of Index File
• B-tree Index: index file is organized as a B-tree
  – Advantages:
    · Supports range searches efficiently.
    · E.g., retrieve all employees with salary between 100K and 200K
    · Guaranteed good storage utilization
  – Disadvantages:
    · searching for a given record could take around 3-4 disk I/Os
• Hash Index: index file maintained as a hash file.
  – Advantages:
    · Looking up a specific record is very efficient -- 1 disk I/O
  – Disadvantages:
    · cannot support range searches
• Multidimensional Access Methods
  – modern databases are beginning to support novel data structures like R-trees, grid files, and inverted lists to better serve emerging application requirements

Multidimensional Indexing Motivation
• Many applications of databases are geographical = 2-d data.
  Others involve a large number of dimensions
• Examples:
  – Location of restaurants in a city.
  – Map data: zones, county lines, rivers, lakes, etc. (Data has spatial extent)
  – Sales information described by store, day, item, color, size, etc. Sale = point in multidimensional space.
  – Student described by age, zipcode, marital status.
• Queries:
  – Range query: “find all McDonald's restaurants within a given region”.
  – Nearest neighbor query: find the nearest McDonald's to my house
  – Partial match queries

Approach: Utilize Single Dimensional Index
• Index on attributes independently
• Project the query range onto each attribute and determine pointers.
• Intersect the pointers, then go to the database and retrieve the objects in the intersection.
• May result in very high I/O cost

R-tree Data Structure
• Extension of the B-tree to multidimensional space.
• Paginated, balanced, guaranteed storage utilization.
• Can support both point data and data with spatial extent
• Groups objects into possibly overlapping clusters (rectangles in our case)
• Search for a range query proceeds along all paths that overlap with the query.

Split Node
• Given a node, split it into two nodes which are each at least half full
• Multiple objectives:
  – minimize overlap
  – minimize covered area
• R-tree minimizes covered area
• What is the optimal criterion?
[Figure: two alternative splits, labeled “Minimize overlap” and “Minimize covered area”.]

Minimizing Covered Area
• Group objects into 2 parts such that the covered area is minimized
• NP-hard!!
• Hence use heuristics
• Two heuristics explored
  – quadratic and linear

Other Multidimensional Data Structures
• Many generalizations of the R-tree
  – different splitting criteria
  – different shapes of clusters (e.g., d-dimensional spheres)
  – adding redundancy to reduce search cost:
    · store objects in multiple rectangles instead of a single rectangle to reduce the cost of retrieval. But now insert has to store objects in many clusters.
      This strategy also increases overlap, causing search performance to deteriorate.
• Space Partitioning Data Structures
  – unlike R-trees, which group objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions.
  – E.g., KD-tree, quad tree, grid files, KD-B-tree, hB-tree, hybrid tree.
• Space filling curves
  – superimpose an ordering on multidimensional space that preserves proximity in multidimensional space (Z-ordering, Hilbert ordering)
  – Use a B-tree as an index on that ordering

KD-tree
• A main memory data structure based on binary search trees
  – can be adapted to the block model of storage (KD-B-tree)
• Levels rotate among the dimensions, partitioning the space based on a value for that dimension
• A KD-tree is not necessarily balanced.

KD-Tree Example
[Figure: KD-tree and the corresponding partition of the plane, with splits labeled x = 3, x = 5, x = 7, x = 8 and y = 2, y = 5, y = 6.]

Adapting KD Tree to Block Model
• Similar to a B-tree, tree nodes split many ways instead of two ways
  – Risk:
    · insertion becomes quite complex and expensive.
    · No storage utilization guarantee, since when a higher-level node splits, the split has to be propagated all the way to the leaf level, resulting in many empty blocks.
• Pack many interior nodes (forming a subtree) into a block.
  – Risk:
    · it may not be feasible to group nodes at a lower level into a block productively.
    · Many interesting papers on how to optimally pack nodes into blocks have recently been published.

Quad Tree
• Nodes split along all dimensions simultaneously
• Division fixed: by quadrants
• As with the KD-tree, we cannot make quad tree levels uniform

Quad Tree Example
[Figure: quad tree example with quadrants labeled SW, NW, SE, NE and axis marks X = 3, X = 5, X = 7, X = 8.]

Grid Files
• Space partitioning strategy, but different from a tree.
• Select dividers along each dimension. Partition space into cells
• Unlike in a KD-tree, dividers cut all the way.
• Each cell corresponds to 1 disk page.
• Many cells can point to the same page.
• Cell directory potentially exponential in the number of dimensions

Space Filling Curve
• Assumption: finite precision in representing each coordinate.
[Figure: a 4 x 4 grid with both axes labeled 00, 01, 10, 11, containing point A, block B, and region C.]
    Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5
    Z(B) = 11 = 3 (common prefix to all its blocks)
    Z(C1) = 0010 = 2
    Z(C2) = 1000 = 8

Deriving Z-Values for a Region
• Obtain a quad-tree decomposition of an object by recursively dividing it into blocks until the blocks are homogeneous.
[Figure: quad-tree decomposition of an object over a grid with quadrants labeled 00, 01, 10, 11. The object's representation is 0001, 0011, 01.]

Generalized Search Trees
• Motivation:
  – Disparate applications require different data structures and access methods.
  – Each data structure requires separate code to be integrated with the database code
    · too much effort.
    · Vendors will not spend time and energy unless the application is very important or the data structure has general applicability.
• Generalized search trees abstract the notion of a data structure into a template.
  – Basic observation: most data structures are similar, and a lot of bookkeeping and implementation details are the same.
  – Different data structures can be seen as refinements of the basic GiST structure. Refinements are specified by registering a set of functions per data structure with the GiST.

GiST supports extensibility both in terms of data types and queries
• GiST is like a “template”: it defines its interface in terms of an ADT rather than physical elements (like nodes, pointers, etc.)
• The access method (AM) implementer can customize GiST by defining his or her own ADT class, i.e., once you define the ADT class, you have your access method implemented!
• No concern about search/insertion/deletion, or structural modifications like node splits, etc.

Query Processing in DBMSs
[Figure labels: SQL query (Select … From … Where ...); internal relational-algebra-based representation of the query.]
[Figure labels: Parsing and Translation; optimizer; Statistics about data; Evaluation engine; Optimized execution plan; Data and index; Query results (e.g., Sally 4000, Dick 9000, …).]

Query Optimization
• Goal: to find the cheapest evaluation strategy for a query
• Stages of optimization:
  – Algebraic manipulations: heuristics used to convert the query tree into an equivalent but more efficient representation.
    · perform selections and projections as early as possible.
    · combine selections with cartesian products to make a join
    · combine sequences of unary operations (selections and projections).
    · look for common subexpressions in an expression.
  – Cost-based analysis: given the optimized representation produced after algebraic manipulation:
    · generate all possible query plans and estimate their costs based on the statistical information and the costs of each unary and binary operation.
    · The best possible query plan is chosen as the execution strategy.
    · The number of plans considered, even after heuristics are applied, is exponential in the number of operators in the query tree. It is important to choose a good plan since the cost of generating the plan is amortized over multiple query executions.

Cost of Query Execution
• Access to disk: cost of reading, writing, searching data blocks. (I/O cost)
• Storage costs: cost of storing intermediate files generated during query execution. (I/O cost)
• Computation cost: cost of in-memory execution of operations. (CPU cost)
• Communication cost: cost of shipping the query and results from site to site, or to the terminal where the query originated. (communication cost)
• Total cost = I/O cost + w1 * CPU cost + w2 * Communication cost
• Traditionally I/O cost is considered most important

Transaction Management
Applications in databases are modeled as transactions, which provide ACID guarantees.
• Atomicity: either all the effects of a transaction appear in the database or none of them do.
• Consistency: each transaction maps the database from one consistent state to another consistent state
• Isolation: the concurrent execution of a transaction is hidden from other concurrently executing transactions
• Durability: if a transaction completes, its effects are permanent and survive failures.

Transaction Model
• Transactions provide a simple, powerful, and natural programming model for writing database applications.
• The transaction concept supports:
  – simple failure semantics: either all the effects of a transaction appear in the database or none do -- all or nothing
  – isolated view of the world: protection from partial effects of other concurrent applications.
• Transactions allow applications to share data without having to explicitly deal with either fault tolerance or synchronization
• Transactions are the enabling technology for large distributed applications.

Isolation
• Isolation is implemented by using the 2 phase locking protocol
• 2 Phase Locking Protocol:
  – Each transaction acquires a lock on a data item before accessing the data
  – Locks are released when a transaction commits
Example schedule (time runs downward):
    User 1 reads account = 1500
    User 2 reads account = 1500
    User 1 sets account value = 500 (withdraws 1000 dollars)
    User 2 sets account value = 700 (withdraws 800 dollars)
This execution will be prevented by 2 phase locking, since user 1’s transaction will not release the lock on account until user 1’s transaction terminates

Atomicity
• Atomicity is implemented by using a logging strategy.
• A transaction, before updating a data item, writes an undo log record, using which its effects can be undone.
• If a transaction aborts, the undo log records are used to reconstruct the database state before the transaction's execution
[Figure: normal processing: DO takes the old state to the new state and produces an undo log record. Transaction rollback (due to user-requested abort, system failure, or consistency violation): UNDO takes the new state and the undo log record back to the old state.]

Durability
• Durability is implemented using a logging strategy
• A transaction, before updating a data item, writes a redo log record, using which its effects can be redone
• If the system fails before a committed transaction’s effects appear in the database, its effects are redone using the redo log records on recovery.
[Figure: normal processing: DO takes the old state to the new state and produces a redo log record. Redo of a committed transaction: REDO takes the old state and the redo log record to the new state.]
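
The undo and redo logging described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the recovery algorithm of any real DBMS: the class `TinyLog`, the function `recover`, and the single-record format holding both old and new values are assumptions made for the example, and the "database" is just an in-memory dictionary.

```python
from typing import Dict, List, Tuple


class TinyLog:
    """Hypothetical log that keeps undo (old value) and redo (new value) information."""

    def __init__(self) -> None:
        # Each entry is ("update", txn, key, old, new) or ("commit", txn).
        self.records: List[Tuple] = []

    def update(self, db: Dict[str, int], txn: str, key: str, new: int) -> None:
        # Write the log record BEFORE changing the data item.
        self.records.append(("update", txn, key, db.get(key), new))
        db[key] = new

    def commit(self, txn: str) -> None:
        self.records.append(("commit", txn))


def recover(db: Dict[str, int], log: TinyLog) -> Dict[str, int]:
    """Bring a possibly stale or dirty database state back to a consistent one."""
    committed = {r[1] for r in log.records if r[0] == "commit"}

    # REDO: reapply updates of committed transactions in log order (durability).
    for r in log.records:
        if r[0] == "update" and r[1] in committed:
            db[r[2]] = r[4]

    # UNDO: roll back updates of uncommitted transactions in reverse order (atomicity).
    for r in reversed(log.records):
        if r[0] == "update" and r[1] not in committed:
            if r[3] is None:
                db.pop(r[2], None)
            else:
                db[r[2]] = r[3]
    return db


db = {"account": 1500}
log = TinyLog()
log.update(db, "T1", "account", 500)   # T1 withdraws 1000 dollars
log.commit("T1")
log.update(db, "T2", "account", 100)   # T2 updates but never commits

# After a crash, the on-disk state may hold none or all of the updates.
print(recover({"account": 1500}, log))
print(recover({"account": 100}, log))
```

In both crash scenarios recovery ends with T1's committed withdrawal applied and T2's uncommitted update removed. The sketch assumes transactions do not overwrite each other's uncommitted data, which is what the locking protocol from the Isolation slide is meant to guarantee.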
