Data Base Management System

Unit 1
Introduction
Topics to be covered – Database and DBMS: characteristics – importance – advantages – evolution – Codd rules – database architecture: data organization – file structure and indexing.

Table of Contents
Unit 1
Introduction
1.1 Database and DBMS
1.2 Characteristics of DBMS
1.3 Importance of DBMS
1.4 Advantages of DBMS
1.5 Evolution of DBMS
1.6 Codd Rules
1.7 Database Architecture
1.8 Data organization
1.9 File Structure and Indexing
1.9.1 File structuring
1.9.2 Indexing
1.10 Reference

1.1 Database and DBMS
File – a two-dimensional table summarizing the multiple instances of a set of fields of an entity.
Database – a collection of interrelated files. A database management system is a collection of databases, database utilities and a data dictionary/directory, operated by user groups / application developers and administered by a database administrator. A database system is essentially a database management system which is free from all the drawbacks of the conventional file processing system. In the database system, data is independent of programs.
Example files – Room, Items, Patients, Physician, Charges, In-patient treatment, Outpatient treatment.
A database holds 'operational data', which is different from input data and output data. A database is a collection of stored operational data used by the application systems of some particular organization. Input data is data that comes into the system from the outside world, e.g. from terminals.
1.2 Characteristics of DBMS
(i) Shared – Data in a database are shared among different users and applications.
(ii) Persistence – Data in a database exist permanently, in the sense that the data can live beyond the scope of the process that created it.
(iii) Validity / Integrity / Correctness – Data should be correct with respect to the real-world entity that they represent.
(iv) Security – Data should be protected from unauthorized access.
(v) Consistency – Whenever more than one data element in a database represents related real-world values, the values should be consistent with respect to the relationship.
(vi) Independence – The three levels of the schema (internal, conceptual and external) should be independent of each other, so that changes in the schema at one level do not affect the other levels.

1.3 Importance of DBMS
o It helps make data management more efficient and effective.
o Its query language allows quick answers to ad-hoc queries.
o It provides end users better access to more and better-managed data.
o It provides an integrated view of the organization's operations – the "big picture".
o It reduces the probability of inconsistent data.

1.4 Advantages of DBMS
(i) Redundancy can be reduced
In non-database systems, each application or department has its own private files, resulting in a considerable amount of redundancy in the stored data. Thus storage space is wasted. By keeping a centralized database, most of this redundancy can be eliminated. Sometimes there are sound business and technical reasons for maintaining multiple copies of the same data.
(ii) Inconsistency can be avoided
When the same data is duplicated and changes are made at one site that are not propagated to the other sites, it gives rise to inconsistency: the two copies of the same data will then not agree. At such times the data is said to be inconsistent. If redundancy is removed, the chance of having inconsistent data is also removed.
(iii) Data can be shared
The existing applications can share the data in a database.
(iv) Standards can be enforced
With central control of the database, the database administrator can enforce standards.
(v) Security restrictions can be applied
Having complete authority over the operational data enables the database administrator to ensure that the only means of access to the database is through proper channels. The DBA can define authorization checks to be carried out whenever access to sensitive data is attempted.
(vi) Integrity can be maintained
Integrity means that the data in the database is accurate. Centralized control of the data permits the administrator to define integrity constraints on the data in the database.
(vii) Conflicting requirements can be balanced
Knowing the overall requirements helps the database designers in creating a database design that is best for the organization.

1.5 Evolution of DBMS
(i) Late 1960s and 1970s: Widespread use of hard disks in the late 1960s changed the scenario for data processing greatly, since hard disks allowed direct access to data. The position of data on disk was immaterial, since any location on disk could be accessed in just tens of milliseconds. With disks, network and hierarchical databases could be created that allowed data structures such as lists and trees to be stored on disk. Programmers could construct and manipulate these data structures. Codd [1970] defined the relational model and nonprocedural ways of querying data in the relational model, and relational databases were born.
The simplicity of the relational model and the possibility of hiding implementation details completely from the programmer were enticing indeed. Codd later won the prestigious Association for Computing Machinery Turing Award for his work.
(ii) 1980s: Although academically interesting, the relational model was not used in practice initially, because of its perceived performance disadvantages: relational databases could not match the performance of existing network and hierarchical databases. That changed with System R, a groundbreaking project at IBM Research that developed techniques for the construction of an efficient relational database system. The fully functional System R prototype led to IBM's first relational database product, SQL/DS. Initial commercial relational database systems, such as IBM DB2, Oracle, Ingres, and DEC Rdb, played a major role in advancing techniques for efficient processing of declarative queries. By the early 1980s, relational databases had become competitive with network and hierarchical database systems even in the area of performance. Relational databases were so easy to use that they eventually replaced network/hierarchical databases; programmers using those earlier systems were forced to deal with many low-level implementation details and had to code their queries in a procedural fashion. Most importantly, they had to keep efficiency in mind when designing their programs, which involved a lot of effort. In contrast, in a relational database, almost all these low-level tasks are carried out automatically by the database, leaving the programmer free to work at a logical level. Since attaining dominance in the 1980s, the relational model has reigned supreme among data models. The 1980s also saw much research on parallel and distributed databases, as well as initial work on object-oriented databases.
(iii) Early 1990s: The SQL language was designed primarily for decision support applications, which are query intensive, yet the mainstay of databases in the 1980s was transaction processing applications, which are update intensive. Decision support and querying re-emerged as a major application area for databases. Tools for analyzing large amounts of data saw large growth in usage. Many database vendors introduced parallel database products in this period. Database vendors also began to add object-relational support to their databases.
(iv) Late 1990s: The major event was the explosive growth of the World Wide Web. Databases were deployed much more extensively than ever before. Database systems now had to support very high transaction processing rates, as well as very high reliability and 24×7 availability. Database systems also had to support Web interfaces to data.

1.6 Codd Rules
Codd's 12 rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer of the relational model for databases, designed to define what is required from a database management system in order for it to be considered relational, i.e., an RDBMS. Codd produced these rules as part of a personal campaign to prevent his vision of the relational database being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer. Rule 12 was particularly designed to counter such a positioning. In fact, the rules are so strict that all popular so-called "relational" DBMSs fail on many of the criteria.
The rules
(i) Rule 0: The system must qualify as relational, as a database, and as a management system. For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.
(ii) Rule 1: The information rule: All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.
(iii) Rule 2: The guaranteed access rule: All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row (a small sketch of this addressing scheme follows the list below).
(iv) Rule 3: Systematic treatment of null values: The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
(v) Rule 4: Active online catalog based on the relational model: The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.
(vi) Rule 5: The comprehensive data sublanguage rule: The system must support at least one relational language that can be used both interactively and within application programs, and that supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).
(vii) Rule 6: The view updating rule: All views that are theoretically updatable must be updatable by the system.
(viii) Rule 7: High-level insert, update, and delete: The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
(ix) Rule 8: Physical data independence: Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
(x) Rule 9: Logical data independence: Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
(xi) Rule 10: Integrity independence: Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.
(xii) Rule 11: Distribution independence: The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully when a distributed version of the DBMS is first introduced, and when existing distributed data are redistributed around the system.
(xiii) Rule 12: The non-subversion rule: If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.
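The guaranteed access rule (Rule 2) can be pictured with a small Python sketch. The table names, keys and values below are hypothetical and are only meant to show that the triple (table name, column name, primary key value) is enough to address any single value; this is an illustration, not part of Codd's formulation.

    # Sketch of Codd's guaranteed access rule (Rule 2).
    # Each table is a dictionary keyed by primary key; each row is a
    # dictionary keyed by column name. Table name, column name and
    # primary-key value together address every scalar value.
    database = {
        "PATIENT": {                                    # table name
            "P001": {"NAME": "Anand", "WARD": "W3"},    # primary key -> row
            "P002": {"NAME": "Beulah", "WARD": "W1"},
        },
        "PHYSICIAN": {
            "D010": {"NAME": "Dr. Kumar", "DEPT": "Cardiology"},
        },
    }

    def lookup(table_name, column_name, primary_key):
        """Return the single scalar value addressed by (table, column, key)."""
        return database[table_name][primary_key][column_name]

    print(lookup("PATIENT", "WARD", "P002"))      # -> W1
    print(lookup("PHYSICIAN", "NAME", "D010"))    # -> Dr. Kumar

No navigation through pointers or record positions is needed: the three names alone identify the value, which is the essence of the rule.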
1.7 Database Architecture
The goal of the three-schema architecture is to separate the user applications and the physical database. In this architecture, schemas can be defined at the following three levels:
1. The internal level has an internal schema, which describes the physical storage structure of the database. The internal schema uses a physical data model and describes the complete details of data storage and access paths for the database.
2. The conceptual level has a conceptual schema, which describes the structure of the whole database for a community of users. The conceptual schema hides the details of physical storage structures and concentrates on describing entities, data types, relationships, user operations, and constraints. A high-level data model or an implementation data model can be used at this level.
3. The external or view level includes a number of external schemas or user views. Each external schema describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group. A high-level data model or an implementation data model can be used at this level.

1.8 Data organization
The choice of the file organization and the access methods used depend on five characteristics of the data:
o File volatility
o File activity
o File query needs
o File size
o Data currency
(i) File Volatility
When data are frequently added to or deleted from a file, the file is said to have high volatility. Airline and railway files are volatile files, as reservation transactions occur at the rate of hundreds or thousands per minute. An employee master file for a company with a low employee turnover, on the other hand, would have very low volatility. When volatility is low, sequential or indexed sequential forms of file organization and access work well, especially when few queries are made against the file.
(ii) File Activity
The percentage of records in a file that is actually accessed during any one run is the file activity rate. In applications such as bank teller or hotel reservation systems, each transaction must be processed immediately. In other applications, such as sending out invoices at the end of the month, transactions can be batched and then processed as a group. When the file activity rate is around 60 percent or higher – meaning that 60 percent of the records in the file may be accessed at any one time – sequential techniques are often considered more efficient.
(iii) File Query
When information must be retrieved very quickly, some form of direct organization must be used. Railway reservation systems, inventory systems, and automatic teller machine systems all fall into this category.
(iv) File Size
When the records in a large file must be accessed immediately, direct organization must be used. But if the size of the file is small – under about 100 KB – sequential file organization can be used. Files of this size can be read in their entirety into a computer's main memory.
(v) Data Currency
Data currency refers to the timeliness of data.
If the data needs to be up to the minute, then direct organization and processing will be required. Stock quote systems, airline reservation systems, and on-line shopping systems all depend on timely data and therefore depend on direct systems.

1.9 File Structure and Indexing
1.9.1 File structuring
Magnetic disks are now the first choice as secondary storage medium because of their data access speeds and the decreasing cost of storage. The magnetic material is in the form of a circular disk, and information is stored on the disk surface in concentric circles called tracks. In the case of disk packs (a number of disks mounted together), the tracks with the same diameter on the various surfaces form a cylinder. Cylinders are important because data stored on the same cylinder can be accessed much faster than data distributed among different cylinders. A track is divided into smaller units called sectors. Sectors are further divided into blocks by the operating system during disk formatting / installation. The block size is fixed during formatting and cannot be changed until the disk is formatted again. Block sizes range from 512 bytes to 4096 bytes. Blocks are separated by inter-block gaps.
Transfer of data between main memory and disk takes place in units of blocks. The address of a block is supplied to the disk's input/output hardware, together with the address of the area in main memory reserved to hold the contents of the block (called the buffer). When a read command is issued, the contents of the block are copied from the disk to the buffer; when a write command is issued, the contents of the buffer are copied to the disk. Sometimes the buffer in main memory is made large enough to accommodate several contiguous blocks (a cluster), which improves access speed. Access time consists of four factors: seek time, head switching time, rotational delay and data transfer time. The total time needed to locate and transfer data from a block is the sum of these four factors; a small worked example is given at the end of this subsection.
Record types: Data is stored as records, and each record is a collection of values; each value is formed by one or more bytes and corresponds to a specific field in the record. For example, a Book record represents a book entity, and each field value of the record specifies an attribute of that book entity, such as ISBN, AUTHOR, PUBLISHER, PRICE and so on.
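The following short calculation illustrates the access-time formula just described. The timings are assumed, illustrative values, not figures from the text.

    # Sketch: total time to locate and transfer one block from disk.
    # All timings below are assumed example values in milliseconds.
    seek_time_ms        = 9.0    # move the read/write head to the right cylinder
    head_switch_time_ms = 0.5    # activate the head for the right surface
    rotational_delay_ms = 4.2    # wait for the block to rotate under the head
    transfer_time_ms    = 0.1    # copy one block (e.g. 4096 bytes) into the buffer

    total_ms = (seek_time_ms + head_switch_time_ms
                + rotational_delay_ms + transfer_time_ms)
    print(f"Time to read one block: {total_ms:.1f} ms")   # -> 13.8 ms

Because the seek is by far the largest component, reading several blocks of the same cylinder in one go (clustering) is much cheaper than reading blocks scattered over different cylinders.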
1.9.2 Indexing
Database system indices play the same role as book indices or card catalogs in libraries. For example, to retrieve an account record given the account number, the database system would look up an index to find on which disk block the corresponding record resides, and then fetch the disk block, to get the account record. Keeping a sorted list of account numbers would not work well on very large databases with millions of accounts, since the index would itself be very big; further, even though keeping the index sorted reduces the search time, finding an account can still be rather time-consuming. Instead, more sophisticated indexing techniques may be used. There are two basic kinds of indices:
(1) Ordered indices. Based on a sorted ordering of the values.
(2) Hash indices. Based on a uniform distribution of values across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function.
We shall consider several techniques for both ordered indexing and hashing. No one technique is the best. Rather, each technique is best suited to particular database applications. Each technique must be evaluated on the basis of these factors:
(i) Access types: The types of access that are supported efficiently. Access types can include finding records with a specified attribute value and finding records whose attribute values fall in a specified range.
(ii) Access time: The time it takes to find a particular data item, or set of items, using the technique in question.
(iii) Insertion time: The time it takes to insert a new data item. This value includes the time it takes to find the correct place to insert the new data item, as well as the time it takes to update the index structure.
(iv) Deletion time: The time it takes to delete a data item. This value includes the time it takes to find the item to be deleted, as well as the time it takes to update the index structure.
(v) Space overhead: The additional space occupied by an index structure. Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance.

(1) Ordered Indices
To gain fast random access to records in a file, we can use an index structure. Each index structure is associated with a particular search key. Just like the index of a book or a library catalog, an ordered index stores the values of the search keys in sorted order, and associates with each search key the records that contain it. The records in the indexed file may themselves be stored in some sorted order, just as books in a library are stored according to some attribute such as the Dewey decimal number. A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a primary index is an index whose search key also defines the sequential order of the file. Primary indices are also called clustering indices. The search key of a primary index is usually the primary key, although that is not necessarily so. Indices whose search key specifies an order different from the sequential order of the file are called secondary indices, or non-clustering indices.
Primary Index
In this section, we assume that all files are ordered sequentially on some search key. Such files, with a primary index on the search key, are called index-sequential files. They represent one of the oldest index schemes used in database systems. They are designed for applications that require both sequential processing of the entire file and random access to individual records.
Dense and Sparse Indices
An index record, or index entry, consists of a search-key value, and pointers to one or more records with that value as their search-key value. The pointer to a record consists of the identifier of a disk block and an offset within the disk block to identify the record within the block. There are two types of ordered indices that we can use:
o Dense index: An index record appears for every search-key value in the file. In a dense primary index, the index record contains the search-key value and a pointer to the first data record with that search-key value. The rest of the records with the same search-key value would be stored sequentially after the first record, since, because the index is a primary one, records are sorted on the same search key. Dense index implementations may store a list of pointers to all records with the same search-key value; doing so is not essential for primary indices.
o Sparse index: An index record appears for only some of the search-key values. As is true in dense indices, each index record contains a search-key value and a pointer to the first data record with that search-key value. To locate a record, we find the index entry with the largest search-key value that is less than or equal to the search-key value for which we are looking. We start at the record pointed to by that index entry, and follow the pointers in the file until we find the desired record.
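The difference between the two lookup procedures can be illustrated with a short Python sketch. The account numbers, balances and block layout are hypothetical, and in-memory lists stand in for disk blocks; this is only an illustration of the idea under those assumptions.

    from bisect import bisect_right

    # Data file: records sorted on the search key (account number),
    # stored in fixed-size "blocks". All values are illustrative only.
    blocks = [
        [("A-101", 500), ("A-110", 600)],      # block 0
        [("A-215", 700), ("A-217", 750)],      # block 1
        [("A-305", 350), ("A-307", 900)],      # block 2
    ]

    # Dense index: one entry per search-key value -> (block number, offset).
    dense_index = {key: (b, i)
                   for b, block in enumerate(blocks)
                   for i, (key, _) in enumerate(block)}

    # Sparse index: one entry per block, holding the block's first key.
    sparse_keys = [block[0][0] for block in blocks]   # ["A-101", "A-215", "A-305"]

    def dense_lookup(key):
        b, i = dense_index[key]            # jump directly to the record
        return blocks[b][i]

    def sparse_lookup(key):
        # Largest indexed key <= key identifies the block; then scan the block.
        b = bisect_right(sparse_keys, key) - 1
        for record_key, value in blocks[b]:
            if record_key == key:
                return (record_key, value)
        return None

    print(dense_lookup("A-217"))    # -> ('A-217', 750)
    print(sparse_lookup("A-307"))   # -> ('A-307', 900)

Note how the dense index jumps straight to the record while the sparse index only finds the right block and then scans within it; the sparse index is smaller, which is exactly the space/time trade-off discussed above.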
[Figures: Dense index; Sparse index]
Index Update
Regardless of what form of index is used, every index must be updated whenever a record is either inserted into or deleted from the file. We first describe algorithms for updating single-level indices.
o Insertion. First, the system performs a lookup using the search-key value that appears in the record to be inserted. The actions the system takes next depend on whether the index is dense or sparse:
Dense indices:
1. If the search-key value does not appear in the index, the system inserts an index record with the search-key value in the index at the appropriate position.
2. Otherwise the following actions are taken:
a. If the index record stores pointers to all records with the same search-key value, the system adds a pointer to the new record to the index record.
b. Otherwise, the index record stores a pointer to only the first record with the search-key value. The system then places the record being inserted after the other records with the same search-key values.
Sparse indices:
o We assume that the index stores an entry for each block.
o If the system creates a new block, it inserts the first search-key value (in search-key order) appearing in the new block into the index.
o On the other hand, if the new record has the least search-key value in its block, the system updates the index entry pointing to the block; if not, the system makes no change to the index.
o Deletion. To delete a record, the system first looks up the record to be deleted. The actions the system takes next depend on whether the index is dense or sparse:
Dense indices:
1. If the deleted record was the only record with its particular search-key value, then the system deletes the corresponding index record from the index.
2. Otherwise the following actions are taken:
a. If the index record stores pointers to all records with the same search-key value, the system deletes the pointer to the deleted record from the index record.
b. Otherwise, the index record stores a pointer to only the first record with the search-key value. In this case, if the deleted record was the first record with the search-key value, the system updates the index record to point to the next record.
Sparse indices:
1. If the index does not contain an index record with the search-key value of the deleted record, nothing needs to be done to the index.
2. Otherwise the system takes the following actions:
a. If the deleted record was the only record with its search key, the system replaces the corresponding index record with an index record for the next search-key value (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
b. Otherwise, if the index record for the search-key value points to the record being deleted, the system updates the index record to point to the next record with the same search-key value.
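The sparse-index insertion rules above can be sketched in Python. The block capacity, the key values and the split policy below are assumptions made purely for illustration; a real system operates on disk blocks rather than in-memory lists.

    BLOCK_CAPACITY = 2    # records per block (illustrative)

    blocks = [["A-101"], ["A-215", "A-217"]]   # sorted data file
    sparse_index = ["A-101", "A-215"]          # first key of each block

    def insert(key):
        """Insert a key, keeping the file sorted and one index entry per block."""
        # Find the block whose first indexed key is the largest one <= key.
        b = 0
        while b + 1 < len(sparse_index) and sparse_index[b + 1] <= key:
            b += 1
        blocks[b].append(key)
        blocks[b].sort()
        if len(blocks[b]) > BLOCK_CAPACITY:
            # Block overflows: a new block is created, so its first
            # search-key value is inserted into the index.
            half = len(blocks[b]) // 2
            new_block = blocks[b][half:]
            blocks[b] = blocks[b][:half]
            blocks.insert(b + 1, new_block)
            sparse_index.insert(b + 1, new_block[0])
        # If the new record now has the least key in its block,
        # the existing index entry for that block is updated.
        sparse_index[b] = blocks[b][0]

    insert("A-050")   # least key in block 0 -> index entry updated
    insert("A-120")   # block 0 overflows -> new block, new index entry
    print(blocks)        # [['A-050'], ['A-101', 'A-120'], ['A-215', 'A-217']]
    print(sparse_index)  # ['A-050', 'A-101', 'A-215']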
Secondary Indices
Secondary indices must be dense, with an index entry for every search-key value, and a pointer to every record in the file. A primary index may be sparse, storing only some of the search-key values, since it is always possible to find records with intermediate search-key values by a sequential access to a part of the file, as described earlier. If a secondary index stores only some of the search-key values, records with intermediate search-key values may be anywhere in the file and, in general, we cannot find them without searching the entire file.
A secondary index on a candidate key looks just like a dense primary index, except that the records pointed to by successive values in the index are not stored sequentially. In general, however, secondary indices may have a different structure from primary indices. If the search key of a primary index is not a candidate key, it suffices if the index points to the first record with a particular value for the search key, since the other records can be fetched by a sequential scan of the file. In contrast, if the search key of a secondary index is not a candidate key, it is not enough to point to just the first record with each search-key value. The remaining records with the same search-key value could be anywhere in the file, since the records are ordered by the search key of the primary index, rather than by the search key of the secondary index. Therefore, a secondary index must contain pointers to all the records.

(2) Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation. A hash index organizes the search keys, with their associated pointers, into a hash file structure. We construct a hash index as follows:
o We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets).
o For example, a secondary hash index can be built on the account file, with account-number as the search key.
We use the term hash index to denote hash file structures as well as secondary hash indices. Strictly speaking, hash indices are only secondary index structures. A hash index is never needed as a primary index structure, since, if a file itself is organized by hashing, there is no need for a separate hash index structure on it. However, since hash file organization provides the same direct access to records that indexing provides, we pretend that a file organized by hashing also has a primary hash index on it.
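A hash index of this kind, with a fixed number of buckets (static hashing), can be sketched in Python as follows. The number of buckets, the hash function and the account records are illustrative assumptions, not details taken from the text.

    NUM_BUCKETS = 4   # fixed set of bucket addresses (the static-hashing assumption)

    # Data file: records identified by a (block, offset) "pointer".
    # All values are illustrative only.
    records = {(0, 0): ("A-101", 500), (0, 1): ("A-215", 700),
               (1, 0): ("A-110", 600), (1, 1): ("A-307", 900)}

    def bucket_of(search_key):
        """Hash function: map a search-key value to one of NUM_BUCKETS buckets."""
        return hash(search_key) % NUM_BUCKETS

    # Build the hash index: each bucket holds (search-key, pointer) pairs.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for pointer, (account_number, balance) in records.items():
        buckets[bucket_of(account_number)].append((account_number, pointer))

    def lookup(account_number):
        """Return all records whose search key equals account_number."""
        return [records[ptr]
                for key, ptr in buckets[bucket_of(account_number)]
                if key == account_number]

    print(lookup("A-307"))   # -> [('A-307', 900)]

The fixed NUM_BUCKETS is what makes this scheme static: as the file grows, buckets fill up and overflow, which motivates the dynamic hashing techniques described next.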
Dynamic Hashing
o As we have seen, the need to fix the set B of bucket addresses presents a serious problem with the static hashing technique described above.
o Most databases grow larger over time. If we are to use static hashing for such a database, we have three classes of options:
1. Choose a hash function based on the current file size. This option will result in performance degradation as the database grows.
2. Choose a hash function based on the anticipated size of the file at some point in the future. Although performance degradation is avoided, a significant amount of space may be wasted initially.
3. Periodically reorganize the hash structure in response to file growth. Such a reorganization involves choosing a new hash function, recomputing the hash function on every record in the file, and generating new bucket assignments. This reorganization is a massive, time-consuming operation. Furthermore, it is necessary to forbid access to the file during reorganization.
Several dynamic hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinkage of the database. We describe one form of dynamic hashing, called extendable hashing.

1.10 Reference
1. Database Management System – Alexis Leon and Mathews Leon
2. Database Management System – R. Panneerselvam
3. Database Management System – Rajesh Narang