INFORMATION MANAGEMENT
UNIT III DATABASE MANAGEMENT SYSTEMS
DBMS – HDBMS, NDBMS, RDBMS, OODBMS, Query Processing, SQL, Concurrency Management, Data Warehousing and Data Marts

Contents
3.1. Database Management System
3.2. HDBMS: Hierarchical Database Management System
3.3. NDBMS: Network Database Management System
3.4. RDBMS
3.5. OODBMS
3.6. Query Language and SQL
3.7. Concurrency Control
3.8. Data Warehouse
3.9. What Is a Data Mart?

3.1. Introduction
A database is a collection of related data. A database management system (DBMS) is software designed to assist in the maintenance and utilization of large collections of data. The first general-purpose DBMS, the Integrated Data Store (IDS), was designed by Charles Bachman in the early 1960s. In the late 1960s IBM introduced IMS (Information Management System). In 1970 Edgar F. Codd at IBM proposed the relational model, the foundation of the RDBMS. In the 1980s SQL (Structured Query Language) became the standard relational query language, and from the 1980s through the 1990s commercial systems such as DB2 and Oracle advanced the field further.

3.1.1 In general, data management consists of the following tasks
Data capture: the task of gathering data as and when it originates.
Data classification: captured data has to be classified based on its nature and intended usage.
Data storage: the classified data has to be stored properly.
Data arranging: it is very important to arrange the data properly.
Data retrieval: data will be required frequently for further processing; hence it is important to create indexes so that data can be retrieved easily.
Data maintenance: the task concerned with keeping the data up to date.
Data verification: before the data is stored, it must be verified for errors.
Data coding: data will be coded for easy reference.
Data editing: editing means re-arranging or modifying the data for presentation.
Data transcription: the activity in which data is converted from one form into another.
Data transmission: the function by which data is forwarded to the place where it will be used further.

3.1.2 Database
A database may be defined in simple terms as a collection of related data. The database can be of any size and of varying complexity.
A database may be generated and maintained manually, or it may be computerized.

3.1.3 Database Management System
A Database Management System (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing and manipulating databases for various applications. With a DBMS the user is not required to write the procedures for managing the data: the DBMS provides an abstract view of the data that hides storage details, it is efficient because it uses a wide variety of sophisticated techniques to store and retrieve data, it takes care of concurrent access using some form of locking, it has a crash-recovery mechanism that protects users from the effects of system failures, and it has a good protection (security) mechanism.

3.1.4 Characteristics of DBMS
To incorporate the requirements of the organization, the system should be designed for easy maintenance.
Information systems should allow interactive access to data to obtain new information without writing fresh programs.
The system should be designed to correlate different data to meet new requirements.
An independent central repository, which gives the information and meaning of the available data, is required.
An integrated database helps in understanding the inter-relationships between data stored in different applications.
The stored data should be made available for access by different users simultaneously.
An automatic recovery feature has to be provided to overcome the problems caused by processing-system failures.

3.1.5 Advantages of DBMS
Due to its centralized nature, a database system can overcome the disadvantages of a file-based system.
1. Data independence: application programs should not be exposed to the details of data representation and storage; the DBMS provides an abstract view that hides these details.
2. Efficient data access: the DBMS utilizes a variety of sophisticated techniques to store and retrieve data efficiently.
3. Data integrity and security: because data is accessed through the DBMS, it can enforce integrity constraints (e.g., checks when inserting salary information for an employee).
4. Data administration: when users share data, centralizing its administration is important; experienced professionals can minimize data redundancy and perform fine tuning, which reduces retrieval time.
5. Concurrent access and crash recovery: the DBMS schedules concurrent access to the data and protects users from the effects of system failures.
6. Reduced application development time: the DBMS supports important functions that are common to many applications.

Defining a database involves specifying the data types, constraints and structures of the data to be stored. This descriptive information is itself stored in the database in the form of a database catalog or dictionary; it is called meta-data. Constructing the database is the process of storing the data on a storage medium controlled by the DBMS. Manipulating the data includes querying the database to retrieve specific data and updating it; an application program accesses the database by sending queries or requests for data to the DBMS. Other important functions provided by the DBMS include protecting the database, maintaining it, and sharing it among various users and applications.
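As a small illustration of these three functions, the sketch below shows a database being defined with CREATE TABLE, constructed (loaded) with INSERT, and manipulated with a SELECT query. The STUDENT table and its columns are only an example invented for this sketch, not part of any particular system.

CREATE TABLE STUDENT (
   STUDENT_ID   INT          NOT NULL,
   NAME         VARCHAR(30)  NOT NULL,
   DEPARTMENT   VARCHAR(20),
   PRIMARY KEY (STUDENT_ID)
);                                          -- defining the database

INSERT INTO STUDENT (STUDENT_ID, NAME, DEPARTMENT)
VALUES (1, 'Asha', 'Physics');              -- constructing (loading data)

SELECT NAME, DEPARTMENT
FROM STUDENT
WHERE DEPARTMENT = 'Physics';               -- manipulating (querying)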
3.1.6 Example of a Database (with a Conceptual Data Model)
Mini-world for the example: part of a UNIVERSITY environment.
Some mini-world entities:
STUDENTs
COURSEs
SECTIONs (of COURSEs)
(academic) DEPARTMENTs
INSTRUCTORs
Some mini-world relationships:
SECTIONs are of COURSEs
STUDENTs take SECTIONs
COURSEs have prerequisite COURSEs
INSTRUCTORs teach SECTIONs
COURSEs are offered by DEPARTMENTs
STUDENTs major in DEPARTMENTs
[Figure: example of a simple database]
[Figure: example of a student file]

3.1.7 Architecture of DBMS
A commonly used approach to views of data is the three-level architecture suggested by ANSI/SPARC (American National Standards Institute / Standards Planning and Requirements Committee). ANSI/SPARC produced an interim report in 1972 followed by a final report in 1977. The reports proposed an architectural framework for databases. Under this approach, a database is considered as containing data about an enterprise. The three levels of the architecture are three different views of the data:
External level
Conceptual level
Internal level
The three-level database architecture allows a clear separation of the information meaning (conceptual view) from the external data representation and from the physical data-structure layout. A database system that is able to separate the three different views of data is likely to be flexible and adaptable. This flexibility and adaptability is data independence. We now briefly discuss the three different views.
(A) External Level
The external level is the view that the individual user of the database has. This view is often a restricted view of the database, and the same database may provide a number of different views for different classes of users. In general, the end users and even the application programmers are only interested in a subset of the database. For example, a department head may only be interested in the departmental finances and student enrolments but not the library information. The librarian would not be expected to have any interest in the information about academic staff. The payroll office would have no interest in student enrolments.
(B) Conceptual View
The conceptual view is the information model of the enterprise and contains the view of the whole enterprise without any concern for the physical implementation. This view is normally more stable than the other two views. In a database, it may be desirable to change the internal view to improve performance while there has been no change in the conceptual view of the database. The conceptual view is the overall community view of the database and it includes all the information that is going to be represented in the database. The conceptual view is defined by the conceptual schema, which includes definitions of each of the various types of data.
(C) Internal View
The internal view is the view of the actual physical storage of data. It tells us what data is stored in the database and how. At least the following aspects are considered at this level:
Storage allocation
Access paths
Miscellaneous
Efficiency considerations are the most important at this level, and the data structures are chosen to provide an efficient database. The internal view does not deal with the physical devices directly. Instead it views a physical device as a collection of physical pages and allocates space in terms of logical pages.
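In a relational DBMS, an external view is typically realized with the SQL view mechanism described later in this unit. The sketch below is only illustrative, and the STAFF table and its columns are hypothetical: the base table belongs to the conceptual schema, while the payroll office is given a restricted external view of it.

CREATE TABLE STAFF (
   STAFF_NO       INT          NOT NULL,
   NAME           VARCHAR(30)  NOT NULL,
   ADDRESS        VARCHAR(50),
   SALARY_LEVEL   INT,
   QUALIFICATION  VARCHAR(30),
   PRIMARY KEY (STAFF_NO)
);

-- External view for the payroll office: only the salary-related columns are visible
CREATE VIEW PAYROLL_STAFF AS
SELECT STAFF_NO, NAME, ADDRESS, SALARY_LEVEL
FROM STAFF;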
The separation of the conceptual view from the internal view enables us to provide a logical description of the database without the need to specify physical structures. This is often called physical data independence. Separating the external views from the conceptual view enables us to change the conceptual view without affecting the external views. This separation is sometimes called logical data independence. Assuming the three-level view of the database, a number of mappings are needed to enable users to work with one of the external views. For example, the payroll office may have an external view of the database that consists of the following information only:
Staff number, name and address.
Staff tax information, e.g. number of dependents.
Staff bank information, where salary is deposited.
Staff employment status, salary level, leave information, etc.
The conceptual view of the database may contain academic staff, general staff, casual staff, etc. A mapping will need to be created in which all the staff in the different categories are combined into one category for the payroll office. The conceptual view would include information about each staff member's position, the date employment started, full-time or part-time status, etc. This will need to be mapped to the salary level for the payroll office. Also, if there is some change in the conceptual view, the external view can stay the same if the mapping is changed.

3.2. HDBMS: Hierarchical Database Management System
The hierarchical structure was used in early mainframe DBMSs. Records' relationships form a tree-like model. This structure is simple but inflexible because the relationship is confined to a one-to-many relationship. The IBM Information Management System (IMS) and RDM Mobile are examples of hierarchical database systems with multiple hierarchies over the same data. RDM Mobile is a newly designed embedded database for mobile computer systems. A hierarchical database model is a data model in which the data is organized into a tree-like structure. The structure represents information using parent/child relationships: each parent can have many children, but each child has only one parent (also known as a 1-to-many relationship). All attributes of a specific record are listed under an entity type.
Data is represented in a hierarchical structure, or upside-down tree. In a hierarchical model, data is accessed by following the arrows, or path, beginning at the leftmost segment. This path is known as the hierarchical path, or the preorder traversal. For example, consider the following sample data.
[Figure: sample hierarchy of Teacher, Subject and Offering records]
In order to access the "Offering 4" data, the hierarchical path, beginning from the left, would be:
Teacher 1 > Subject 1 > Offering 1 > Offering 2 > Teacher 2 > Subject 2 > Offering 3 > Subject 3 > Offering 4
3.2.1 Description of Hierarchical Database
In the hierarchical structure, data is represented by a simple tree structure. The record type at the top of the tree is usually known as the "root." The simplest hierarchical structure consists of a root and a single dependent record type. In general, the root may have any number of dependent records, each of which may have any number of lower-level dependents, and so on, to any number of levels. The hierarchical view contains records of different types connected by links. Hierarchical relationships of records are explicitly defined in the data structure. A parent record can have many child records, but a child record can have only one parent. There are no many-to-many relationships between records. No dependent record within a hierarchical data structure can exist without its parent record. For this reason, records must be seen in context. A relational sketch of this parent/child rule is given below.
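Although a hierarchical DBMS stores the tree structure natively, the one-parent rule can also be sketched in relational terms: each record carries the key of its single parent, and the tree can be walked top-down with a recursive query. The RECORD_NODE table below is purely hypothetical, and WITH RECURSIVE is standard SQL that many, but not all, relational products support.

CREATE TABLE RECORD_NODE (
   NODE_ID    INT          NOT NULL,
   NODE_NAME  VARCHAR(30)  NOT NULL,
   PARENT_ID  INT,                               -- NULL only for the root record
   PRIMARY KEY (NODE_ID),
   FOREIGN KEY (PARENT_ID) REFERENCES RECORD_NODE (NODE_ID)
);

-- Walk the tree from the root downwards (one row per reachable record)
WITH RECURSIVE TREE (NODE_ID, NODE_NAME, DEPTH) AS (
   SELECT NODE_ID, NODE_NAME, 0
   FROM RECORD_NODE
   WHERE PARENT_ID IS NULL
   UNION ALL
   SELECT C.NODE_ID, C.NODE_NAME, T.DEPTH + 1
   FROM RECORD_NODE C
   JOIN TREE T ON C.PARENT_ID = T.NODE_ID
)
SELECT NODE_ID, NODE_NAME, DEPTH FROM TREE;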
Strengths of Hierarchical Databases
The advantages of a hierarchical database are:
o Efficient representation of hierarchical structures,
o Efficient single-key search and access time (if the hierarchical structure corresponds to application views of the data),
o Fast update performance where locality of reference exists (locality of reference means that performance is significantly enhanced when the processing is close to the data being processed).
Weaknesses of Hierarchical Databases
The disadvantages of a hierarchical database are:
o Lack of flexibility (non-hierarchical relationships are awkward to represent; redundancy may be required),
o Poor performance for non-hierarchical accesses,
o Lack of maintainability (changing relationships may require physical reorganization of the data).

3.3. NDBMS: Network Database Management System
A DBMS is said to be a network DBMS when it organizes the data in a network structure. A network may have as many connections as needed. In DBMS terms, we can say that a parent can have many children and a child can have more than one parent, so a network DBMS supports many-to-many relationships. There are some differences between a hierarchical DBMS and a network DBMS. In a hierarchical DBMS a child can have only one parent, but in a network DBMS it can have more than one. Unlike the hierarchical model, a network DBMS does not necessarily follow a strictly downward tree structure. The network model supports all types of relationships (one-to-one, one-to-many and many-to-many), and data is represented in the form of nodes (records) and links. IDMS is a well-known software product that implements the features of the network model.
Disadvantage of NDBMS: because records can participate in many relationships, the structure becomes complex, and the system cannot access huge amounts of data with speed and accuracy.
Example: ABC College has two children, Department A and the College Library; this represents a one-to-many relationship. Even though there is no relationship between Department A and the College Library, a student can be a member of both Department A and the College Library. The student therefore has two parents, which tells us that this is the network DBMS model. This is a simple and good example of a network DBMS.

3.4. RDBMS
Identifying a record uniquely:
– A record is identified uniquely by the primary key associated with that record.
– The two concepts gel with each other in the sense that both RDBMS and object technologies believe that records and objects have an existence beyond their properties.
– Student class – student_ID
– Salesperson class – salesperson_ID
– Object ID (implicit, hidden from the entire world); Primary key (explicit, visible to the entire world)
Mapping classes to tables
– If we need to store classes on disk, there must be some way of mapping them to RDBMS structures.
– The moment a program is removed from the main memory of the computer, all the objects associated with that program also die.
– We therefore think of mapping objects to tables:
One object maps to exactly one table
More than one object maps to one table
One object maps to more than one table
Student class with two attributes: Student Name and Marks.
– Map it to an appropriate table: outline the corresponding table model for this class and write the SQL code corresponding to the table model.
– Transforming the object model to the table model, we add a record identifier (student_ID); student_ID is unique and not null (the primary key).
Mapping binary associations to tables
– Binary associations can be largely classified into 2 types: (a) many-to-many associations and (b) one-to-many associations.
– Ex – many students choosing many subjects for their courses and obtaining some marks in each one (many-to-many), and a school enrolling many students per standard (one-to-many). A relational sketch of the many-to-many case is given at the end of this subsection.
Super class and subclass tables
– Map the super class (Employee) to a table and the 2 subclasses (Manager and Clerk) to their two respective tables.
Employee table (super class) – attributes and whether nulls are allowed:
Employee_ID (N), Employee_Name (N), Age (Y), Grade (N)
Manager table (sub class) – attributes and whether nulls are allowed:
Employee_ID (N), Bonus (Y), Number_of_Subordinates (N)
Clerk table (sub class) – attributes and whether nulls are allowed:
Employee_ID (N), Number_of_Pending_Tasks (Y)
Create table employee
(Employee_ID Integer Not null,
Employee_Name Char(20) Not null,
Age Integer,
Grade Char(10) Not null,
Primary Key (Employee_ID));
Create table manager
(Employee_ID Integer Not null,
Bonus Integer,
Number_of_Subordinates Integer,
Primary Key (Employee_ID),
Foreign Key (Employee_ID) References Employee);
Create table clerk
(Employee_ID Integer Not null,
Number_of_Pending_Tasks Integer,
Primary Key (Employee_ID),
Foreign Key (Employee_ID) References Employee);
Many subclass tables – there is no super class table; instead we obtain the information from the subclass tables, so every subclass table must be self-sufficient.
One super class table – we have only one table in the design; there are no separate subclass tables.
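As promised above, here is one conventional way to map the many-to-many "students choose subjects and obtain marks" association: the two classes become two tables, and the association itself becomes a third (junction) table whose primary key combines the two foreign keys and which carries the Marks attribute of the association. The table and column names below are illustrative only.

Create table student
(Student_ID Integer Not null,
Student_Name Char(30) Not null,
Primary Key (Student_ID));

Create table subject
(Subject_ID Integer Not null,
Subject_Name Char(30) Not null,
Primary Key (Subject_ID));

Create table enrollment
(Student_ID Integer Not null,
Subject_ID Integer Not null,
Marks Integer,
Primary Key (Student_ID, Subject_ID),
Foreign Key (Student_ID) References Student,
Foreign Key (Subject_ID) References Subject);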
3.5. OODBMS
– Object Oriented Database Management Systems (OODBMS)
– An OODBMS provides persistent (permanent) storage for objects.
– An OODBMS is used in a multi-user client/server environment: it controls concurrent access to objects, provides locking mechanisms and transactional features, offers security features at the object level, and also ensures object backup and restoration.
– An OODBMS generally uses class definitions and a traditional OOP language, such as C++ or Java, to define, manipulate and retrieve data.
[Figure: Java and C++ objects stored directly in an OODBMS]
OODBMS specializations: OQL (Object Query Language), ODL (Object Definition Language), OML (Object Manipulation Language)
• When should an OODBMS be used?
– Traditional business applications require data to be stored in the form of rows and columns; for example, a payroll application would need the employee details stored in one table, the payment details in another, and so on, all in tabular form. This makes conceptual understanding and retrieval of the data quite easy.
– In other applications it is not wise to split the data into rows and columns; instead the data should be stored in its original form.
– In an OODBMS an object is stored as an object, not as rows and columns.
[Figure: objects flattened into tables in an RDBMS versus stored directly in an OODBMS]
Advantages of OODBMS
– (1) Quicker access to information: an OODBMS keeps track of objects via their unique object IDs. A search operation moves from one object to another via these IDs, not through complex foreign-key traversals.
– (2) Creating new data types: an OODBMS does not restrict the types of data that can be stored, whereas an RDBMS provides a fixed set of data types such as integers and strings.
– (3) Integration with OOP languages: an OODBMS is actually an extension of an OOP language such as Java or C++, so there is no impedance mismatch between the language and the DBMS. (Impedance mismatch – when we execute a SELECT query from a C program and it returns multiple rows, the C program stores them in a buffer and processes them one by one; the program's view is one row at a time, but the RDBMS delivers multiple rows in one shot.) An OODBMS, by contrast, deals with one object at a time and preserves the original characteristics of the object.

3.6 Query Language
A query language provides the tools related to database management: creating tables, querying the database for information, modifying the data in the database, deleting data, granting access to users, and so on.
3.6.1 History of SQL
Dr. E. F. Codd published the paper "A Relational Model of Data for Large Shared Data Banks" in June 1970 in the Association for Computing Machinery (ACM) journal, Communications of the ACM. Codd's model is now accepted as the definitive model for relational database management systems (RDBMS). The language Structured English Query Language (SEQUEL) was developed by IBM to use Codd's model. SEQUEL later became SQL (still often pronounced "sequel"). In 1979, Relational Software, Inc. (now Oracle) introduced the first commercially available implementation of SQL. Today, SQL is accepted as the standard RDBMS language.
3.6.2 Advantages of SQL
High speed
o SQL queries can be used to retrieve large numbers of records from a database quickly and efficiently.
Well-defined standards exist
o SQL databases use long-established standards adopted by ANSI and ISO.
o Non-SQL databases do not adhere to any clear standard.
No coding required
o Using standard SQL it is easier to manage database systems without having to write a substantial amount of code.
Emergence of ORDBMS
o Previously, SQL databases were synonymous with relational databases.
o With the emergence of object-oriented DBMSs, object storage capabilities have been extended to relational databases.
3.6.3 Disadvantages of SQL
Difficulty in interfacing
o Interfacing an SQL database is more complex than adding a few lines of code.
More features implemented in proprietary ways
o Although SQL databases conform to ANSI and ISO standards, some vendors add proprietary extensions to standard SQL, which leads to vendor lock-in.
3.6.4 SQL data types
BOOLEAN –
o A Boolean value: either true, false or null.
CHAR (size) or CHARACTER (size) –
o A string of fixed length. The maximum size of a CHAR string is 1 billion characters.
VARCHAR (size), LONGVARCHAR (size), CHARACTER VARYING (size), LONG CHARACTER VARYING (size), TEXT (size) or STRING (size) –
o A string of variable length. The size constraint of these string types does not have to be given and defaults to the maximum size of string that the database is able to store. The maximum size of these string types is 1 billion characters.
TINYINT –
o An 8-bit signed integer value. The range of TINYINT is -128 to 127.
SMALLINT –
o A 16-bit signed integer value. The range of SMALLINT is -32768 to 32767.
INTEGER or INT –
o A 32-bit signed integer value. The range of INTEGER is -2147483648 to 2147483647.
BIGINT –
o A 64-bit signed integer value. The range of BIGINT is -9223372036854775808 to 9223372036854775807.
FLOAT or DOUBLE –
o A 64-bit precision floating point value. These types are analogous to the Java double type.
REAL, NUMERIC or DECIMAL –
o A higher precision numeric value.
o These numeric types are represented by java.math.BigDecimal and can therefore represent numeric values of any precision and scale.
DATE –
o A day/month/year value.
o The DATE type does not have any near-time bounding issues and is able to represent dates many millennia in the future and the past.
o The DATE type is internally represented by java.util.Date.
TIME –
o A time-of-day value.
TIMESTAMP –
o A day/month/year and time-of-day value.
o The TIMESTAMP type does not have any near-time bounding issues and is able to represent dates many millennia in the future and the past.
o The TIMESTAMP type is internally represented by java.util.Date.
BINARY (size), VARBINARY (size) or LONGVARBINARY (size) –
o A variable-sized binary object. The size constraint is optional and defaults to the maximum size. The maximum size of a binary object is 2 billion bytes.
3.6.5 SQL Data Creation
Creating a basic table involves naming the table and defining its columns and each column's data type. The SQL CREATE TABLE statement is used to create a new table.
Syntax –
CREATE TABLE table_name(
column1 datatype,
column2 datatype,
column3 datatype,
.....
columnN datatype,
PRIMARY KEY( one or more columns )
);
CREATE TABLE is the keyword telling the database system what you want to do; in this case, you want to create a new table. The unique name or identifier for the table follows the CREATE TABLE keywords. Then, in parentheses, comes the list defining each column in the table and what sort of data type it has. The syntax becomes clearer with the example below. A copy of an existing table can also be created using a combination of the CREATE TABLE statement and the SELECT statement.
Ex –
SQL> CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25),
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);
3.6.6 Retrieval and manipulation of data
SQL Logical Operators
ALL – The ALL operator is used to compare a value to all values in another value set.
AND – The AND operator allows the existence of multiple conditions in an SQL statement's WHERE clause.
ANY – The ANY operator is used to compare a value to any applicable value in the list according to the condition.
BETWEEN – The BETWEEN operator is used to search for values that are within a set of values, given the minimum value and the maximum value.
EXISTS – The EXISTS operator is used to search for the presence of a row in a specified table that meets certain criteria.
IN – The IN operator is used to compare a value to a list of literal values that have been specified.
LIKE – The LIKE operator is used to compare a value to similar values using wildcard operators.
NOT – The NOT operator reverses the meaning of the logical operator with which it is used.
OR – The OR operator is used to combine multiple conditions in an SQL statement's WHERE clause.
IS NULL – The IS NULL operator is used to compare a value with a NULL value.
UNIQUE – The UNIQUE operator searches every row of a specified table for uniqueness (no duplicates).
SQL Aggregate Functions
SQL aggregate functions return a single value, calculated from values in a column.
Useful aggregate functions:
AVG() - Returns the average value
COUNT() - Returns the number of rows
FIRST() - Returns the first value
LAST() - Returns the last value
MAX() - Returns the largest value
MIN() - Returns the smallest value
SUM() - Returns the sum
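A brief illustration of aggregate functions, run against the CUSTOMERS table created above (the exact result depends, of course, on the rows that have been inserted): the query counts the customers and computes the smallest, largest and average salary in a single pass.

SQL> SELECT COUNT(*)    AS NUM_CUSTOMERS,
            MIN(SALARY) AS LOWEST,
            MAX(SALARY) AS HIGHEST,
            AVG(SALARY) AS AVERAGE
     FROM CUSTOMERS;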
SQL Scalar Functions
SQL scalar functions return a single value, based on the input value.
Useful scalar functions:
UCASE() - Converts a field to upper case
LCASE() - Converts a field to lower case
MID() - Extracts characters from a text field
LEN() - Returns the length of a text field
ROUND() - Rounds a numeric field to the number of decimals specified
NOW() - Returns the current system date and time
FORMAT() - Formats how a field is to be displayed
(a) Select
o The SQL SELECT statement is used to fetch data from a database table; it returns the data in the form of a result table. These result tables are called result sets.
Syntax
SELECT column1, column2, columnN FROM table_name;
SELECT * FROM table_name;
Consider the CUSTOMERS table:
ID   NAME       AGE   ADDRESS     SALARY
1    Ramesh     32    Ahmedabad   2000.00
2    Khilan     25    Delhi       1500.00
3    kaushik    23    Kota        2000.00
4    Chaitali   25    Mumbai      6500.00
5    Hardik     27    Bhopal      8500.00
6    Komal      22    MP          4500.00
o SQL> SELECT ID, NAME, SALARY FROM CUSTOMERS;
ID   NAME       SALARY
1    Ramesh     2000.00
2    Khilan     1500.00
3    kaushik    2000.00
4    Chaitali   6500.00
5    Hardik     8500.00
6    Komal      4500.00
(b) Where
o The SQL WHERE clause is used to specify a condition while fetching data from a single table or joining multiple tables.
o Only if the given condition is satisfied does the query return the corresponding values from the table.
o You use the WHERE clause to filter the records and fetch only the necessary records.
o The WHERE clause is used not only in the SELECT statement but also in the UPDATE and DELETE statements, which are examined later in this unit.
o A condition is specified using comparison or logical operators such as >, <, =, LIKE, NOT, etc. The examples below make this concept clear.
Syntax
SELECT column1, column2, columnN FROM table_name WHERE [condition]
Ex –
SQL> SELECT ID, NAME, SALARY
The SQL INSERT INTO syntax would be as follows: Ex – o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Ramesh', 32, 'Ahmedabad', 2000.00); o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Khilan', 25, 'Delhi', 1500.00); o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'kaushik', 23, 'Kota', 2000.00); o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Chaitali', 25, 'Mumbai', 6500.00); o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'Hardik', 27, 'Bhopal', 8500.00); o INSERT INTO CUSTOMERS (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Komal', 22, 'MP', 4500.00); ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal MP 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 (e) Like The SQL LIKE clause is used to compare a value to similar values using wildcard operators. There are two wildcards used in conjunction with the LIKE operator: The percent sign (%) The underscore (_) 28 The percent sign represents zero, one, or multiple characters. The underscore represents a single number or character. The symbols can be used in combinations. Syntax: The basic syntax of % and _ is as follows: SELECT FROM table_name WHERE column LIKE 'XXXX%' Or SELECT FROM table_name WHERE column LIKE '%XXXX%' Or SELECT FROM table_name WHERE column LIKE 'XXXX_' Or SELECT FROM table_name WHERE column LIKE '_XXXX' Or SELECT FROM table_name WHERE column LIKE '_XXXX_' You can combine N number of conditions using AND or OR operators. Here XXXX could be any numeric or string value. Ex Here are number of examples showing WHERE part having different LIKE clause with '%' and '_' operators: Statement WHERE SALARY LIKE '200%' WHERE SALARY LIKE '%200%' WHERE SALARY LIKE '_00%' WHERE SALARY LIKE '2_%_%' WHERE SALARY LIKE '%2' WHERE SALARY LIKE '_2%3' WHERE SALARY LIKE '2___3' Description Finds any values that start with 200 Finds any values that have 200 in any position Finds any values that have 00 in the second and third positions Finds any values that start with 2 and are at least 3 characters in length Finds any values that end with 2 Finds any values that have a 2 in the second position and end with a3 Finds any values in a five-digit number that start with 2 and end with 3 29 (f) Order by The SQL ORDER BY clause is used to sort the data in ascending or descending order, based on one or more columns. Some database sorts query results in ascending order by default. Syntax: The basic syntax of ORDER BY clause is as follows: SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC]; Example: ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal MP 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 SQL> SELECT * FROM CUSTOMERS ORDER BY NAME; ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Mumbai Bhopal Kota Delhi MP Ahmedabad 6500.00 8500.00 2000.00 1500.00 4500.00 2000.00 Chaitali Hardik kaushik Khilan Komal Ramesh 25 27 23 25 22 32 (g) Group By The SQL GROUP BY clause is used in collaboration with the SELECT statement to arrange identical data into groups. The GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause. 
30 Syntax: SELECT column1, column2 FROM table_name WHERE [ conditions ] GROUP BY column1, column2 ORDER BY column1, column2 o Ex – ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal MP 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 o SQL> SELECT NAME, SUM (SALARY) FROM CUSTOMERS GROUP BY NAME; NAME Chaitali Hardik kaushik Khilan Komal Ramesh SUM (SALARY) 6500.00 8500.00 2000.00 1500.00 4500.00 2000.00 (h) Update o The SQL UPDATE Query is used to modify the existing records in a table. o You can use WHERE clause with UPDATE query to update selected rows otherwise all the rows would be effected. Syntax: UPDATE table_name SET column1 = value1, column2 = value2...., columnN = valueN WHERE [condition]; 31 Ex – ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal MP 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 SQL> UPDATE CUSTOMERS SET ADDRESS = 'Pune' WHERE ID = 6; ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal Pune 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 (i) AND & OR o The SQL AND & OR operators are used to compile multiple conditions to narrow data in an SQL statement. o These two operators are called conjunctive operators. o These operators provide a means to make multiple comparisons with different operators in the same SQL statement. Syntax SELECT column1, column2, columnN ROM table_name WHERE [condition1] AND [condition2]...AND [conditionN]; 32 Ex – ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal Pune 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 SQL> SELECT ID, NAME, SALARY FROM CUSTOMERS WHERE SALARY > 2000 AND age < 25; ID NAME SALARY 6 Komal 4500.00 The OR operator is used to combine multiple conditions in an SQL statement's WHERE clause. Syntax SQL> SELECT ID, NAME, SALARY FROM CUSTOMERS WHERE SALARY > 2000 OR age < 25; ID NAME SALARY 3 4 5 6 kaushik Chaitali Hardik Komal 2000.00 6500.00 8500.00 4500.00 (j) Sub query A Subquery or Inner query or Nested query is a query within another SQL query, and embedded within the WHERE clause. 33 A Subquery is used to return data that will be used in the main query as a condition to further restrict the data to be retrieved. Subqueries can be used with the SELECT, INSERT, UPDATE, and DELETE statements along with the operators like =, <, >, >=, <=, IN, BETWEEN etc. There are a few rules that Subqueries must follow: Subqueries must be enclosed within parentheses. A Subquery can have only one column in the SELECT clause, unless multiple columns are in the main query for the Subquery to compare its selected columns. An ORDER BY cannot be used in a Subquery, although the main query can use an ORDER BY. The GROUP BY can be used to perform the same function as the ORDER BY in a Subquery. Subqueries that return more than one row can only be used with multiple value operators, such as the IN operator. The SELECT list cannot include any references to values that evaluate to a BLOB, ARRAY, CLOB, or NCLOB. A Subquery cannot be immediately enclosed in a set function. The BETWEEN operator cannot be used with a Subquery; however, the BETWEEN can be used within the Subquery. 
ExSQL> SELECT * FROM CUSTOMERS WHERE ID IN (SELECT ID FROM CUSTOMERS WHERE SALARY > 4500) ; ID NAME AGE ADDRESS SALARY 4 5 Chaitali Hardik 25 27 Mumbai Bhopal 6500.00 8500.00 34 Subqueries also can be used with INSERT statements. The INSERT statement uses the data returned from the Subquery to insert into another table. The selected data in the Subquery can be modified with any of the character, date, or number functions. Syntax INSERT INTO table_name [ (column1 [, column2 ]) ] SELECT [ *|column1 [, column2 ] FROM table1 [, table2 ] WHERE VALUE OPERATOR ] Subqueries with the UPDATE Statement: The Subquery can be used in conjunction with the UPDATE statement. Either single or multiple columns in a table can be updated when using a Subquery with the UPDATE statement. Syntax UPDATE table SET column_name = new_value [ WHERE OPERATOR [ VALUE ] (SELECT COLUMN_NAME FROM TABLE_NAME) [ WHERE) ] Subqueries with the DELETE Statement: The Subquery can be used in conjunction with the DELETE statement like with any other statements mentioned above. Syntax DELETE FROM TABLE_NAME [ WHERE OPERATOR [ VALUE ] (SELECT COLUMN_NAME FROM TABLE_NAME) [ WHERE) ] 35 (k) Views A view is nothing more than a SQL statement that is stored in the database with an associated name. A view is actually a composition of a table in the form of a predefined SQL query. A view can contain all rows of a table or select rows from a table. A view can be created from one or many tables which depends on the written SQL query to create a view. Views which are kind of virtual tables, allow users to do the following: Structure data in a way that users or classes of users find natural or intuitive. Restrict access to the data such that a user can see and (sometimes) modify exactly what they need and no more. Summarize data from various tables which can be used to generate reports. Database views are created using the CREATE VIEW statement. Views can be created from a single table, multiple tables, or another view. To create a view, a user must have the appropriate system privilege according to the specific implementation. Syntax CREATE VIEW view_name AS SELECT column1, column2..... FROM table_name WHERE [condition]; ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 32 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal Pune 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 36 ExSQL > CREATE VIEW CUSTOMERS_VIEW AS SELECT name, age FROM CUSTOMERS; SQL > SELECT * FROM CUSTOMERS_VIEW; NAME Ramesh Khilan kaushik Chaitali Hardik Komal AGE 32 25 23 25 27 22 Updating a View: A view can be updated under certain conditions: The SELECT clause may not contain the keyword DISTINCT. The SELECT clause may not contain summary functions. The SELECT clause may not contain set functions. The SELECT clause may not contain set operators. The SELECT clause may not contain an ORDER BY clause. The FROM clause may not contain multiple tables. The WHERE clause may not contain Subqueries. The query may not contain GROUP BY or HAVING. Calculated columns may not be updated. All NOT NULL columns from the base table must be included in the view in order for the INSERT query to function. Ex – SQL > UPDATE CUSTOMERS_VIEW SET AGE = 35 WHERE name='Ramesh'; 37 ID NAME AGE ADDRESS SALARY 1 2 3 4 5 6 Ramesh Khilan kaushik Chaitali Hardik Komal 35 25 23 25 27 22 Ahmedabad Delhi Kota Mumbai Bhopal Pune 2000.00 1500.00 2000.00 6500.00 8500.00 4500.00 3.7. 
Concurrency Control
Concurrency control coordinates simultaneous transaction execution in a multiprocessing database and ensures the serializability of transactions in a multiuser database environment.
Potential problems in multiuser environments
Three main problems: lost updates, uncommitted data, and inconsistent retrievals.
(I) Lost updates
Ex
o Assume that two concurrent transactions (T1, T2) operate on a PRODUCT table which records a product's quantity on hand (PROD_QOH); in this example the initial value of PROD_QOH is 35. The transactions are:
T1: Purchase 100 units – PROD_QOH = PROD_QOH + 100
T2: Sell 30 units – PROD_QOH = PROD_QOH - 30
[Table 2: Normal execution of the two transactions]
[Table 3: Lost updates]
o The first transaction (T1) has not yet been committed when the second transaction (T2) is executed.
o T2 therefore still operates on the value 35, and its subtraction yields 5 in memory.
o T1 writes the value 135 to disk, which is promptly overwritten by T2's 5, so T1's update is lost.
(II) Uncommitted Data
o Uncommitted data appears when two transactions, T1 and T2, are executed concurrently and the first transaction (T1) is rolled back after the second transaction (T2) has already accessed the uncommitted data – thus violating the isolation property of transactions.
o The transactions are:
T1: Purchase 100 units – PROD_QOH = PROD_QOH + 100 (then rollback)
T2: Sell 30 units – PROD_QOH = PROD_QOH - 30
[Table 4: Correct execution of the two transactions]
[Table 5: An uncommitted-data problem]
(III) Inconsistent Retrievals
Inconsistent retrievals occur when a transaction calculates some summary (aggregate) function over a set of data while other transactions are updating the data. The transaction might read some data before they are changed and other data after they are changed, thereby yielding inconsistent results.
T1 calculates the total quantity on hand of the products stored in the PRODUCT table.
T2 updates PROD_QOH for two of the PRODUCT table's products.
[Table 6: Retrieval during update]
[Table 7: Transaction results: data entry correction]
[Table 8: Transaction result: data entry correction]
The transaction table in Table 8 demonstrates that inconsistent retrievals are possible during transaction execution, making the result of T1's execution incorrect. Unless the DBMS exercises concurrency control, a multi-user database environment can create chaos within the information system.
3.7.1 The Scheduler – Schedule, Serializability, Recovery, Isolation
The previous examples executed the operations within a transaction in an arbitrary order: as long as two transactions, T1 and T2, access unrelated data, there is no conflict, and the order of execution is irrelevant to the final outcome.
Table 9 Read/Write Conflict Scenarios: Conflicting Database Operations Matrix Schedules – Sequences that indicate the chronological order in which instructions of concurrent transactions are executed a schedule for a set of transactions must consist of all instructions of those transactions must preserve the order in which the instructions appear in each individual transaction. Example of schedules 40 Schedule 1 (right figure): Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B. The following is a serial schedule, in which T1 is followed by T2. Schedule 2 (right figure): Let T1 and T2 be the transactions defined previously. The following schedule is not a serial schedule, but it is equivalent to Schedule 1. Schedule 3 (lower right figure): The following concurrent schedule does not preserve the value of the sum A + B. Serializability A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of: 1. conflict serializability 2. View serializability Conflict Serializability: Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q. o Ii = read (Q), Ij = read (Q). Ii and Ij don’t conflict. o Ii = read (Q), Ij = write (Q). They conflict. o Ii = write (Q), Ij = read (Q). They conflict o Ii = write (Q), Ij = write (Q). They conflict 41 If a schedule S can be transformed into a schedule S’ by a series of swaps of nonconflicting instructions, we say that S and S’ are conflict equivalent. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. View Serializability: Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met: For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S’, also read the initial value of Q. For each data item Q if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S’. As can be seen, view equivalence is also based purely on reads and writes alone. A schedule S is view serializable it is view equivalent to a serial schedule. Every conflict serializable schedule is also view serializable. Schedule in right figure – a schedule which is view-serializable but not conflict serializable. Every view serializable schedule that is not conflict serializable has blind writes. 42 Other Notions of Serializability Schedule in right figure given below produces same outcome as the serial schedule < T1, T5 >, yet is not conflict equivalent or view equivalent to it. Determining such equivalence requires analysis of operations other than read and write. Recoverability Need to address the effect of transaction failures on concurrently running transactions Recoverable schedule – if a transaction Tj reads a data items previously written by a transaction Ti, the commit operation of Ti appears before the commit operation of Tj. The schedule in right figure is not recoverable if T9 commits immediately after the read. 
If T8 should abort, T9 would have read (and possibly shown to the user) an inconsistent database state. Hence database must ensure that schedules are recoverable. Cascading rollback – a single transaction failure leads to a series of transaction rollbacks. Consider the following schedule where none of the transactions has yet committed (so the schedule is recoverable) If T10 fails, T11 and T12 must also be rolled back. Can lead to the undoing of a significant amount of work 43 Cascade less schedules — cascading rollbacks cannot occur; for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj. Every cascade less schedule is also recoverable It is desirable to restrict the schedules to those that are cascade less Implementation of Isolation Schedules must be conflict or view serializable, and recoverable, for the sake of database consistency, and preferably cascade less. A policy in which only one transaction can execute at a time generates serial schedules, but provides a poor degree of concurrency. Concurrency-control schemes tradeoff between the amount of concurrency they allow and the amount of overhead that they incur. Some schemes allow only conflict-serializable schedules to be generated, while others allow view-serializable schedules that are not conflict-serializable. 3.7.2Concurrency Control with Locking Methods Lock guarantees current transaction exclusive use of data item, i.e., transaction T2 does not have access to a data item that is currently being used by transaction T1. Acquires lock prior to access. Lock released when transaction is completed. DBMS automatically initiates and enforces locking procedures. All lock information is managed by lock manager. Lock Granularity Lock granularity indicates level of lock use: database, table, page, row, or field (attribute). Database-Level The entire database is locked. Transaction T2 is prevented to use any tables in the database while T1 is being executed. Good for batch processes, but unsuitable for online multi-user DBMSs. Table-Level 44 The entire table is locked. If a transaction requires access to several tables, each table may be locked. Transaction T2 is prevented to use any row in the table while T1 is being executed. Two transactions can access the same database as long as they access different tables. It causes traffic jams when many transactions are waiting to access the same table. Table-level locks are not suitable for multi-user DBMSs. Page-Level The DBMS will lock an entire disk page (or page), which is the equivalent of a disk block as a (referenced) section of a disk. A page has a fixed size and a table can span several pages while a page can contain several rows of one or more tables. Page-level lock is currently the most frequently used multi-user DBMS locking method. T2 must wait for using a locked page which locates a row, if T1 is using it. Row-Level With less restriction respect to previous discussion, it allows concurrent transactions to access different rows of the same table even if the rows are located on the same page. It improves the availability of data, but requires high overhead cost for management. Field-Level It allows concurrent transactions to access the same row, as long as they require the use of different fields (attributes) within a row. The most flexible multi-user data access, but cost extremely high level of computer overhead. 
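To connect the lock types described below with the lost-update example from section 3.7, the sketch that follows shows how a transaction can take an exclusive row-level lock before updating PROD_QOH. The SELECT ... FOR UPDATE syntax is offered by many, though not all, relational products, and the PRODUCT table and the product code used here are hypothetical. A second transaction issuing the same statements is made to wait until the first one commits, so neither update is lost.

SQL> START TRANSACTION;
SQL> SELECT PROD_QOH
     FROM PRODUCT
     WHERE PROD_CODE = 'P1089'
     FOR UPDATE;                          -- acquire an exclusive lock on the row
SQL> UPDATE PRODUCT
     SET PROD_QOH = PROD_QOH + 100        -- T1: purchase 100 units
     WHERE PROD_CODE = 'P1089';
SQL> COMMIT;                              -- the lock is released here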
Lock Types The DBMS may use different lock types: binary or shared/exclusive locks. A locking protocol is a set of rules followed by all transactions while requesting and releasing locks. Locking protocols restrict the set of possible schedules. 45 Binary Locks Two states: locked (1) or unlocked (0). Locked objects are unavailable to other objects. Unlocked objects are open to any transaction. Transaction unlocks object when complete. Every transaction requires a lock and unlock operation for each data item that is accessed. Shared/Exclusive Locks Shared (S Mode) Exists when concurrent transactions granted READ access Produces no conflict for read-only transactions Issued when transaction wants to read and exclusive lock not held on item Exclusive (X Mode) Exists when access reserved for locking transaction Used when potential for conflict exists (also refer Table 9) Issued when transaction wants to update unlocked data Lock-compatibility matrix A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive on the item no other transaction may hold any lock on the item. If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held by other transactions have been released. The lock is then granted. Reasons to increasing manager’s overhead The type of lock held must be known before a lock can be granted 46 Three lock operations exist: READ_LOCK (to check the type of lock), WRITE_LOCK (to issue the lock), and UNLOCK (to release the lock). The schema has been enhanced to allow a lock upgrade (from shared to exclusive) and a lock downgrade (from exclusive to shared). Problems with Locking Transaction schedule may not be serializable Managed through two-phase locking Schedule may create deadlocks Managed by using deadlock detection and prevention techniques Two-Phase Locking Two-phase locking defines how transactions acquire and relinquish (or revoke) locks. Growing phase – acquires all the required locks without unlocking any data. Once all locks have been acquired, the transaction is in its locked point. Shrinking phase – releases all locks and cannot obtain any new lock. Governing rules Two transactions cannot have conflicting locks No unlock operation can precede a lock operation in the same transaction No data are affected until all locks are obtained When the locked point is reached, the data are modified to conform to the transaction requirements. The transaction is completed as it released all of the locks it acquired in the first phase. Deadlocks Occurs when two transactions wait for each other to unlock data. For example: T1 = access data items X and Y T2 = access data items Y and X Deadly embrace – if T1 has not unlocked data item Y, T2 cannot begin; if T2 has not unlocked data item X, T1 cannot continue. 47 Control techniques Deadlock prevention – a transaction requesting a new lock is aborted if there is the possibility that a deadlock can occur. If the transaction is aborted, all the changes made by this transaction are rolled back, and all locks obtained by the transaction are released. 3.8. What is a Data Warehouse? A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. 
It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: Subject Oriented Integrated Nonvolatile Time Variant Subject Oriented Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented. 48 Integrated Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated. Nonvolatile Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred. Time Variant In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant. 3.8.1Data Warehouse Architecture (Basic) Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse. 49 Figure 1-2 Architecture of a Data Warehouse This illustrates four things: Data Sources (operational systems and flat files) Staging Area (where data sources go before the warehouse) Warehouse (metadata, summary data, and raw data) Users (analysis, reporting, and mining) In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view. Data Warehouse Architecture (with a Staging Area) In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture. 
Figure 1-3 Architecture of a Data Warehouse with a Staging Area
This illustrates four things:
Data Sources (operational systems and flat files)
Staging Area (where data sources go before the warehouse)
Warehouse (metadata, summary data, and raw data)
Users (analysis, reporting, and mining)

Data Warehouse Architecture (with a Staging Area and Data Marts)
Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
This illustrates five things:
Data Sources (operational systems and flat files)
Staging Area (where data sources go before the warehouse)
Warehouse (metadata, summary data, and raw data)
Data Marts (purchasing, sales, and inventory)
Users (analysis, reporting, and mining)

Warehouse data modeling levels
There are three levels of data modeling: conceptual, logical, and physical. Each level has its own purpose in data warehouse design.

Conceptual
The high-level data model is a consistent definition of all of the business subject areas and data elements common to the business, from a high-level business view down to a generic logical data design. From this, you can derive the general scope and understanding of the business requirements. This conceptual data model is the basis for both current and future phases of data warehouse development.

Logical
The logical data model contains much more detailed information about the business subject areas. It captures the detailed business requirements in the target business subject areas and is the basis for the physical data modeling of the current project. Starting from this stage, the solution adopts a bottom-up approach, which means that only the most important and urgent business subject areas are targeted in the logical data model. The features of the logical data model include:
Specifications for all entities and the relationships among them
Specifications for each entity's attributes
Specifications for all primary keys and foreign keys
Normalization and aggregation
Specifications for the multidimensional data structure

Physical
The physical data modeling applies physical constraints, such as space, performance, and the physical distribution of data. The physical data model is tightly related to the database system and data warehouse tools that you will use. The purpose of this phase is to design the actual physical implementation.

It is particularly important to clearly separate logical modeling from physical modeling. Good logical modeling practice focuses on the essence of the problem domain: logical modeling addresses the "what" question. Physical modeling addresses the "how" question, representing implementation reality in a given computing environment. Since the business computing environment changes from time to time, the separation of logical and physical data modeling helps stabilize the logical models from phase to phase.

Figure 4. Data warehouse logical data model life cycle
Once a data warehouse is implemented and your customers begin using it, they will often generate new requests and requirements.
This will start another cycle of development, continuing the iterative and evolutionary process of building the data warehouse. As you can see, the logical data model is a living part of a data warehouse, used and maintained throughout the entire life cycle of the data warehouse. The process of data warehouse modeling can be truly endless.

3.10. What Is a Data Mart?
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

3.10.1 Dependent and Independent Data Marts
There are two basic types of data marts: dependent and independent. The categorization is based primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data warehouse that has already been created. Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data, or both.

The main difference between independent and dependent data marts is how you populate the data mart; that is, how you get data out of the sources and into the data mart. This step, called the Extraction, Transformation, and Loading (ETL) process, involves moving data from operational systems, filtering it, and loading it into the data mart.

With dependent data marts, this process is somewhat simplified because formatted and summarized (clean) data has already been loaded into the central data warehouse. The ETL process for dependent data marts is mostly a process of identifying the right subset of data relevant to the chosen data mart subject and moving a copy of it, perhaps in a summarized form. With independent data marts, however, you must deal with all aspects of the ETL process, much as you do with a central data warehouse. The number of sources is likely to be fewer, and the amount of data associated with the data mart is less than for the warehouse, given your focus on a single subject.

The motivations behind the creation of these two types of data marts are also typically different. Dependent data marts are usually built to achieve improved performance and availability, better control, and lower telecommunication costs resulting from local access of data relevant to a specific department. The creation of independent data marts is often driven by the need to have a solution within a shorter time.

3.10.2 What Are the Steps in Implementing a Data Mart?
Simply stated, the major steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions, and manage it over time. This section contains the following topics:
Designing
Constructing
Populating
Accessing
Managing

Designing
The design step is first in the data mart process. This step covers all of the tasks from initiating the request for a data mart through gathering information about the requirements and developing the logical and physical design of the data mart.
The design step involves the following tasks:
Gathering the business and technical requirements
Identifying data sources
Selecting the appropriate subset of data
Designing the logical and physical structure of the data mart

Constructing
This step includes creating the physical database and the logical structures associated with the data mart to provide fast and efficient access to the data. It involves the following tasks:
Creating the physical database and storage structures, such as tablespaces, associated with the data mart
Creating the schema objects, such as tables and indexes, defined in the design step
Determining how best to set up the tables and the access structures

Populating
The populating step covers all of the tasks related to getting the data from the source, cleaning it up, modifying it to the right format and level of detail, and moving it into the data mart. More formally stated, the populating step involves the following tasks:
Mapping data sources to target data structures
Extracting data
Cleansing and transforming the data
Loading data into the data mart
Creating and storing metadata

Accessing
The accessing step involves putting the data to use: querying the data, analyzing it, creating reports, charts, and graphs, and publishing these. Typically, the end user uses a graphical front-end tool to submit queries to the database and display the results of the queries. The accessing step requires that you perform the following tasks:
Set up an intermediate layer for the front-end tool to use. This layer, the metalayer, translates database structures and object names into business terms, so that the end user can interact with the data mart using terms that relate to the business function.
Maintain and manage these business interfaces.
Set up and manage database structures, such as summarized tables, that help queries submitted through the front-end tool execute quickly and efficiently.

Managing
This step involves managing the data mart over its lifetime. In this step, you perform management tasks such as the following:
Providing secure access to the data
Managing the growth of the data
Optimizing the system for better performance
Ensuring the availability of data even with system failures

3.10.3 Difference between a Data Warehouse and a Data Mart
It is important to note that there are huge differences between these two tools, even though they may serve the same purpose. Firstly, a data mart contains the programs, data, software, and hardware of a specific department of a company. There can be separate data marts for finance, sales, production, or marketing. All these data marts are different, but they can be coordinated. The data mart of one department is different from that of another department, and though indexed, a data mart is not suitable for a huge database, as it is designed to meet the requirements of a particular department.

A data warehouse, by contrast, is not limited to a particular department; it represents the database of the complete organization. The data stored in a data warehouse is more detailed, though indexing is light because it has to store huge amounts of information. It is also more difficult to manage and takes a longer time to process. It follows that data marts are quick and easy to use, as they make use of small amounts of data; data warehousing is also more expensive for the same reason.

3.10.4 What Is Metadata?
Metadata is information about the data.
For a data mart, metadata includes:
A description of the data in business terms
The format and definition of the data in system terms
Data sources and the frequency of refreshing data

The primary objective of the metadata management process is to provide a directory of technical and business views of the data mart metadata. Metadata can be categorized as technical metadata and business metadata. Technical metadata consists of metadata created during the creation of the data mart, as well as metadata to support the management of the data mart. This includes data acquisition rules, the transformation of source data into the format required by the target data mart, and schedules for backing up and refreshing data. Business metadata allows end users to understand what information is available in the data mart and how it can be accessed.

3.10.5 Data modeling for data marts
Since warehouse end users interact directly with data marts, data mart modeling is one of the most effective tools for capturing end-user business requirements. The data mart modeling process depends on many factors. Three of the most important are described below.

Data mart modeling is end-user-driven. End users must be involved in the data mart modeling process, as they obviously are the ones who will use the data mart. Because you should expect that end users are not at all familiar with complex data models, the modeling techniques and the modeling process as a whole should be organized so that the complexity is transparent to end users.

Data mart modeling is driven by business requirements. Data mart models are useful for capturing business requirements because they are often used directly by end users and are easy to understand.

Data mart modeling is greatly affected by data analysis technologies. The techniques of data analysis can impact the type of data models selected and their content. There are several techniques for data analysis in common use today: query and reporting, multidimensional analysis, and data mining. If the intent is simply to provide query and reporting capability, an ER model with a normalized or denormalized data structure would be most appropriate. A dimensional data model might also be a good choice because it is user-friendly and has better performance. If the objective is to perform multidimensional data analysis, a dimensional data model would be the only choice. Data mining, however, usually works best with the lowest level of detail available. Thus, if the data warehouse is used for data mining, a low level of detail should be included in the model.
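To make the dimensional option concrete, a small sales data mart built as a star schema might look roughly like the following (a hedged sketch; the table and column names are illustrative, not a prescribed design):

    -- Dimension tables hold the descriptive attributes used for slicing and filtering
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name VARCHAR(100),
        category     VARCHAR(50)
    );

    CREATE TABLE dim_time (
        time_id        INTEGER PRIMARY KEY,
        calendar_month INTEGER,
        calendar_year  INTEGER
    );

    -- The fact table holds the measures at the lowest level of detail kept in the mart
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product (product_id),
        time_id     INTEGER REFERENCES dim_time (time_id),
        amount_sold NUMERIC(10,2)
    );

A multidimensional analysis then amounts to joining the fact table with whichever dimensions the analyst wants to slice by, which is what makes this structure user-friendly for query and reporting as well.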