Enabling Decision Tree Intelligence in Materialized View
Submitted By RIAZ AHMAD (MS-IT Session 2006-2008)
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Information Technology in Databases and Data Warehousing
Institute of Management Sciences, NWFP, Peshawar, Pakistan
October, 2008

DEDICATION
I dedicate this struggle of mine to my loving Parents and to my sweet Alishba and Rohail.

Certificate of Originality
This is to certify that the report "Enabling Decision Tree Intelligence in Materialized View", submitted by the concerned student, meets the requirements for the Degree of MASTER OF SCIENCE (Information Technology) at IM | Sciences. All the work done is solely the effort of the student, and due appreciation is given to the work of others, which is cited as reference material.
Supervisor: Mr. Nafees-Ur-Rehman Sign: __________________
External Examiner: Name: ______________________ Designation: _________________ Affiliation: __________________ Sign: ___________________
Research Coordinator: Mr. Nafees-Ur-Rehman Sign: __________________

ACKNOWLEDGMENTS
I am indebted and grateful to all those people who have helped me during the course of my research project in one way or another. In particular, I would like to acknowledge and express my deepest gratitude to the following: First and foremost, I would like to express my gratitude to Mr. Nafees-Ur-Rehman, my supervisor, to whom I am greatly indebted, for giving me the opportunity to undertake this research under his supervision and especially for his encouragement throughout its course. Mr. Syed Akmal Shah, my classmate, for his support, kind suggestions, and fruitful discussions. Lab administrators Mr. Mumtaaz and Mr. Saqib for their kind help and patience in arranging all the necessary materials I needed during my lab work for my dissertation.
Last but not least, I would like to thank my Parents and the rest of my family members for their patience and constant encouragement. Riaz Ahmad, IM | Sciences, Hayatabad, Peshawar, PAK. Oct, 2008

List of Abbreviations
OLTP Online Transaction Processing
DW Data Warehouse
ERD Entity Relationship Diagram
ODS Operational Data Store
OLAP Online Analytical Processing
DTS Data Transformation Services
RDBMS Relational Database Management System
MV Materialized View
DM Data Mining
KDD Knowledge Discovery in Databases
DC Data Cleaning
DS Data Selection
DI Data Integration
DT Data Transformation
PE Pattern Evaluation
KP Knowledge Presentation
ID3 Iterative Dichotomiser 3

List of Figures
Fig No  Figure Caption  Page No
2.1 Data warehouse users  9
2.2 Dimensional Modeling  11
5.1 KDD parts  31
5.2 Decision Tree  36
5.3 Decision Tree of the dataset  41
5.4 Decision Tree Generated in SQL Server 2000  42

List of Tables
Table No  Table Title  Page No
2.1 Comparison between OLTP and OLAP Databases  8
5.1 Training Dataset for classification  38
5.2 Gain Information for Original set  40
5.3 Gain Information for Rain subset  40
5.4 Gain Information for Sunny subset  41
5.5 Dependent Table  43
5.6 Resultant values in dependent table  44

Abstract
Data Mining has attracted a great deal of attention in the information industry in recent years due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Unlike the conventional model, where data is taken to the data mining system, it is proposed in this thesis that mining algorithms be placed inside data warehouse and database structures. This research is a step forward in integrating data mining with data warehouses and databases. In particular, the decision tree is embedded with materialized views to reduce tree construction and data classification time.
For the construction of the decision tree, different calculations and computations are carried out repeatedly in a recursive manner. The initial dataset-level entropy is calculated along with other values and is stored in a new storage structure. These values are referenced in the tree construction whenever a new decision tree is required. These calculations are performed once and updated whenever the source dataset is updated, and their results are used each time a new decision tree is constructed. This pre-calculation reduces the tree construction time, as one does not have to re-calculate these values.

Table of Contents
Topic Titles .......... Page No
Chapter 1 .......... 1
Introduction .......... 1
1.1 Background .......... 1
1.2 Scope .......... 2
1.3 Objective .......... 2
1.4 Summary of Chapters .......... 2
Chapter 2 .......... 5
Data Warehousing .......... 5
2.1 Benefits of Data Warehousing .......... 5
2.2 OLAP Data Characteristics .......... 6
2.2.1.
Consolidated and Consistent .......... 6
2.2.2. Subject Oriented .......... 6
2.2.3. Historical .......... 6
2.2.4. Read Only .......... 7
2.2.5. Granular .......... 7
2.3 Database VS Data Warehouse .......... 8
2.4 Data Warehouse Users .......... 9
2.5 Developing a Data warehouse .......... 9
2.5.1 Identification and collection of information .......... 9
2.5.2 Dimensional Modeling Design .......... 10
2.5.3 Develop architecture containing Operational data store (ODS) .......... 13
2.5.4 Design Relational Database and OLAP cubes .......... 14
2.5.5 Develop Data warehouse maintenance applications .......... 14
2.5.6 Develop Analysis application .......... 14
2.5.7 Test and install or organize the System .......... 15
Chapter 3 .......... 16
Materialized View .......... 16
3.1 Materialized View in Different Environments ..........
17
3.1.1 Materialized Views for Distributed Computing .......... 17
3.1.2 Materialized Views for Mobile Computing .......... 17
3.2 The Need for Materialized Views .......... 17
3.3 Uses of Materialized Views .......... 18
3.4 How Materialized Views Work .......... 18
3.5 Types of Materialized View .......... 19
3.5.1 Types of Materialized view on the basis of Tables .......... 19
3.5.2 Some other Types of Materialized view .......... 20
3.6 Advantages and Disadvantages .......... 21
3.6.1 Advantages .......... 21
3.6.2 Disadvantages .......... 21
3.7 Materialized View Refresh Methods .......... 21
3.8 Creating a Materialized View .......... 22
3.9 Indexed View in SQL Server 2000 .......... 22
3.9.1 Restrictions on Creating Indexed Views .......... 23
3.9.2 Create the Indexed View or Materialized view .......... 23
Chapter 4 .......... 25
Integration of Data warehouse and Data Mining ..........
25
4.1 Introduction .......... 25
4.2 Data Integration .......... 26
4.3 Schema Integration .......... 27
4.4 Redundancy .......... 28
4.5 Inconsistencies .......... 28
Chapter 5 .......... 30
Data Mining .......... 30
5.1 Data Mining Definition .......... 30
5.2 Data Mining History .......... 31
5.3 Data Mining Techniques .......... 32
5.4 Classification in Data Mining .......... 33
5.4.1 Classification .......... 33
5.4.2 Related issues with classification .......... 34
5.5 Decision Tree Technique for Classification .......... 35
5.5.1 Decision Tree .......... 35
5.5.2 Generating classification rules from a decision tree .......... 36
5.5.3 ID3 Algorithms ..........
37
5.6 Partial Integration of Decision Tree in Materialized View .......... 42
5.6.1 Classification Experiment .......... 43
5.6.2 Conclusion .......... 45
Appendix A .......... 46
Code Section .......... 46
Appendix B .......... 52
Application Interface .......... 52
Appendix C .......... 58
References .......... 58

Chapter 1
Introduction
1.1 Background
A data warehouse is the enterprise-level repository of subject-oriented, time-variant, historical data used for information retrieval and decision support. The DW stores atomic and summary data. Decision-making means that the data warehouse is intended for knowledge workers, those people who must analyze the information provided by the warehouse to make business decisions. A warehouse is not intended for day-to-day transaction processing. Knowledge workers must access information to plan, forecast, and make financial decisions. They are often people who are reasonably authoritative or are in influential positions, such as financial controllers, business analysts, or department managers.
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining contains different types of techniques, which are used for classification, clustering, associations, and sequential patterns. Classification is a data mining (machine learning) technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will be "sunny", "rainy" or "cloudy". Popular classification techniques include decision trees and neural networks. Classification produces a function that maps a data item into one of several predefined classes by taking a training dataset as input and building a model of the class attribute based on the rest of the attributes. The resulting model is then used to classify new data. A decision tree is a classifier in the form of a tree structure. Decision trees are powerful and popular tools for classification and prediction. The attractiveness of decision trees is due to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed so that humans can understand them, or even used directly in a database access language like SQL so that records falling into a particular category may be retrieved. There are a variety of algorithms for building decision trees.
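The entropy and information-gain computations that underlie decision tree induction can be made concrete with a minimal Python sketch. The tiny weather dataset below is a hypothetical illustration (not the thesis's Table 5.1), and the attribute names are made up for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Expected reduction in entropy after splitting rows on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Tiny illustrative dataset (hypothetical values).
data = [
    {"outlook": "sunny",  "windy": "no",  "play": "no"},
    {"outlook": "sunny",  "windy": "yes", "play": "no"},
    {"outlook": "rainy",  "windy": "no",  "play": "yes"},
    {"outlook": "rainy",  "windy": "yes", "play": "no"},
    {"outlook": "cloudy", "windy": "no",  "play": "yes"},
]

# ID3 selects the attribute with maximum gain as the test at the root.
best = max(["outlook", "windy"],
           key=lambda a: information_gain(data, a, "play"))
```

These are exactly the values that ID3 recomputes at every node; the split on `best` partitions the rows, and the same procedure recurses on each partition.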
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
1.2 Scope
The ID3 algorithm is used behind the decision tree technique. It is a recursive process. During each iteration, the following three steps occur before the selection of the test attribute. In the first step, ID3 calculates the entropy of the whole dataset; in the second step, it calculates the entropy and gain of each input attribute in the dataset. In the third step, the ID3 algorithm selects the maximum-gain attribute for the classification of the dataset. This three-step process is very expensive with respect to time.
1.3 Objective
This research work has two main objectives. The first objective is to integrate data mining intelligence into data warehouses and databases. Here, I try to integrate the decision tree technique with the materialized view. The second objective is to reduce the time required for the construction of the classification tree. The computational process for constructing the tree is highly complex and recursive in nature. It includes repeatedly calculating various values, i.e., the entropy of the dataset and the entropy and gain values of each input attribute in the dataset. Here, I have pre-computed the results required at least for the selection and classification of the root node.
1.4 Summary of Chapters
Chapter 1, Introduction
Chapter 1 provides the background, scope, and objectives of the thesis.
Chapter 2, Data Warehousing
In this chapter we define the data warehouse and the benefits of data warehousing. Furthermore, it shows the comparison of data warehouses and databases and, at the end, it describes the development process of a data warehouse.
Chapter 3, Materialized or Indexed View
Chapter 3 contains the definition of materialized views, the types of MV, and their usage, and it also discusses the advantages and disadvantages along with the creation process.
Chapter 4, Data Warehouse and Data Mining Integration
Chapter 4 provides a detailed discussion on integration, and it also describes the different aspects of integrating data mining with databases, such as data integration, schema integration, redundancy, and inconsistencies.
Chapter 5, Data Mining
The last chapter includes a discussion of data mining, data mining techniques, and classification along with its related issues. Furthermore, there is a discussion of the decision tree and its creation, along with the ID3 algorithm which is used behind decision trees.
Appendix A
Appendix A contains the code section, which is divided into two parts. The first part contains the SQL Server 2000 code, while the second part contains the Visual Basic .NET code that provides the interface for the work performed in SQL Server 2000.
Appendix B
Appendix B provides the output of the practical work.
Appendix C
Appendix C contains references to the literature used in this thesis.

Chapter 2
Data Warehousing
Some people use the term data warehouse in a very general way. To them, any read-only collection of accumulated historical data is called a data warehouse. A data warehouse is a database specifically structured for query and analysis. A data warehouse typically contains data representing the business history of an organization. Data is usually less detailed and longer-lived than data from an online transaction processing (OLTP) system. For example, a data warehouse may store daily order totals by customer over the past five years, whereas an OLTP system would store every order processed but retain those records for only a few months. Some characteristics are common to all data warehouses: Data is collected from other sources, for example, an OLTP system. Data is made consistent prior to storage in the data warehouse. Data is summarized; data warehouses usually do not retain as much detail as transaction-oriented systems. Data is longer-lived.
Transaction systems may retain data only until processing is complete, whereas data warehouses may retain data for years. Data is stored in a format that is convenient for querying and analysis. Data is usually considered read-only. [3]
2.1 Benefits of Data Warehousing
Companies build warehouses to help them make decisions; they can use the information in a warehouse to spot trends, buying patterns, and relationships. Once a company builds a warehouse, company leaders have a consistent source for enterprise-wide data that allows for fast answers to queries. The analysis phase of building a data warehouse might uncover previously hidden information that allows for better decisions. The following are the advantages of a DW:
The ability to access enterprise-wide data
The ability to have consistent data
The ability to perform analysis quickly
Recognition of redundancy of effort
Discovery of gaps in business knowledge or business processes
Decreased administration costs
Empowering all members of an enterprise by providing them with the information necessary to perform effectively [3]
2.2 OLAP Data Characteristics
Data in a data warehouse has several attributes that differentiate it from data in a standard online transaction processing (OLTP) system [3].
2.2.1. Consolidated and Consistent
The terms consolidated and consistent have particular meanings in a data warehouse. Consolidated means that the data is gathered from throughout the enterprise and stored in a central location. Consistent means that all users will get the same results to the same question, even if it is posed at different times. For example, the answer to the question, "What were the total sales for January 1997?" will be consistent whether the question is posed in 1997 or 2002.
2.2.2. Subject Oriented
Data in a warehouse should include only key business information. Often, data in OLTP sources throughout the enterprise includes information that is not of use to decision makers in the company.
Only subject-oriented data should be moved into a warehouse. Once in the warehouse, the data should be organized based on subject.
2.2.3. Historical
Data warehouse data is historical, which means that it does not change over time unless a problem existed with the data at the source. Data in a warehouse represents a snapshot in time, so a warehouse is accurate only to a certain point in the past. Data in a warehouse often covers a long period of time; OLTP systems have only current or very recent data. Data over a long period of time allows the analysis of trends over time, including seasonal and long-term trends.
2.2.4. Read Only
Because data in a warehouse is historical, it is read only. Data in a warehouse changes only if errors are found in the original source data, because if data is updated after it is in a warehouse, consistency is compromised. Because data in a warehouse will not be updated or deleted, the warehouse can be structured to allow maximum speed and flexibility for queries, such as an aggressive use of indexes.
2.2.5. Granular
Data in an OLTP system is stored with maximum detail. Data in a data warehouse does not usually need to be stored with maximum detail. Instead, you can handle a certain level of summarization, so the data is stored with more or less granularity. The key to data warehouse design is to identify the appropriate level of summary. You can always summarize up, but you cannot drill down through a summary without the lower-level data.
2.3 Database VS Data Warehouse
The following table shows the differences between OLTP and OLAP databases. [4]
Table 2.1 Comparison between OLTP and OLAP databases
Databases | Data Warehouses
A database is a collection of related data, and a database system is a database and database software together. Databases are transactional systems; traditional databases support on-line transaction processing (OLTP), which includes insertions, updates, and deletions. | A data warehouse is also a collection of information as well as a supporting system. Whereas multi-databases provide access to disjoint and usually heterogeneous databases and are volatile, a data warehouse is frequently a store of integrated data from multiple sources, processed for storage in a multidimensional model, and nonvolatile. Data warehouses also support time-series and trend analysis, both of which require more historical data.
The ERD model is used for databases. Data in databases exists in normalized form. | Dimensional modeling is used for the DW. Data in a data warehouse is in denormalized form.
Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table. | Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. A data warehouse is typically optimized for access based on a decision maker's needs; data warehouses are designed specifically to support efficient extraction, processing, and presentation for analytic and decision-making purposes.
Designed for real-time business operations. | Designed for analysis of business measures by categories and attributes.
Optimized for validation of incoming data during transactions; uses validation data tables. | Loaded with consistent, valid data; requires no real-time validation.
Supports thousands of concurrent users. | Supports few concurrent users relative to OLTP.
2.4 Data Warehouse Users
Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers, Information Consumers, and Executives. Each type makes up a portion of the user population, as illustrated in this diagram. [5]
Figure 2.1 Data Warehouse Users
2.5 Developing a Data warehouse
The following phases are necessary for the development of a data warehouse. These are similar to those of most database projects. 1. Identification and collection of information. 2. Dimensional Modeling Design. 3. Develop an architecture containing an operational data store (ODS). 4.
Design Relational Database and OLAP cubes. 5. Develop Data warehouse maintenance applications. 6. Develop Analysis application. 7. Test and install or organize the System.
2.5.1 Identification and collection of information
First of all, understand the business before entering into discussions with users. Then interview and work with the users, not the data. Learn about the needs of the users and then turn these needs into project requirements. The data warehouse designer arranges or selects the data that provides suitable information. The most important part of the discussion with users concerns their objectives and challenges, as well as how they make business decisions. The business users should be tied to the design team during the logical design process; they are the people who understand the meaning of the data. After interviewing several users, find out from the experts what data exists and where it resides, but only after you understand the basic business needs of the end users.
2.5.2 Dimensional Modeling Design
On the basis of the business requirements, we can design the dimensional model. This must address the business needs, the grain of detail, and what dimensions and facts to include. The model should be designed for ease of access and maintenance, and should be able to adapt to future changes. The model defines the relational databases that support the OLAP cubes, which provide immediate query results to analysts. An OLTP system requires a normalized structure to minimize redundancy, provide validation of input data, and support a high volume of fast transactions. A transaction usually involves a single business event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider web of hundreds or even thousands of related tables. In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and relate to business needs, supports simplified business queries, and provides superior query performance by minimizing table joins.
[5]
STAR SCHEMA IN DATA WAREHOUSE DESIGN
In the design of a data warehouse, the fundamental structure utilized in a relational system is the star schema. This schema has become the design of choice because of its compact and uncomplicated structure, which facilitates query responses on the data. This schema is simple to understand and provides a good introduction to the framework of a warehouse. [6]
Figure 2.2 Dimensional Modeling
Fact Tables
Each data warehouse or data mart includes one or more fact tables. Central to a star or snowflake schema, a fact table captures the data that measures the organization's business operations. Fact tables usually contain large numbers of rows, sometimes in the hundreds of millions of records when they contain one or more years of history for a large organization. A key characteristic of a fact table is that it contains numerical data (facts) that can be summarized to provide information about the history of the operation of the organization. Each fact table also includes a multipart index that contains as foreign keys the primary keys of related dimension tables, which contain the attributes of the fact records. Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the index fields that relate the facts to corresponding entries in the dimension tables. In the FoodMart 2000 sample database provided with Microsoft® SQL Server™ 2000 Analysis Services, one fact table, sales_fact_1998, contains the following columns.
Column       Description
product_id   Foreign key for dimension table product.
time_id      Foreign key for dimension table time_by_day.
customer_id  Foreign key for dimension table customer.
store_id     Foreign key for dimension table store.
store_sales  Currency column containing the value of the sale.
store_cost   Currency column containing the cost to the store of the sale.
unit_sales   Numeric column containing the quantity sold.
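Rolling up an additive measure such as unit_sales across a dimension can be sketched in a few lines of Python. The miniature fact and dimension tables below are hypothetical; only the column names follow the sales_fact_1998 example above:

```python
from collections import defaultdict

# Hypothetical rows of a miniature fact table: (product_id, store_id, unit_sales).
fact_rows = [
    (1, 10, 3),
    (1, 11, 2),
    (2, 10, 5),
    (2, 11, 1),
]

# A dimension table maps the foreign key to descriptive attributes.
product_dim = {1: "Milk", 2: "Bread"}

# Because unit_sales is additive, a per-product summary is just a sum
# over the matching fact rows -- the essence of rolling up a star schema.
units_by_product = defaultdict(int)
for product_id, store_id, unit_sales in fact_rows:
    units_by_product[product_dim[product_id]] += unit_sales
```

The same summation could group by store_id or by any dimension attribute, which is why additive measures are the most useful facts to store.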
In this fact table, each entry represents the sale of a specific product on a specific day to a specific customer in accordance with a specific promotion at a specific store. The business measurements captured are the value of the sale, the cost to the store, and the quantity sold. The most useful measures to include in a fact table are numbers that are additive. Additive measures allow summary information to be obtained by adding various quantities of the measure, such as the sales of a specific item at a group of stores for a particular time period. [4]
Dimension Tables
Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. Dimensional modeling produces dimension tables in which each table contains attributes that are independent of those in other dimensions. For example, a customer dimension table contains data about customers, a product dimension table contains information about products, and a store dimension table contains information about stores. Queries use attributes in dimensions to specify a view into the fact information. [7] The records in a dimension table establish one-to-many relationships with the fact table. For example, there may be a number of sales to a single customer, or a number of sales of a single product. The dimension table contains attributes associated with the dimension entry; these attributes are rich and user-oriented textual details, such as the product name or the customer name and address. [5]
Hierarchies
The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business need to group and summarize data into usable information.
For example, a time dimension often contains the hierarchy elements Year, Quarter, Month, Day, or Quarter, Week, Day. A dimension may contain multiple hierarchies; a time dimension often contains both calendar and financial year hierarchies. A geography hierarchy for sales points is: Country, Region, State or Province, City, Store. [5]
Surrogate Keys
A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys provide the means to maintain data warehouse information when dimensions change. Special keys are used for date and time dimensions, but these keys differ from the surrogate keys used for other dimension tables. [5]
2.5.3 Develop architecture containing Operational data store (ODS)
The data warehouse architecture reflects the dimensional model developed to meet the business requirements. Dimension design largely determines dimension table design, and fact definitions determine fact table design. Data warehouse architectures must be designed to accommodate ongoing data updates, and to allow for future expansion with minimum impact on existing design. The historical nature of data warehouses means that records almost never have to be deleted from tables except to correct errors. Errors in source data are often detected in the extraction and transformation processes in the staging area and are corrected before the data is loaded into the data warehouse database. The dimensional model also lends itself to easy expansion. New dimension attributes and new dimensions can be added, usually without affecting existing schemas other than by extension. An entirely new schema can be added to a data warehouse without affecting existing functionality. A new business subject area can be added by designing and creating a fact table and any dimensions specific to the subject area.
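The surrogate-key idea described above can be sketched as a small generator that hands out warehouse-local integer keys, independent of whatever keys the source systems use. The source-system names and natural keys below are invented for illustration:

```python
class SurrogateKeyGenerator:
    """Assigns warehouse-local integer keys, independent of source-system keys."""

    def __init__(self):
        self._next_key = 1
        self._mapping = {}  # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._mapping:
            self._mapping[pair] = self._next_key
            self._next_key += 1
        return self._mapping[pair]

gen = SurrogateKeyGenerator()
k1 = gen.key_for("CRM", "CUST-0042")      # first sighting: new surrogate key
k2 = gen.key_for("Billing", "0042")       # different source key, different entry
k3 = gen.key_for("CRM", "CUST-0042")      # repeated lookup returns the same key
```

Because the warehouse key is decoupled from the source keys, a source system can renumber or recycle its identifiers without invalidating historical dimension rows.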
The Operational Data Store (ODS) is an operational construct that has elements of both a data warehouse and a transaction system. Like a data warehouse, the ODS typically contains data consolidated from multiple systems and grouped by subject area. Like a transaction system, the ODS may be updated by business users, and contains relatively little historical data. [5]

2.5.4 Design the Relational Database and OLAP Cubes

In this phase, the star or snowflake schema is created in the relational database, surrogate keys are defined, and primary and foreign key relationships are established. Views, indexes, and fact table partitions are also defined. OLAP cubes are designed to support the needs of the users. [5]

2.5.5 Develop Data Warehouse Maintenance Applications

The data maintenance applications, including extraction, transformation, and loading processes, must be automated, often with specialized custom applications. Data Transformation Services (DTS) in SQL Server 2000 is a powerful tool for defining many transformations. [5]

2.5.6 Develop Analysis Applications

The applications that support data analysis by the data warehouse users are constructed in this phase of data warehouse development. OLAP cubes and data mining models are constructed using Analysis Services tools, and client access to analysis data is supported by the Analysis Server. Other analysis applications include Excel PivotTables, predefined reports, Web sites, and natural language applications that use English Query. Specialized third-party analysis tools are also acquired and implemented or installed.

2.5.7 Test and Install the System

It is important to involve users in the testing phase. After initial testing by development and test groups, users should load the system with queries and use it the way they intend to after the system is brought on line. Substantial user involvement in testing will provide a significant number of benefits.
Among the benefits are:

– Discrepancies can be found and corrected.
– Users become familiar with the system.
– Index tuning can be performed.

It is important that users exercise the system during the test phase with the kinds of queries they will be using in production. This can enable a considerable amount of empirical index tuning to take place before the system comes online. Additional tuning needs to take place after deployment, but starting with satisfactory performance is a key to success. Users who have participated in the testing and have seen performance continually improve as the system is exercised will be inclined to be supportive during the initial deployment phase as early issues are discovered and addressed. [5]

Chapter 3 Materialized View

A materialized view or indexed view is a special type of summary table that is constructed by aggregating one or more columns of data from a single table, or from a series of tables that are joined together. When queries are executed at an aggregation level satisfied by a materialized view, the cost-based optimizer automatically rewrites the query to take advantage of the most appropriate materialized view. Materialized views can dramatically improve query performance and significantly decrease the load on the system, because they require fewer logical reads to satisfy the query than the same query running against the base tables. Materialized views are a powerful feature that has been part of the Oracle RDBMS since version 8.1. When they are effectively implemented across an entire data warehouse, the total number of logical reads can be reduced by well over 90%.
Although materialized views are considered to be a data warehouse feature, they can also be employed in other environments, including Operational Data Stores (ODS), data marts, and reporting tables in OLTP environments, where end-users will perform rollup queries on the schema. [8] Materialized views within the data warehouse are transparent to the end user and to the database application. In SQL Server 2000 and 2005, a view that has a unique clustered index is referred to as an indexed view (the counterpart of a materialized view in Oracle). In the case of a non-indexed view, the portions of the view necessary to solve the query are materialized at run time. Any computations such as joins or aggregations are done during query execution for each query referencing the view. After a unique clustered index is created on the view, the view's result set is materialized immediately and persisted in physical storage in the database, saving the overhead of performing this costly operation at execution time. [10]

3.1 Materialized Views in Different Environments

3.1.1 Materialized Views for Distributed Computing

In distributed environments, you can use materialized views to replicate data at distributed sites and to synchronize updates done at those sites with conflict resolution methods. Materialized views used as replicas provide local access to data that otherwise would have to be accessed from remote sites. Materialized views are also useful in remote data marts. [9]

3.1.2 Materialized Views for Mobile Computing

You can also use materialized views to download a subset of data from central servers to mobile clients, with periodic refreshes and updates between clients and the central servers. [9]

3.1.3 Materialized Views for Data Warehouses

In data warehouses, you can use materialized views to precompute and store aggregated data such as the sum of sales. Materialized views in these environments are often referred to as summaries, because they store summarized data.
They can also be used to precompute joins, with or without aggregations. A materialized view eliminates the overhead associated with expensive joins and aggregations for a large or important class of queries. [9]

3.2 The Need for Materialized Views

Use materialized views in data warehouses to increase the speed of queries on very large databases. Queries to large databases often involve joins between tables, aggregations such as SUM, or both. These operations are expensive in terms of time and processing power. The type of materialized view you create determines how the materialized view is refreshed and used by query rewrite. Materialized views improve query performance by precalculating expensive join and aggregation operations on the database prior to execution and storing the results in the database. The query optimizer automatically recognizes when an existing materialized view can and should be used to satisfy a request. It then transparently rewrites the request to use the materialized view. Queries go directly to the materialized view and not to the underlying detail tables. In general, rewriting queries to use materialized views rather than detail tables improves response time. A materialized view can be partitioned, and you can define a materialized view on a partitioned table. You can also define one or more indexes on the materialized view. [9]

3.3 Uses of Materialized Views

The reason for using materialized views is relatively straightforward and is answered in a single word: performance. By precalculating the answers to the really hard questions, we greatly reduce the load on the machine, and we will experience:

– Fewer physical reads – there is less data to scan through.
– Fewer writes – we will not be sorting and aggregating as frequently.
– Decreased CPU consumption – we will not be calculating aggregates and functions on the data, as we will have already done that.
– Markedly faster response times – our queries will return incredibly quickly when a summary is used, as opposed to the details.
This will be a function of the amount of work we can avoid by using the materialized view. Materialized views will increase your need for one resource: more permanently allocated disk. We need extra storage space to accommodate the materialized views, of course, but for the price of a little extra disk space, we gain a lot of benefit. Materialized views work best in a read-only or read-intensive environment. They are not designed for use in a high-end OLTP environment, as they add overhead to modifications performed on the base tables in order to capture the changes. [11]

3.4 How Materialized Views Work

Materialized views may appear to be hard to work with at first. So, now that we can create a materialized view and show that it works, what are the steps Oracle will undertake to rewrite our queries? Normally, when QUERY_REWRITE_ENABLED is set to FALSE, Oracle will take your SQL as is, parse it, and optimize it. With query rewrite enabled, Oracle will insert an extra step into this process. After parsing, Oracle will attempt to rewrite the query to access some materialized view, instead of the actual table that it references. If it can perform a query rewrite, the rewritten query (or queries) is parsed and then optimized along with the original query. The query plan with the lowest cost from this set is chosen for execution. If it cannot rewrite the query, the original parsed query is optimized and executed as normal. [11]

3.5 Types of Materialized Views

The following are the types of materialized views.

3.5.1 Types of Materialized Views on the Basis of Tables

There are different types of materialized views. One is the simple view, while another is the complex view. A simple materialized view can only be created on the basis of a single table and does not perform set operations, joins, or GROUP BY. For example:

CREATE VIEW inventory AS
    SELECT isbn, title, retail_price
    FROM books
    WITH READ ONLY;
A complex materialized view includes more than one table and may also perform set operations, joins, or GROUP BY. For example:

CREATE VIEW balancedue AS
    SELECT customer#, order#, SUM(quantity * retail) amtdue
    FROM customers
        JOIN orders USING (customer#)
        JOIN orderitems USING (order#)
        JOIN books USING (isbn)
    GROUP BY customer#, order#;

3.5.2 Some Other Types of Materialized Views

The following are further types of materialized views.

Read-only materialized views

You can make a materialized view read-only during creation by omitting the FOR UPDATE clause. In addition, using read-only materialized views eliminates the possibility of a materialized view introducing data conflicts at the master site or master materialized view site, although this convenience means that updates cannot be made at the remote materialized view site. The following is an example of a read-only materialized view (at a materialized view site, the master table would normally be referenced through a database link): [12]

CREATE MATERIALIZED VIEW hr.employees AS
    SELECT * FROM hr.employees;

Updatable materialized views

You can make a materialized view updatable during creation by including the FOR UPDATE clause. For changes made to an updatable materialized view to be pushed back to the master during refresh, the updatable materialized view must belong to a materialized view group. Updatable materialized views enable you to decrease the load on master sites because users can make changes to the data at the materialized view site. The following is an example of an updatable materialized view: [12]

CREATE MATERIALIZED VIEW hr.departments FOR UPDATE AS
    SELECT * FROM hr.departments;

Writeable materialized views

A writeable materialized view is one that is created using the FOR UPDATE clause but is not part of a materialized view group. Users can perform DML operations on a writeable materialized view, but if you refresh the materialized view, then these changes are not pushed back to the master and the changes are lost in the materialized view itself.
Writeable materialized views are typically allowed wherever fast-refreshable read-only materialized views are allowed. [12]

Conventional materialized views

A conventional materialized view blindly materializes and maintains all rows of a view, even rows that are never accessed. [14]

Dynamic materialized views

A dynamic materialized view is a more flexible materialization strategy aimed at reducing storage space and view maintenance costs. It selectively materializes only a subset of rows, for example, the most frequently accessed rows. One or more control tables are associated with the view and define which rows are currently materialized. Dynamic materialized views greatly reduce storage requirements and maintenance costs while achieving better query performance with improved buffer pool efficiency. [14]

3.6 Advantages and Disadvantages

3.6.1 Advantages

– Useful for summarizing, pre-computing, replicating and distributing data
– Faster access for expensive and complex joins
– Transparent to end-users
– MVs can be added/dropped without invalidating coded SQL [13]

3.6.2 Disadvantages

– Performance costs of maintaining the views
– Storage costs of maintaining the views [13]

3.7 Materialized View Refresh Methods

The following refresh methods are supported by Oracle:

– Complete – build from scratch
– Fast – only apply the data changes
– Force – try a fast refresh; if that is not possible, do a complete refresh
– Never – never refresh the materialized view [15]

3.8 Creating a Materialized View

A materialized view can be created with the CREATE MATERIALIZED VIEW statement or using Oracle Enterprise Manager. The following command creates the materialized view store_sales_mv.
CREATE MATERIALIZED VIEW store_sales_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE
    ENABLE QUERY REWRITE
AS
    SELECT s.store_name, SUM(dollar_sales) AS sum_dollar_sales
    FROM store s, fact f
    WHERE f.store_key = s.store_key
    GROUP BY s.store_name;

3.9 Indexed Views in SQL Server 2000

An indexed view is a view that stores its result data so that the data can be used later by subsequent queries on that view. This means that the next time you query the view, the database does not have to go back to the underlying tables, but instead gets the data from the view's storage. If you have a query that is complicated and consumes a lot of time and resources, then it is better to store the result, and next time simply go to the stored result. SQL Server 2000 indexed views are similar to materialized views in Oracle: the result set is stored in the database, and query performance can be dramatically enhanced by using them. You create an indexed view by implementing a UNIQUE CLUSTERED index on the view. The results of the view are stored in the leaf-level pages of the clustered index. An indexed view automatically reflects modifications made to the data in the base tables after the index is created, the same way an index created on a base table does. As modifications are made to the data in the base tables, the data modifications are also reflected in the data stored in the indexed view. The requirement that the clustered index of the view be unique improves the efficiency with which SQL Server 2000 can find the rows in the index that are affected by any data modification. The SQL Server 2000 query optimizer automatically determines whether a given query will benefit from using an indexed view. Create indexed views when the performance gain of improved speed in retrieving results outweighs the increased maintenance cost, the underlying data is infrequently updated, and queries perform a significant amount of joins and aggregations that either process many rows or are performed frequently by many users.
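As a hedged sketch of the SQL Server 2000 mechanism just described, the following shows an indexed view that precomputes an aggregation, together with a query the optimizer can satisfy from the stored result. The table, view, and index names are hypothetical:

```sql
-- Hypothetical example: an indexed view that precomputes order totals.
-- SCHEMABINDING is required for a view that will be indexed, and a
-- grouped view must include COUNT_BIG(*).
CREATE VIEW dbo.order_totals_view
WITH SCHEMABINDING
AS
    SELECT customer_id,
           SUM(order_amount) AS total_amount,
           COUNT_BIG(*)      AS row_count   -- required when the view uses GROUP BY
    FROM dbo.orders                         -- two-part name, as the restrictions require
    GROUP BY customer_id;
GO

-- The first index on the view must be unique and clustered;
-- creating it materializes the view's result set.
CREATE UNIQUE CLUSTERED INDEX idx_order_totals
    ON dbo.order_totals_view (customer_id);
GO

-- A query like this can now be answered from the stored result
-- instead of re-aggregating the base table.
SELECT customer_id, total_amount
FROM dbo.order_totals_view
WHERE customer_id = 42;
```

Note that in SQL Server 2000 Enterprise Edition the optimizer considers indexed views automatically; in other editions the query generally needs the WITH (NOEXPAND) hint on the view for the stored index to be used.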
3.9.1 Restrictions on Creating Indexed Views

Consider the following guidelines:

– The first index that you create on the view must be a UNIQUE CLUSTERED index.
– You must create the view with the SCHEMABINDING option.
– The view can reference base tables, but it cannot reference other views.
– You must use two-part names to reference tables.

3.9.2 Create the Indexed View or Materialized View

The following procedure is used for the creation of an indexed view in SQL Server 2000. Before creating the view, set the following options:

SET NUMERIC_ROUNDABORT OFF
GO
SET ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL ON
GO
SET ARITHABORT, QUOTED_IDENTIFIER, ANSI_NULLS ON
GO
IF EXISTS (SELECT name FROM sysobjects WHERE name = 'scabbiesdata_view' AND type = 'v')
    DROP VIEW scabbiesdata_view
GO
CREATE VIEW scabbiesdata_view
WITH SCHEMABINDING
AS
    SELECT keycol, Age, Gender, residence, education, monthlyincome, scabbies_class
    FROM dbo.scabbiesdata
GO
CREATE UNIQUE CLUSTERED INDEX scabbiesindex
    ON scabbiesdata_view (keycol)
GO

Indexed views are needed in a data warehouse environment more than in an OLTP environment, and in huge databases rather than in a table that has three records. We do not want to use a materialized view for a query such as selecting the customers whose location is New York; we want to use a materialized view for a complicated view that performs outer joins, self-joins, unions, and aggregation functions, yet all of those operations are not allowed in an indexed view. The second point, where the SQL Server model has an advantage over Oracle, is that the SQL Server model is dynamic, which means that changed data in the underlying tables is immediately reflected in the view.

Chapter 4 Integration of Data Warehouse and Data Mining

4.1 Introduction

A data warehouse (DW) is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
Sophisticated OLAP tools, which facilitate multidimensional analysis, are used. Business trends are identified using data mining (DM) tools and applying complex business models. Before the warehouse is actually usable, the ETL process (extraction, transformation, loading) needs to be completed. Data warehousing is the process of taking data from legacy and transaction database systems and transforming it into organized information in a user-friendly format to encourage data analysis and support fact-based business decision-making. The process that involves transforming data from its original format to a dimensional data store accounts for at least 70 percent of the time, effort, and expense of most data warehouse projects. As this is a very costly and critical part of a data warehouse implementation, there is a variety of data extraction and data cleaning tools, and load and refresh utilities, for the DW. Different data mining techniques are used to facilitate the integration of data in the DW [22]. Data mining is the essential process where intelligent methods are applied in order to extract data patterns. SAS defines data mining as the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for a business advantage. Data mining is the activity of extracting hidden information (patterns and relationships) from large databases automatically: that is, without benefit of human intervention or initiative in the knowledge discovery process. Data mining is the step in the process of knowledge discovery in databases that inputs predominantly cleaned, transformed data, searches the data using algorithms, and outputs patterns and relationships to the interpretation/evaluation step of the KDD process.

4.2 Data Integration

Integration is one of the most important characteristics of the data warehouse. Data is fed from multiple disparate sources into the data warehouse.
As the data is fed, it is converted, reformatted, summarized, and so forth. The result is that data, once it resides in the data warehouse, has a single physical corporate image. Many problems arise in this process. Designers of different applications made their design decisions over the years in different ways. In the past, when application designers built an application, they never considered that the data they were operating on would ever have to be integrated with other data. Such a consideration was only a wild theory. Consequently, across multiple applications there is no consistency in encoding, naming conventions, physical attributes, measurement of attributes, and so forth. Each application designer has had free rein to make his or her own design decisions. The result is that any application is very different from any other application. One simple example of lack of integration is data that is not encoded consistently, as shown by the encoding of gender. In one application, gender is encoded as m or f. In another, it is encoded as 0 or 1. As data passes to the data warehouse, the applications' different values must be correctly deciphered and recoded with the proper value. This consideration of consistency applies to all application design issues, such as naming conventions, key structure, measurement of attributes, and physical characteristics of data. Some of the same data exists in various places with different names, some data is labeled the same way in different places, some data is all in the same place with the same name but reflects a different measurement, and so on [22].

4.3 Schema Integration

The most important issue in data integration is schema integration. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. Terms may be given different interpretations at different sources.
For example, how can a data analyst be sure that customer_id in one database and cust_number in another refer to the same entity? Data mining algorithms can be used to discover implicit information about the semantics of the data structures of the information sources. Often, the exact meaning of an attribute cannot be deduced from its name and data type. The task of reconstructing the meaning of attributes would be optimally supported by dependency modeling using data mining techniques and mapping this model against expert knowledge, e.g., business models. Association rules are suited for this purpose. Other data mining techniques, e.g., classification tree and rule induction, and statistical methods, e.g., multivariate regression and probabilistic networks, can also produce useful hypotheses in this context. Data mining and statistical methods can be used to induce integrity constraint candidates from the data. These include, for example, visualization methods to identify distributions for finding domains of attributes, or methods for dependency modeling. Other data mining methods can find intervals of attribute values that are rather compact and cover a high percentage of the existing values. Data mining methods can discover functional relationships between different databases when they are not too complex. A linear regression method would discover the corresponding conversion factors. If the type of functional dependency (linear, quadratic, exponential, etc.) is not known a priori, model search instead of parameter search has to be applied [22].

4.4 Redundancy

Redundancy is another important issue. An attribute may be redundant if it can be "derived" from another table, e.g., annual revenue. In addition to detecting redundancies between attributes, duplication can be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case). Some redundancies can be detected by correlation analysis.
For example, given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set [22].

4.5 Inconsistencies

Since a data warehouse is used for decision-making, it is important that the data in the warehouse are correct. However, since large volumes of data from multiple sources are involved, there is a high probability of errors and anomalies in the data. Real-world data tend to be incomplete, noisy, and inconsistent. Data cleansing is a non-trivial task in data warehouse environments. The main focus is the identification of missing or incorrect data (noise) and of conflicts between data of different sources, and the correction of these problems. Data cleansing routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Some examples where data cleaning becomes necessary are: inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries, and violation of integrity constraints. Typically, missing values are indicated by blank fields or special attribute values. A way to handle these records is to replace them by the mean, by the most frequent value, or by the value that is most common to similar objects. Simple transformation rules can be specified; e.g., "replace the string gender by sex". Missing values may also be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. [22]

Although many data-mining methodologies and systems have been developed in recent years, we contend that, by and large, present mining models lack human involvement, particularly in the form of guidance and user control. We believe that data mining is most effective when the computer does what it does best, such as searching large databases or counting.
This division of labor is best achieved through constraint-based mining, in which the user provides constraints that guide a search. Mining can also be improved by employing a multidimensional, hierarchical view of the data. Current data warehouse systems have provided a fertile ground for systematic development of this multidimensional mining. Together, constraint-based and multidimensional techniques can provide a more ad hoc, query-driven process that exploits the semantics of data more effectively than current stand-alone data-mining systems do. A data-mining system should support efficient processing and optimization of mining queries by providing a sophisticated mining-query optimizer.

Chapter 5 Data Mining

5.1 Data Mining Definition

Data mining (DM) is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage [16]. Put simply, data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should more appropriately have been named "knowledge mining from data", which is unfortunately somewhat long; for short, it is also called knowledge mining. There are many other terms, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases.
Knowledge discovery as a process is depicted in Figure 5.1 and consists of an iterative sequence of the following steps:

– Data cleaning (DC): to remove noise or irrelevant data.
– Data integration (DI): where multiple data sources may be combined.
– Data selection (DS): where data relevant to the analysis task are retrieved from the databases.
– Data transformation (DT): where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
– Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
– Pattern evaluation (PE): to identify the truly interesting patterns representing knowledge.
– Knowledge presentation (KP): where visualization and knowledge representation techniques are used to present the mined knowledge to the user [17].

Figure 5.1 Data mining as a process of knowledge discovery

5.2 Data Mining History

The past decade has seen explosive growth in database technology and the amount of data collected. Advances in data collection, the use of bar codes in commercial outlets, and the computerization of business transactions have flooded us with data. We have an unprecedented opportunity to analyze this data to extract more intelligent and useful information, and to discover interesting, useful, and previously unknown patterns. Due to the huge size of the data and the amount of computation involved in knowledge discovery, parallel processing is an essential component for any successful large-scale data mining application. Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not so obvious but useful information from large databases. Data mining has emerged as a key business intelligence technology.
The explosive growth of stored data has generated an information glut, as the storage of data alone does not bring about knowledge that can be used (a) to improve business and services and (b) to help develop new techniques and products. Data is the basic form of information that needs to be managed, sifted, mined, and interpreted to create knowledge. Discovering the patterns, trends, and anomalies in massive data is one of the grand challenges of the Information Age. Data mining emerged in the late 1980s, made great progress during the 1990s, and will continue its fast development in the years to come in this increasingly data-centric world. Data mining is a multidisciplinary field drawing on work from statistics, database technology, artificial intelligence, pattern recognition, machine learning, information theory, knowledge acquisition, information retrieval, high-performance computing, and data visualization. The aim of data mining is to extract implicit, previously unknown, and potentially useful (or actionable) patterns from data [18].

5.3 Data Mining Techniques

Data mining comprises many up-to-date techniques, such as classification (decision trees, the naive Bayes classifier, k-nearest neighbor, neural networks), clustering (k-means, hierarchical clustering, density-based clustering), and association (one-dimensional, multidimensional, multilevel, and constraint-based association). Many years of practice show that data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing (understandability, summary, presentation), a good understanding of problem domains, and domain expertise. Today's competitive marketplace challenges even the most successful companies to protect and retain their customer base, manage supplier partnerships, and control costs while at the same time increasing their revenue.
In a world of accelerating change, competitive advantage will be defined by the ability to leverage information to initiate effective business decisions before the competition does. Hence, in this age of global competition, accurate information plays a vital role in the insurance business. Data is not merely a record of business operations – it helps in achieving competitive advantage in the insurance sector. Thus, there is growing pressure on MIS managers to provide an information technology (IT) infrastructure that enables a decision support mechanism. This is possible provided the decision makers have online access to previous data; therefore, there is a need for developing a data warehouse. Data mining as a tool for customer relationship management has also proved to be a means of controlling costs and increasing revenues. In the last decade, machine learning has come of age through a number of approaches such as neural networks, statistical pattern recognition, fuzzy logic, and genetic algorithms. Among the most important applications of machine learning are classification, recognition, prediction, and data mining. Classification and recognition are very significant in many domains, such as multimedia, radar, sonar, optical character recognition, speech recognition, vision, agriculture, and medicine [18].

5.4 Classification in Data Mining

We discuss only classification here, because our selected topic is related to this functionality. Databases are rich with hidden information that can be used for making intelligent business decisions. Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification predicts categorical labels (or discrete values). For example, a classification model may be built to categorize bank loan applications as either safe or risky. Many classification methods have been proposed by researchers in machine learning, expert systems, statistics, and neurobiology.
Most algorithms are memory resident, typically assuming a small data size. Recent database mining research has built on such work, developing scalable classification techniques capable of handling large, disk-resident data. These techniques often consider parallel and distributed processing. There are several basic techniques for data classification, such as decision tree induction, Bayesian classification and Bayesian belief networks, and neural networks. Other approaches to classification include k-nearest neighbor classifiers, case-based reasoning, genetic algorithms, rough sets, and fuzzy logic techniques.

5.4.1 Classification

Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database rows described by attributes. Each row is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data rows are also referred to as samples, examples, or objects. The data rows analyzed to build the model collectively form the training data set. The individual rows making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is 'supervised' in that it is told to which class each training sample belongs). Typically, the learned model is represented in the form of classification rules, decision trees, or mathematical formulae. In the second step, the model is used for classification. First, the predictive accuracy of the model (or classifier) is estimated using a test set of class-labeled samples; these samples are randomly selected and are independent of the training samples.
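The two-step process can be illustrated with a toy sketch (the loan data and the trivial model below are illustrative, not from the thesis): step 1 learns a classifier from class-labeled training samples; step 2 estimates its predictive accuracy on an independent, class-labeled test set.

```python
# Toy sketch of the two-step classification process (illustrative data).
from collections import Counter

def train(training_samples):
    """Step 1 (supervised learning): learn a trivial model that maps each
    income level to its most frequent class label in the training set."""
    by_income = {}
    for sample, label in training_samples:
        by_income.setdefault(sample["income"], []).append(label)
    return {income: Counter(labels).most_common(1)[0][0]
            for income, labels in by_income.items()}

def accuracy(model, test_samples):
    """Step 2: fraction of independent test samples classified correctly."""
    hits = sum(1 for s, label in test_samples if model[s["income"]] == label)
    return hits / len(test_samples)

training = [({"income": "high"}, "safe"), ({"income": "high"}, "safe"),
            ({"income": "low"}, "risky"), ({"income": "low"}, "safe"),
            ({"income": "low"}, "risky")]
test = [({"income": "high"}, "safe"), ({"income": "low"}, "risky"),
        ({"income": "low"}, "safe")]

model = train(training)            # step 1: build the classifier
print(model)                       # {'high': 'safe', 'low': 'risky'}
print(accuracy(model, test))       # step 2: estimated predictive accuracy
```

A real classifier would, of course, be a decision tree or similar model rather than a frequency table, but the train-then-test structure is the same.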
5.4.2 Issues related to classification

When preparing the data for classification, the following preprocessing steps may be applied in order to help improve the accuracy, efficiency, and scalability of the classification process [19].

Data cleaning. This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and to treat missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanism for handling noisy or missing data, this step can help reduce confusion during learning.

Relevance analysis. Many of the attributes in the data may be irrelevant to the classification task. For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application. Furthermore, other attributes may be redundant. Hence, relevance analysis may be performed on the data with the aim of removing any irrelevant or redundant attributes from the learning process. In machine learning, this step is known as feature selection. Including such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting "reduced" feature subset, should be less than the time that would have been spent on learning from the original set of features. Hence, such analysis can help improve classification efficiency and scalability.

Data transformation. The data can be generalized to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes. For example, numeric values for the attribute income may be generalized to discrete ranges such as low, medium, and high.
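Two of these preprocessing steps can be sketched in a few lines (the income cutoffs and the sample values are assumed for illustration only):

```python
# Toy sketch of two preprocessing steps: filling a missing value with the
# attribute's most common value (data cleaning), and generalizing continuous
# income to the discrete ranges low/medium/high (data transformation).
from collections import Counter

def fill_missing(values):
    """Data cleaning: replace None with the most commonly occurring value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def generalize_income(value, low_cutoff=30000, high_cutoff=70000):
    """Data transformation: map a numeric income onto a concept-hierarchy
    level (the cutoffs here are arbitrary illustrative thresholds)."""
    if value < low_cutoff:
        return "low"
    if value < high_cutoff:
        return "medium"
    return "high"

print(fill_missing(["high", None, "high", "low"]))           # ['high', 'high', 'high', 'low']
print([generalize_income(v) for v in (12000, 45000, 95000)])  # ['low', 'medium', 'high']
```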
Similarly, nominal-valued attributes, like street, can be generalized to higher-level concepts, like city. Since generalization compresses the original training data, fewer input/output operations may be involved during learning. The data may also be normalized, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values of a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0. In methods that use distance measurements, this prevents attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes).

5.5 Decision Tree Technique for Classification

Among the classification techniques above, we select only the decision tree technique for our research area, in which we aim to improve the performance of the ID3 algorithm behind decision tree construction. We first discuss the decision tree process below with a training set, and then the steps of ID3 [19].

5.5.1 Decision Tree

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision making. A decision tree starts with a root node on which users take action. From this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome [21]. A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node. A typical decision tree is shown in Figure 5.2.
It represents the concept buys_computer; that is, it predicts whether or not a customer at AllElectronics is likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. In order to classify an unknown sample, the attribute values of the sample are tested against the decision tree: a path is traced from the root to a leaf node, which holds the class prediction for that sample. Decision trees can easily be converted to classification rules.

Figure 5.2 A decision tree

5.5.2 Generating classification rules from a decision tree

The decision tree of Figure 5.2 can be converted to classification IF-THEN rules by tracing the path from the root node to each leaf node in the tree. The rules extracted from Figure 5.2 are [19]:

IF age = "<30" AND student = no THEN buys_computer = no
IF age = "<30" AND student = yes THEN buys_computer = yes
IF age = "30-40" THEN buys_computer = yes
IF age = ">40" AND credit_rating = excellent THEN buys_computer = yes
IF age = ">40" AND credit_rating = fair THEN buys_computer = no

5.5.3 ID3 Algorithm

Originator of the ID3 algorithm

ID3 and its successors were developed by Ross Quinlan, who discovered it while working with Earl Hunt in the 1970s. He subsequently worked at Sydney University, the Rand Corporation in California, and UTS, returned to Sydney University, and spent several years at UNSW. He now runs his own company, RuleQuest (www.rulequest.com) [20].

Implementation of the ID3 algorithm

ID3 (Learning Set S, Attribute Set A, Attribute Values V) returns a decision tree.
Begin
    Load the learning set, create the decision tree root node 'rootNode', and
    add the learning set S to rootNode as its subset.
    For rootNode, first compute Entropy(rootNode.subset).
    If Entropy(rootNode.subset) == 0, then rootNode.subset consists of records
    that all have the same value for the class attribute; return a leaf node
    with decision attribute: attribute value.
    If Entropy(rootNode.subset) != 0, then compute the information gain of each
    attribute left (those not yet used in splitting), and find the attribute A
    with Maximum(Gain(S, A)).
    Create child nodes of rootNode and add them to rootNode in the decision tree.
    For each child of rootNode, apply ID3(S, A, V) recursively until a node with
    entropy = 0, or a leaf node, is reached.
End ID3.

We apply the above ID3 algorithm to the following training dataset, which contains 14 records. Each record is called a sample and carries a predefined class value. In this process, the entropy of the dataset and the gain value of each attribute are the important quantities for the ID3 algorithm. The following mathematical formulae are used for the calculation of entropy and gain:

Entropy(S) = - SUM_i ( p_i * log2(p_i) )        Eq. 5.1 Entropy equation

Entropy is used to measure the homogeneity of S, where S is a set of training samples and p_i is the proportion (relative frequency) of class i in S.

Gain(S, A) = Entropy(S) - SUM_{v in Values(A)} ( |Sv| / |S| ) * Entropy(Sv)        Eq. 5.2 Information gain equation

The following training dataset is used for the classification process that generates the decision tree. The set is denoted by S and is considered the root node. The root-node dataset contains fourteen records, or rows; each record is a sample along with a predefined value of the class attribute.

Table 5.1 Training dataset used for classification

key_col  Outlook   Temperature  Humidity  Windy   Class
1        sunny     hot          high      weak    no
2        sunny     hot          high      strong  no
3        overcast  hot          high      weak    yes
4        rain      mild         high      weak    yes
5        rain      cool         normal    weak    yes
6        rain      cool         normal    strong  no
7        overcast  cool         normal    strong  yes
8        sunny     mild         high      weak    no
9        sunny     cool         normal    weak    yes
10       rain      mild         normal    weak    yes
11       sunny     mild         normal    strong  yes
12       overcast  mild         high      strong  yes
13       overcast  hot          normal    weak    yes
14       rain      mild         high      strong  no

The above dataset describes tennis players and the decision whether or not to play tennis. The dataset contains four input attributes, outlook, temperature, humidity, and windy, while class is the predicate (class) attribute. The outlook attribute contains three values: sunny, overcast, and rain. The temperature attribute also contains three values: hot, mild, and cool.
The humidity attribute contains two values, high and normal, while the last input attribute, windy, also contains two values, weak and strong. The predicate attribute contains the two values yes and no, so there are only two classes, yes and no. The dataset contains fourteen records, and key_col is the primary key column.

For the classification of the above dataset, we first require the entropy of the dataset and then calculate the gain of each attribute; after that, the maximum-gain attribute among the input attributes is selected for the classification. The entropy calculation for the dataset is shown below. First we count the records with the No class value, which are five (5), while the Yes class records are nine (9); the total number of records is fourteen (14).

Relative frequency of the No class: 5/14. Relative frequency of the Yes class: 9/14.

The entropy of the dataset S is calculated with the entropy formula above:

Entropy(5, 9) = -5/14 log2(5/14) - 9/14 log2(9/14) = 0.9403

Calculate the gain of each input attribute

1. Calculate the gain of the outlook attribute. To calculate an attribute's gain, we first determine the number of values of that attribute, and then partition the dataset S on the basis of each value. The outlook attribute has three values, rain, overcast, and sunny, so three subsets of S are possible:

I. The first subset, s1, contains five (5) records, corresponding to the rain value of outlook.
II. The second subset, s2, contains four (4) records, corresponding to the overcast value of outlook.
III. The third subset, s3, contains five (5) records, corresponding to the sunny value.

The proportionality measure for s1 is 5/14, for s2 is 4/14, and for s3 is 5/14.

Calculate the entropy of each subset

The first subset, S1, has three (3) Yes and two (2) No records.
Its total is five (5) records:

Entropy(3, 2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971

The second subset, S2, has four (4) Yes records out of four (4):

Entropy(4) = -4/4 log2(4/4) = 0

The third subset, S3, has three (3) No and two (2) Yes records out of five (5):

Entropy(3, 2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971

Calculate the attribute entropy

The following formula is used:

Entropy_A(S) = SUM_{v in Values(A)} ( |Sv| / |S| ) * Entropy(Sv)        Eq. 5.3 Attribute entropy equation

Entropy(S1, S2, S3) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2) + |S3|/|S| * Entropy(S3)

Entropy(outlook) = 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 = 0.694

Calculate the gain of the outlook attribute with the formula given above:

Gain(S, outlook) = Entropy(S) - Entropy(outlook) = 0.9403 - 0.694 = 0.2463

The above three steps are repeated for the three remaining input attributes. The following tables contain the gains of the attributes for the original set, the rain subset, and the sunny subset.

Table 5.2 Gain information for the original set

Attribute     Gain
Outlook       0.2463
Temperature   0.0292
Humidity      0.1518
Windy         0.0481

Table 5.3 Gain information for the rain subset

Attribute     Gain
Temperature   0.02
Humidity      0.02
Windy         0.971

Table 5.4 Gain information for the sunny subset

Attribute     Gain
Temperature   0.571
Humidity      0.971
Windy         0.02

Select the maximum-gain attribute for the classification of the dataset S. In Table 5.2 above, the attribute with the maximum gain value is outlook; its gain of 0.2463 is the highest. After this step we can split the dataset into three subsets on the basis of the outlook attribute values rain, overcast, and sunny. The classification is shown in Figure 5.3.

[Figure 5.3: Decision tree of the above dataset, with outlook as the root test attribute, humidity as the test attribute of the sunny branch (high, normal), and windy as the test attribute of the rain branch (weak, strong).]

ID3 is a recursive process, which is repeated further for each child subset.
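The entropy and gain arithmetic above, and the recursive splitting that repeats it on each child subset, can be checked with a short illustrative sketch (Python here for brevity; the thesis's actual implementation is the T-SQL of Appendix A). Entropy is Eq. 5.1, gain is Eq. 5.2, and the training set is the play-tennis data of Table 5.1.

```python
# Illustrative check of Eq. 5.1 (entropy), Eq. 5.2/5.3 (gain), and the
# recursive ID3 process on the Table 5.1 training set.
from collections import Counter
from math import log2

def entropy(labels):
    """Eq. 5.1: -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain(rows, attribute, target):
    """Eq. 5.2: Entropy(S) minus the weighted subset entropy of Eq. 5.3."""
    total = len(rows)
    by_value = Counter(r[attribute] for r in rows)
    weighted = sum(n / total * entropy([r[target] for r in rows if r[attribute] == v])
                   for v, n in by_value.items())
    return entropy([r[target] for r in rows]) - weighted

def id3(rows, attributes, target):
    """Recursive ID3: stop on a pure subset or when no attributes remain,
    otherwise split on the maximum-gain attribute and recurse."""
    labels = [r[target] for r in rows]
    if entropy(labels) == 0:                     # pure subset: leaf node
        return labels[0]
    if not attributes:                           # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attributes if a != best], target)
                   for v in sorted({r[best] for r in rows})}}

def classify(node, sample):
    """Trace a path from the root to a leaf using the sample's attribute values."""
    while isinstance(node, dict):
        attribute = next(iter(node))
        node = node[attribute][sample[attribute]]
    return node

columns = ("outlook", "temperature", "humidity", "windy", "class")
rows = [dict(zip(columns, t)) for t in [
    ("sunny", "hot", "high", "weak", "no"),          ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),      ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),       ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),      ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),    ("rain", "mild", "high", "strong", "no")]]

print(round(entropy([r["class"] for r in rows]), 4))   # 0.9403
print(round(gain(rows, "outlook", "class"), 4))        # 0.2467 (0.2463 in the text, from rounding)
tree = id3(rows, ["outlook", "temperature", "humidity", "windy"], "class")
print(tree)
print(classify(tree, {"outlook": "sunny", "humidity": "high"}))  # no
```

The sketch reproduces the tree of Figure 5.3 (outlook at the root, humidity under sunny, windy under rain). The slight difference between 0.2467 here and 0.2463 in the text comes only from rounding intermediate values to three decimal places.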
In every repetition, the entropy of the new set and the gain of each of its attributes are calculated. The recursive process is repeated until a single class is obtained or no input attributes remain for classifying the dataset. The following decision tree was generated in SQL Server during the practical work of this dissertation.

Figure 5.4 Decision tree generated in SQL Server 2000

5.6 Partial Integration of the Decision Tree in a Materialized View

Decision tree algorithms are recursive in nature. At each iteration, apart from other tasks, the next best attribute in the remaining attribute list is selected as the test attribute, and then, for each value of that attribute, a branch is grown from the test node. The process of determining the best attribute involves a tremendous amount of calculation. For example, in the most basic algorithm, ID3, a great deal of mathematics is carried out to select the best attribute: ID3 uses information gain to select a split attribute, and the attribute entropy is calculated using the formula in Eq. 5.3. Carrying out these calculations on a dataset where the data is updated frequently will obviously affect the efficiency of the algorithm.

A materialized (indexed) view is refreshed using various policies. But in order to have the latest data in hand to construct a classification model, the materialized view must be updated frequently, and from a decision tree perspective we know that even a single updated record means the calculations have to be carried out again to present exact statistics for entropy and gain. In our approach, we suggest creating a tabular structure in the data warehouse / database. This table will contain the information values required for the construction of the decision tree classifier. Moreover, this table will be dependent on the materialized view where the dataset for the decision tree is stored.
The structure that we have proposed for this dependent table is as under:

Table 5.5 Dependent Table

Att_name    Comp_type    Result
.........   ..........   .......
.........   ..........   .......
.........   ..........   .......

The values required for creating a decision tree using the ID3 and C4.5 algorithms will be stored in this structure. Each time, no matter which class label is used, the initially required values will be readily available to the algorithm. Instead of being calculated from within the algorithm, these values will be available in the dependent tabular structure associated with the materialized view.

5.6.1 Classification Experiment

Table 5.1 presents a training set of data tuples for classification. The class label attribute, CLASS, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (m = 2). Let class C1 correspond to YES and class C2 correspond to NO. Outlook, temperature, humidity, and windy are the input attributes. In the previous approach, the selection of the best attribute at the root node when creating the decision tree involved heavy calculation, carried out at both the dataset and the attribute level.

In our approach, this dataset is stored in a materialized view, and the dependent table contains the statistical values required for the selection of the best attribute at the root node. A significant efficiency is achieved, making the algorithm faster. An important point is that a single dataset can be used to construct various classification models; for this purpose, a separate target / output class is introduced. As soon as the new class attribute is introduced in the materialized view, a function is triggered to redo the calculations according to the new target class values, and these values are updated / stored in the dependent table, providing the values required to select the best attribute at the root node.
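The intent of the dependent table can be sketched as follows (an illustrative Python model, not the thesis's SQL Server implementation; the rows mirror the precomputed values of Section 5.6.1): the root split attribute is chosen by a simple lookup over stored (attribute, computation type, result) rows instead of recomputing entropy and gain over the whole dataset.

```python
# Sketch of selecting the root split attribute from precomputed
# (Att_name, Comp_type, Result) rows, as the dependent table proposes.
dependent_table = [
    ("class",       "Entropy", 0.9403),
    ("outlook",     "Entropy", 0.6935), ("outlook",     "Gain", 0.2468),
    ("temperature", "Entropy", 0.9111), ("temperature", "Gain", 0.0292),
    ("humidity",    "Entropy", 0.7885), ("humidity",    "Gain", 0.1518),
    ("windy",       "Entropy", 0.8922), ("windy",       "Gain", 0.0481),
]

def best_root_attribute(table):
    """Select the maximum-gain attribute from stored values: no runtime
    entropy/gain mathematics, just a lookup over the Gain rows."""
    gains = [(att, result) for att, comp_type, result in table if comp_type == "Gain"]
    return max(gains, key=lambda pair: pair[1])[0]

print(best_root_attribute(dependent_table))  # outlook
```

In the actual design these rows live in a database table dependent on the materialized view, and the refresh function keeps them consistent with the view's data.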
The dependent table will then look as follows:

Table 5.6 Resultant values in the dependent table

Attribute     Comp_type    Result
Class         Entropy      0.9403
Outlook       Entropy      0.6935
Outlook       Gain         0.2468
Temperature   Entropy      0.9111
Temperature   Gain         0.0292
Humidity      Entropy      0.7885
Humidity      Gain         0.1518
Windy         Entropy      0.8922
Windy         Gain         0.0481

The first row in this table contains the expected information required to split this dataset. The subsequent rows contain the attribute-wise entropy and gain in the corresponding columns. The partial integration of the decision tree attribute selection measure with the materialized view containing the training dataset in no way affects the accuracy of the classification model. There is no change in the intelligence approach; the required values are simply stored instead of being calculated at run time in memory. However, this integration gives the construction of the classification model a jump start, enhancing its overall efficiency.

5.6.2 Conclusion

Classification algorithms are memory resident, calculating various statistical values at runtime. Storing these statistical values, even just for the selection of the best attribute at the root node, greatly increases the performance of the classification algorithms. The materialized view holds the input training dataset, while the statistical values are stored in a dependent table. This table is updated according to the chosen policy; modern data warehouses offer many methods to update a materialized view. Each time a new target class is introduced or new data is loaded, the table containing the statistical values is updated accordingly. The accuracy of the algorithm is in no way affected, in either a positive or a negative direction. The significant improvement introduced is in efficiency, in the selection of the root-level attribute.
Appendix A
Code Section

Test_proc is the procedure used to find the entropy of the original training set, as well as the entropy and gain of each input attribute in the training set. The results calculated by Test_proc are stored inside the pre-calculated structure; the calculated gain values are then used for the classification at the root node to generate the decision tree.

ALTER procedure Test_proc @dataset varchar(20)
AS
Begin
    declare @loopvar int, @flag int
    declare @keyatt int, @datasetname varchar(40)
    declare @newkeyvalue int, @get_key int

    set @flag = 0
    delete from current_datasetEntropy

    declare datasetnames_table_cursor cursor local static for
        select * from datasetnames_table
    open datasetnames_table_cursor

    if @@cursor_rows = 0
    begin
        insert into datasetnames_table values(1, @dataset)
        exec Test_CalEntropy @dataset, 1, @flag
        exec Test_CalAttGain @dataset, 1
    end
    else
    begin
        set @loopvar = 1
        while @loopvar <= @@cursor_rows
        begin
            fetch next from datasetnames_table_cursor into @keyatt, @datasetname
            if @datasetname = @dataset
            begin
                set @flag = 1
                break
            end
            set @loopvar = @loopvar + 1
        end
        if @flag = 1
        begin
            exec Test_CalEntropy @dataset, @keyatt, @flag
        end
        else
        begin
            set @newkeyvalue = @@cursor_rows + 1
            insert into datasetnames_table values(@newkeyvalue, @dataset)
            exec Test_CalEntropy @dataset, @newkeyvalue, @flag
            exec Test_CalAttGain @dataset, @newkeyvalue
        end
    end

    close datasetnames_table_cursor
    deallocate datasetnames_table_cursor

    exec Test_searchkey @dataset, @get_key output
    select Attribute, comp_type, result
    from calculated_data
    where keyattribute = @get_key
End

Test_DecisionTree is another procedure, used to select the maximum-gain attribute from the pre-calculated structure. On the basis of this attribute, the original training set is classified into subsets before the C_ID3 procedure is called.
ALTER procedure Test_DecisionTree
(
    @dataset varchar(50),
    @minsplitsize int,
    @rec_Entgainvalue decimal,
    @rec_attributeNum int,
    @rec_leavesize int
)
As
begin
    declare @Entvalue float
    declare @Selatt varchar(100)
    declare @gainattribute varchar(80)
    declare @gainvalue float
    declare @str varchar(100)
    declare @size int, @loopvar int
    declare @sepattval varchar(100)
    declare @predicate_var varchar(100)
    declare @levelno1 int
    declare @attvalue varchar(100)
    declare @dsize int
    declare @countattribute int
    declare @rec_attribute varchar(50), @rec_value varchar(50), @rec_records int, @rec_per float
    declare @getkey int

    set @countattribute = 0
    exec Test_searchkey @dataset, @getkey output
    exec Test_GetEntropy @getkey, @dataset, @Entvalue OUTPUT, @Selatt OUTPUT, @attvalue OUTPUT
    exec C_datasetsize @dataset, @dsize output

    set @countattribute = @countattribute + 1
    delete from Treedata
    insert into Treedata values(0, @dataset, null, @dsize, 100)

    IF @dsize >= @minsplitsize
    begin
        IF @Entvalue = 0
        begin
            print @Selatt + ':' + @attvalue
            insert into Treedata values(1, @Selatt, @attvalue, @dsize, 100)
        end
        else
        begin
            exec Test_maxgain @getkey, @gainattribute output, @gainvalue output
            exec C_Getattribute_val @dataset, @gainattribute, @str output, @size output
            IF @gainvalue >= @rec_Entgainvalue and @countattribute <= @rec_attributeNum
            begin
                set @loopvar = 1
                while @loopvar <= @size
                begin
                    set @levelno1 = 1
                    exec C_sepattval @str, @sepattval output, @str output
                    print 'RULE *****'
                    print 'Level ' + cast(@levelno1 as varchar) + ':' + upper(@gainattribute)
                          + ':(' + @sepattval + ') ' + cast(@gainvalue as varchar)
                    insert into Treedata values(@levelno1, @gainattribute, @sepattval, @dsize, 100)
                    set @predicate_var = @gainattribute + ' = ' + '''' + @sepattval + ''''
                    set @levelno1 = @levelno1 + 1
                    exec C_ID3 @dataset, @gainattribute, @gainvalue, @predicate_var, 0, @levelno1,
                         @dsize, @minsplitsize, @rec_Entgainvalue, @countattribute,
                         @rec_attributeNum, @rec_leavesize
                    set @loopvar = @loopvar + 1
                end
            end
            else
            begin
                exec Maxrecordvalue @dataset, @rec_attribute output, @rec_value output,
                     @rec_records output, @rec_per output
                insert into Treedata values(1, @rec_attribute, @rec_value, @dsize, @rec_per)
            end
        end
    end
    else
    begin
        exec Maxrecordvalue @dataset, @rec_attribute output, @rec_value output,
             @rec_records output, @rec_per output
        insert into Treedata values(1, @rec_attribute, @rec_value, @dsize, @rec_per)
    end
End

C_ID3 is the procedure that contains the recursive process; it generates the remainder of the decision tree after the classification at the root node.

ALTER procedure C_ID3
(
    @dataset varchar(50), @gainatt varchar(70), @gainval float,
    @predicates varchar(200), @spacesize int, @levelno1 int,
    @rec_datasetsize float, @rec_minsplits int, @rec_Entgain decimal,
    @rec_attcounter int, @rec_attributeNos int, @rec_leavessize int
)
As
begin
    declare @Entvalue float, @Selatt varchar(200)
    declare @gainattribute varchar(90), @sepattval varchar(70), @str varchar(70)
    declare @gainvalue float, @getspaces varchar(100)
    declare @spacelen int, @size int, @loopvar int, @reserve_levelno int
    declare @pred_var varchar(200), @reserve_pred varchar(200)
    declare @attvalue varchar(80), @datas int, @per float, @rec_flag int
    declare @rec_attribute varchar(60), @rec_attvalue varchar(50), @rec_records int, @rec_per float
    declare @tablestr varchar(5000)

    set @reserve_pred = @predicates
    set @reserve_levelno = @levelno1
    exec C_newsubset @dataset, @gainatt, @predicates, @tablestr output
    exec (@tablestr)
    exec CalEntropydup 'temp'
    exec CalAttGaindup 'temp'
    exec C_datasetsize 'temp', @datas output
    exec C_GetEntropydup 'temp', @Entvalue OUTPUT, @Selatt OUTPUT, @attvalue OUTPUT
    exec C_space @spacesize, @getspaces output, @spacelen output
    set @spacesize = @spacelen
    set @per = round(cast(@datas as float) / @rec_datasetsize * 100, 2)
    set @rec_attcounter = @rec_attcounter + 1

    IF @datas >= @rec_minsplits and @datas >= @rec_leavessize
    begin
        if @Entvalue = 0
        begin
            print @getspaces + 'level' + cast(@levelno1 as varchar) + ':' + @Selatt
                  + ':(' + @attvalue + ')'
            insert into Treedata values(@levelno1, @Selatt, @attvalue, @datas, 100)
        end
        else
        begin
            exec C_maxgaindup @gainattribute output, @gainvalue output
            exec C_Getattribute_val 'temp', @gainattribute, @str output, @size output
            set @loopvar = 1
            if @gainvalue >= @rec_Entgain and @rec_attcounter <= @rec_attributeNos
            begin
                while @loopvar <= @size
                begin
                    exec C_sepattval @str, @sepattval output, @str output
                    if @gainvalue != 0
                    begin
                        print @getspaces + 'Level ' + cast(@levelno1 as varchar) + ':'
                              + upper(@gainattribute) + ':(' + @sepattval + ') '
                              + cast(@gainvalue as varchar)
                        insert into Treedata values(@levelno1, @gainattribute, @sepattval, @datas, @per)
                    end
                    set @pred_var = @gainattribute + ' = ' + '''' + @sepattval + ''''
                    set @predicates = @predicates + ' and ' + @pred_var
                    set @levelno1 = @levelno1 + 1
                    if @gainvalue != 0
                    begin
                        exec C_ID3 @dataset, @gainattribute, @gainvalue, @predicates, @spacesize,
                             @levelno1, @datas, @rec_minsplits, @rec_Entgain, @rec_attcounter,
                             @rec_attributeNos, @rec_leavessize
                    end
                    set @levelno1 = @reserve_levelno
                    set @pred_var = ''
                    set @predicates = @reserve_pred
                    set @loopvar = @loopvar + 1
                end
            end
            else
            begin
                exec Maxrecordvalue 'temp', @rec_attribute output, @rec_attvalue output,
                     @rec_records output, @rec_per output
                insert into Treedata values(@levelno1, @rec_attribute, @rec_attvalue, @datas, @rec_per)
            end
        end
    end
    else
    begin
        exec Maxrecordvalue 'temp', @rec_attribute output, @rec_attvalue output,
             @rec_records output, @rec_per output
        insert into Treedata values(@levelno1, @rec_attribute, @rec_attvalue, @datas, @rec_per)
    end
End

Appendix B
Application Interface

This is the main interface for accessing the SQL Server 2000 database. It contains the training dataset as well as the pre-calculated structures, which hold the entropy of the dataset and the entropy and gain information of each attribute of the given dataset. It also shows the decision tree of the given dataset.
The decision tree is used for the classification of the test dataset, because it generates the different classification rules.

[Screenshots 1-6: the datasets and their corresponding outputs.]

Appendix C
References

[1] C. Imhoff, N. Galemmo, and J. G. Geiger, Mastering Data Warehouse Design: Relational and Dimensional Techniques
[2] Abhishek Sugandhi, Data Warehouse Design Considerations
[3] SQL Server 7.0 Data Warehousing Training Kit, Microsoft
[4] http://www.peterindia.net/DataWarehousingView.html
[5] SQL Server 2000 Resource Kit
[6] Behrooz Seyed-Abbassi, Teaching Effective Methodologies to Design a Data Warehouse, University of North Florida, Jacksonville, Florida 32224, United States
[7] http://technet.microsoft.com/en-us/library
[8] www.ioug.org/client_files/members/select_pdf/05q2/SelectQ205_Maresh.pdf
[9] http://www.cs.uvm.edu/oracle9doc/server.901/a90237/mv.htm#38255
[10] http://www.microsoft.com/technet/prodtechnol/sql/2005/impprfiv.mspx
[11] http://www.akadia.com/services/ora_materialized_views.html
[12] http://download.oracle.com/docs/cd/B10501_01/server.920/a96567/repmview.htm
[13] www.nocoug.org/download/2003-05/materialized_v.ppt
[14] http://ieeexplore.ieee.org/Xplore
[15] http://www.oracle.com/technology/products/oracle9i/daily/jul05.htm
[16] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques
[17] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Simon Fraser University
[18] S. Sumathi and S. N. Sivanandam, Introduction to Data Mining and its Applications
[19] Data Mining: Concepts and Techniques, Morgan Kaufmann
[20] Induction of Decision Tree.html
[21] W. Peng, J. Chen, and Haiping Zhou, An Implementation of ID3 Decision Tree Learning Algorithm, University of New South Wales, School of Computer Science & Engineering, Sydney, NSW 2032, Australia
[22] Kalinka Mihaylova Kaloyanova, Improving Data Integration for Data Warehouse: A Data Mining Approach, University of Sofia