1 MSIT-116C: Data Warehousing and Data Mining 2 _____________________________________________________________ Course Design and Editorial Committee Prof. M.G.Krishnan Vice Chancellor Karnataka State Open University Mukthagangotri, Mysore – 570 006 Prof. Vikram Raj Urs Dean (Academic) & Convener Karnataka State Open University Mukthagangotri, Mysore – 570 006 Head of the Department and Course Co-Ordinator Rashmi B.S Assistant Professor & Chairperson DoS in Information Technology Karnataka State Open University Mukthagangotri, Mysore – 570 006 Course Editor Ms. Nandini H.M Assistant Professor of Information Technology DoS in Information Technology Karnataka State Open University Mukthagangotri, Mysore – 570 006 Course Writers Dr. B. H. Shekar Dr. Manjaiah Associate Professor Professor Department of Computer Science Department of Computer Science Mangalagangothri Mangalagangothri Mangalore Mangalore Publisher Registrar Karnataka State Open University Mukthagangotri, Mysore – 570 006 Developed by Academic Section, KSOU, Mysore Karnataka State Open University, 2014 All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from the Karnataka State Open University. Further information on the Karnataka State Open University Programmes may be obtained from the University‘s Office at Mukthagangotri, Mysore – 6. Printed and Published on behalf of Karnataka State Open University, Mysore-6 by the Registrar (Administration) 3 Karnataka State Open University Mukthagangothri, Mysore – 570 006 Third Semester M.Sc in Information Technology MSIT-116C: Data Warehousing and Data Mining Module 1 Unit-1 Basics of Data Mining and Data Warehousing 001-020 Unit-2 Data Warehouse and OLAP Technology: An Overview 021-060 Unit-3 Data Cubes and Implementation 061-083 Unit-4 Basics of Data Mining 084-102 Module 2 Unit-5 Frequent Patterns for Data Mining 103-117 Unit-6 FP Growth Algorithms 118-128 Unit-7 Classification and Prediction 129-138 Unit-8 Approaches for Classification 139-165 4 Module 3 Unit-9 Classification Techniques 166-191 Unit-10 Genetic Algorithms, Rough Set and Fuzzy Sets 192-212 Unit-11 Prediction Theory of Classifiers 213-236 Unit-12 Algorithms for Data Clustering 237-259 Module 4 Unit-13 Cluster Analysis 260-276 Unit-14 Spatial Data Mining 277-290 Unit-15 Text Mining 291-308 Unit-16 Multimedia Data Mining 309-334 5 PREFACE The objective of data mining is to extract the relevant information from a large collection of information. The large of amount of data exists due to advances in sensors, information technology, and high-performance computing which is available in many scientific disciplines. These data sets are not only very large, being measured in terabytes and peta bytes, but are also quite complex. This complexity arises as the data are collected by different sensors, at different times, at different frequencies, and at different resolutions. Further, the data are usually in the form of images or meshes, and often have both a spatial and a temporal component. These data sets arise in diverse fields such as astronomy, medical imaging, remote sensing, nondestructive testing, physics, materials science, and bioinformatics. This increasing size and complexity of data in scientific disciplines has resulted in a challenging problem. Many of the traditional techniques from visualization and statistics that were used for the analysis of these data are no longer suitable. 
Visualization techniques, even for moderate-sized data, are impractical due to their subjective nature and human limitations in absorbing detail, while statistical techniques do not scale up to massive data sets. As a result, much of the data collected are never even looked at, and the full potential of our advanced data collecting capabilities is only partially realized. Data mining is the process concerned with uncovering patterns, associations, anomalies, and statistically significant structures in data. It is an iterative and interactive process involving data preprocessing, search for patterns, and visualization and validation of the results. It is a multidisciplinary field, borrowing and enhancing ideas from domains including image understanding, statistics, machine learning, mathematical optimization, high-performance computing, information retrieval, and computer vision. Data mining techniques hold the promise of assisting scientists and engineers in the analysis of massive, complex data sets, enabling them to make scientific discoveries, gain fundamental insights into the physical processes being studied, and advance their understanding of the world around us. We introduce basic concepts and models of Data Mining (DM) system from a computer science perspective. The focus of the course will be on the study of different approaches for data mining, models used in the design of DM system, search issues, text and multimedia data clustering techniques. Different types of clustering and classification techniques are also discussed which find applications in diversified fields. This course will empower the students to know how to design data mining systems and in depth analysis is provided to design multimedia based data mining systems. This concise text book provides an accessible introduction to data mining and organization that supports a foundation or module course on data mining and data warehousing covering a broad 6 selection of the sub-disciplines within this field. The textbook presents concrete algorithms and applications in the areas of business data processing, multimedia data processing, text mining etc. Organization of the material: The book introduces its topics in ascending order of complexity and is divided into four modules, containing four units each. In the first module, we begin with an introduction to data mining highlighting its applications and techniques. The basics of data mining and data warehousing concepts along with OLAP technology is discussed in detail. In the second module, we discussed the approaches to data mining. The frequent pattern mining approach is presented in detail. The role of classification and association rule based classification is also presented. We have also presented the prediction model of classification and different approaches for classification. The third module contains basics of soft computing paradigms such as fuzzy theory, rough sets and genetic algorithms which are the basis for designing data mining algorithms. Algorithms of data clustering are presented in this unit in detail which is central to any data mining techniques. In the fourth module, metrics for cluster analysis are discussed. In addition, the data mining concept for spatial data, textual data and multimedia data are presented in detail in this module. Every module covers a distinct problem and includes a quick summary at the end, which can be used as a reference material while reading data mining and data warehousing. 
Much of the material found here is interesting as a view into how data mining works, even if you do not need it for a specific purpose. Happy reading to all the students.

UNIT-1: BASICS OF DATA MINING AND DATA WAREHOUSING

Structure
1.1 Objectives
1.2 Introduction
1.3 Data warehouse
1.4 Operational data store
1.5 Extraction transformation loading
1.6 Data warehouse metadata
1.7 Summary
1.8 Keywords
1.9 Exercises
1.10 References

1.1 Objectives
The objectives covered under this unit include:
The introduction to data mining and data warehousing
Techniques for data mining
Basics of operational data stores (ODS)
Basics of extraction transformation loading (ETL)
Building the data warehouses
Role of metadata.

1.2 Introduction
What is data mining? The amount of data collected by organizations grows by leaps and bounds. The amount of data is increasing year after year and there may be payoffs in uncovering hidden information behind these data. Data mining is a way to gain market intelligence from this huge amount of data. The problem today is not the lack of data, but how to learn from it. Data mining mainly deals with structured data organized in a database. It uncovers anomalies, exceptions, patterns, irregularities or trends that may otherwise remain undetected under the immense volumes of data.

What is data warehousing? A data warehouse is a database designed to support decision making in an organization. Data from the production databases are copied to the data warehouse so that queries can be performed without disturbing the performance or the stability of the production systems. For data mining to occur, it is crucial that data warehousing is present. An example of how well data warehousing and data mining have been utilized is Walmart. Walmart maintains a 7.5 TB data warehouse. Point of Sale (POS) transaction data are captured from over 2,900 stores across 6 countries and transmitted to Walmart's data warehouse. Walmart then allows its suppliers to access the data to collect information on their products and analyse how they can improve their sales. These suppliers can then better understand customer buying patterns and manage local store inventory, etc.

Data mining techniques: What are they and how are they used? Data mining is not a method of attacking the data; on the contrary, it is a way of learning from the data and then using that information. For that reason, we need a new mind-set in data mining. We must be open to finding relationships and patterns that we never imagined existed. We let the data tell us the story rather than impose a model on the data that we feel will replicate the actual patterns. There are four categories of data mining techniques/tools (Keating, 2008):
1. Prediction
2. Classification
3. Clustering Analysis
4. Association Rules Discovery

Prediction Tools: These are methods derived from traditional statistical forecasting for predicting a variable's value. The most common and important applications in data mining involve prediction. This technique involves traditional statistics such as regression analysis, multiple discriminant analysis, etc. Non-traditional methods used in prediction tools are Artificial Intelligence and Machine Learning.

Classification Tools: These are the most commonly used tools in data mining. Classification tools attempt to distinguish different classes of objects or actions. For example, in the case of a credit card transaction, these tools could classify it as legitimate or fraudulent. This will save the credit card company a considerable amount of money.
Clustering Analysis Tools: These are very powerful tools for clustering products into groups that naturally fall together. These groups are identified by the program and not by the researchers. Most of the clusters discovered may have little use in business decisions. However, one or two that are discovered may be extremely important and can be taken advantage of to give the business an edge over its competitors. The most common use for clustering tools is probably in what economists refer to as "market segmentation."

Association Rules Discovery: Here the data mining tools discover associations; e.g., what kinds of books certain groups of people read, what products certain groups of people purchase, what movies certain groups of people watch, etc. Businesses can use this information to target their markets. Online retailers like Netflix and Amazon use these tools quite intensively. For example, Netflix recommends movies based on movies people have watched and rated in the past. Amazon does something similar in recommending books when you re-visit their website.

The two major pieces of software used at the moment for data mining are PASW Modeller (formerly known as SPSS Clementine) and SAS Enterprise Miner. Both software packages include an array of capabilities that enable the data mining tools mentioned above. Newcomers to data mining can use an Excel add-in called XLMiner, available from Resampling Stats, Inc. This Excel add-in lets potential data miners not only examine the usefulness of such a program but also get familiar with some of the data mining techniques. Although Excel is quite limited in the number of observations it can handle, it can give the user a taste of how valuable data mining can be, without incurring too much cost first.

Examples of use of information extracted from data mining exercises
Data mining has been used to help in credit scoring of customers in the financial industry (Peng, 2004). Credit scoring can be defined as a technique that helps credit providers decide whether to grant credit to customers. Its most common use is in making credit decisions for loan applications. Credit scoring is also applied in decisions on personal loan applications, the setting of credit limits, managing existing accounts and forecasting the profitability of consumers and customers (Punch, 2000).

Data mining and data warehousing have been particularly successful in the realm of customer relationship management. By utilizing a data warehouse, retailers can embark on customer-specific strategies like customer profiling, customer segmentation, and cross-selling. By using the information in the data warehouse, the business can divide its customers into four quadrants of customer segmentation: (1) customers that should be eliminated (i.e., they cost more than what they generate in revenues); (2) customers with whom the relationship should be re-engineered (i.e., those that have the potential to be valuable, but may require the company's encouragement, cooperation, and/or management); (3) customers that the company should engage; and (4) customers in which the company should invest (Buttle, 1999; Verhoef & Donkers, 2001). The company can then use the corresponding strategies to manage the customer relationships (Cunningham et al, 2006).

Data mining can also help in the detection of spam in electronic mail (email) (Shih et al, 2008). Data mining has also been used in healthcare and acute care.
A medical center in the US used data mining technology to help its physicians work more efficiently and reduce mistakes (Veluswamy, 2008). There are other examples which we will not deal with here that have been flagship success stories of data mining – the beer and diaper association; Harrah; Amazon and Netflix. Essentials before you data mine Apart from management buy in and financial backing, there are certain basics before you embark on a data mining project. As data mining can only uncover patterns already present in the data, the target dataset – you must already have the data and the data resides in a data warehouse or a data mart — which must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. The target set then needs to be ―cleaned‖. This process removes the observations with noise and missing data. The cleaned data is then reduced into feature vectors, one vector per observation. A feature vector is a summarised version of the raw data observation. Limitations of data mining The quality of data mining applications depends on the quality and availability of data. As the data set that needs to be mined should be of a certain quality, time and expense may be needed to ―clean‖ the data that need to be mined. Not to mention that the amount of data to be mined should be sufficiently large for the software to extract meaningful patterns and association. 11 Also, as data mining requires huge amounts of resources – man hours, and financially — the user must be a domain specialist and must understand business problems and be familiar with data mining tools and techniques, so that resources are not wasted on a data mining project that will fail at the start. Also, once data have been mined, it is up to the management and decision makers to use the information that has been extracted. Data mining is not the end all and the magic wand that points the organization to what it should do. Human intellect and business acumen of the decision makers is still very much required to make any sense out of the information that is extracted from a data mining exercise. Some issues surrounding data mining and data warehousing 1. You’ve data mined – do you think that the bosses will take the proper and appropriate action – the dichotomy between use of sophisticated data mining software and techniques and the conventionality of how organizations make decisions Brydon and Gemino (2008) highlighted the dichotomy between the use of sophisticated data mining software and techniques as opposed to the conventionality of how organisations make decisions. They believed, rightly so, that ―tools and techniques for data mining and decision making integration are still in their infancy. Firms must be willing to reconsider the ways in which they make decisions if they are to realize a payoff from their investments in data mining technology.‖ 2. One size fits all data mining packages for industry. Does this fit the purpose of data mining at all? There are now available ―one size fits all‖ vertical applications for certain industries/ industry segments developed by consultants. The consultants market these packages to all competitors within that segment. This poses a potential risk for companies who are new to data mining as when they explore the technique and these vertical ―off the shelf‖ solutions that their competitors can also easily obtain. 
Nevertheless, having said that, the application of this technology is limited only by our imagination, so it is up to the companies to decide how and why they wish to use the technology. They should also be aware of the fact that data mining is a long and resource-intensive exercise which an "off the shelf" solution deceptively presents as easy and affordable. Only companies that learn to be comfortable in utilising these tools on all varieties of company data will benefit.

3. The use of data mining for prediction – use in non-commercial and "problematic" areas, e.g. the prediction of terrorist acts
In 2002, the US government embarked on a massive data mining effort called Total Information Awareness. The basic idea was to collect as much data as possible on everyone, sift it through massive computers, and investigate patterns that might indicate terrorist plots (Schneier, 2006). However, a backlash of public opinion drove the US Congress to stop funding the programme. Nevertheless, there is a belief that the programme just changed its name and moved inside the walls of the US Defence Department (Harris, 2006). According to Schneier (2006), data mining will fail in such a situation because terrorist plots are different from credit card fraud: terrorist acts have no well-defined profile and attacks are very rare. "Taken together, these facts mean that data-mining systems won't uncover any terrorist plots until they are very accurate, and that even very accurate systems would be so flooded with false alarms that they will be useless." This highlights the principle pointed out earlier in this unit – data mining is not a panacea for all information problems and is not a magic wand to guide anyone out of the wilderness.

4. Ethical concerns over data warehousing and data mining – do you have any? Should companies be concerned?
Data mining produces results only if it works with high volumes of information at its disposal. With the larger amounts of data that need to be gathered, should we also be concerned with the ethics behind the collection and use of those data? As highlighted by Linstedt (2004), the implementers of the technology are simply told to integrate data and the project manager builds a project to make it happen – these people simply do not have the time to ponder whether the data have been handled ethically. Linstedt proposes a checklist for project managers and technology implementers to address ethical concerns over data:
Develop SLAs with end users that define who has access to what levels of information.
Have end-users involved in defining the ethical standards of use for the data that will be delivered.
Define the bounds around the integration efforts of public data, where it will be integrated and where it will not, so as to avoid conflicts of interest.
Do not use "live" or real data for testing purposes, or lock down the test environment; too often test environments are left wide open and accessible to too many individuals.
Define where, how, and by whom data mining will be used; restrict the mining efforts to specific sets of information. Build a notification system to monitor data mining usage.
Allow customers to "block" the integration of their own information (this one is questionable), depending on whether the customer information will be made available on the web after integration.
Remember that any efforts made are still subject to governmental laws. Nothing is sacred. If a government wants access to the information, they will get it.
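As a purely illustrative aside before moving on to data warehouses, the short Python sketch below shows the flavour of the clustering-based customer segmentation discussed earlier in this unit. The feature names and figures are invented, and scikit-learn is assumed to be available; a real segmentation exercise would use far more data and careful feature preparation.

from sklearn.cluster import KMeans

# Each row describes one (hypothetical) customer:
# [annual revenue generated, annual cost to serve, purchases per year]
customers = [
    [1200.0, 200.0, 24],
    [90.0, 300.0, 2],
    [800.0, 150.0, 18],
    [60.0, 250.0, 1],
    [1500.0, 400.0, 30],
    [700.0, 100.0, 15],
]

# Let the algorithm find groups that "naturally fall together"; in practice the
# features would be standardized first so that no single column dominates.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

# The program identifies the groups, but the analyst must decide what each
# segment means for the business (e.g. invest, re-engineer, or eliminate).
for row, label in zip(customers, labels):
    print("customer", row, "-> segment", label)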
1.3 Data warehouse In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting (1) and data analysis (2). Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons. The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc., shown in the figure to the right). The data may pass through an operational data store for additional operations before it is used in the DW for reporting. The typical extract transform load (ETL)-based data warehouse uses staging, data integration, and access layers to house its key functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.[1] A data warehouse constructed from integrated data source systems does not require ETL, staging databases, or operational data store databases. The integrated data source systems may be considered to be a part of a distributed operational data store layer. Data federation methods or data virtualization methods may be used to access the distributed integrated source data systems to consolidate and aggregate data directly into the data warehouse database tables. 14 Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated since there is no transformation of dimensional or reference data. This integrated data warehouse architecture supports the drill down from the aggregate data of the data warehouse to the transactional data of the integrated source data systems. A data mart is a small data warehouse focused on a specific area of interest. Data warehouses can be subdivided into data marts for improved performance and ease of use within that area. Alternatively, an organization can create one or more data marts as first steps towards a larger and more complex enterprise data warehouse. This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata. Difficulties of Implementing Data Warehouses Some significant operational issues arise with data warehousing: construction, administration, and quality control. 
Project management—the design, construction, and implementation of 15 the warehouse—is an important and challenging consideration that should not be underestimated. The building of an enterprise-wide warehouse in a large organization is a major undertaking, potentially taking years from conceptualization to implementation. Because of the difficulty and amount of lead time required for such an undertaking, the widespread development and deployment of data marts may provide an attractive alternative, especially to those organizations with urgent needs for OLAP, DSS, and/or data mining support. The administration of a data warehouse is an intensive enterprise, proportional to the size and complexity of the warehouse. An organization that attempts to administer a data warehouse must realistically understand the complex nature of its administration. Although designed for read access, a data warehouse is no more a static structure than any of its information sources. Source databases can be expected to evolve. The warehouse‘s schema and acquisition component must be expected to be updated to handle these evolutions. A significant issue in data warehousing is the quality control of data. Both quality and consistency of data are major concerns. Although the data passes through a cleaning function during acquisition, quality and consistency remain significant issues for the database administrator. Melding data from heterogeneous and disparate sources is a major challenge given differences in naming, domain definitions, identification numbers, and the like. Every time a source database changes, the data warehouse administrator must consider the possible interactions with other elements of the warehouse. Usage projections should be estimated conservatively prior to construction of the data warehouse and should be revised continually to reflect current requirements. As utilization patterns become clear and change over time, storage and access paths can be tuned to remain optimized for support of the organization‘s use of its warehouse. This activity should continue throughout the life of the warehouse in order to remain ahead of demand. The warehouse should also be designed to accommodate the addition and attrition of data sources without major redesign. Sources and source data will evolve, and the warehouse must accommodate such change. Fitting the available source data into the data model of the warehouse will be a continual challenge, a task that is as much art as science. Because there is continual rapid change in technologies, both the requirements and capabilities of the warehouse will change considerably over time. Additionally, data warehousing technology itself will continue to evolve for some time so that component structures and functionalities will continually be upgraded. This certain change is excellent motivation for having fully modular design 16 of components. Administration of a data warehouse will require far broader skills than are needed for traditional database administration. A team of highly skilled technical experts with overlapping areas of expertise will likely be needed, rather than a single individual. Like database administration, data warehouse administration is only partly technical; a large part of the responsibility requires working effectively with all the members of the organization with an interest in the data warehouse. 
However difficult that can be at times for database administrators, it is that much more challenging for data warehouse administrators, as the scope of their responsibilities is considerably broader. Design of the management function and selection of the management team for a database warehouse are crucial. Managing the data warehouse in a large organization will surely be a major task. Many commercial tools are available to support management functions. Effective data warehouse management will certainly be a team function, requiring a wide set of technical skills, careful coordination, and effective leadership. Just as we must prepare for the evolution of the warehouse, we must also recognize that the skills of the management team will, of necessity, evolve with it. Data Warehouse Guidelines: Building Data Warehouses Embarking on a data warehouse project is a daunting task. Many data warehouse projects are underfunded, unfocused, end-users are not trained to access data effectively, or there are organizational issues that cause them to fail. In fact, a large number of data warehousing projects which fail during the first year. According to Mitch Kramer, consulting editor at Patricia Seybold Group, strategic technologies, best practices, and business solutions consulting group based in Boston, there are many ways to make a data warehouse successful. Here are a few of the areas to be aware of when creating and implementing a data warehouse: 1. Keep things focused. "Try not to create a global solution." Kramer suggests that a good practice is to "focus on what you need. A small data warehouse or data mart which addresses a single subject or that is focused on a single department is much more efficient than a large data warehouse. You will see measurable results much faster from a data mart than a data warehouse. A focused data mart will get funding and gain organizational consensus a lot easier, too." 2. Don't worry about integration, keep things small. "Integration can be an issue, but it has always been a problem when organizations try to take a small filing system and integrate it into an organizational system. There are always 17 coding problems of some sort." Kramer then added, "Global systems always tend to fold, so keep it small." 3. Spend the extra money if you need help designing your system. Kramer commented, "Systems designing is the best place to spend the money on hiring consultants. They know the problems, and know how to deal with them. It is possible to design your own data warehouse system, but it is a lot less frustrating to hire out the design process." 4. Keep things simple. "Buy one single product from one vendor. This minimizes, or possibly eliminates any tool integration issues," Kramer advised. 5. Be in tune with the users. "Know your users," Kramer warned. "If you are not careful, you will wind up giving the right users the wrong tools, and that only leads one place - frustration. Find out who your end-users are, and work backward to the operational data. This will tell you what tools your data warehouse needs." 6. Consider your platforms. Kramer said "there really are no right platforms out there. You can start with a UNIX system or NT. Keep in mind that the NT has a ceiling in terms of scalability, but it works well with data marts, and most other small warehouses, just not global data warehouses." 7. Think before you data mine. "Data mining is a solution in search of a problem," Kramer said. "Know what you want to find before you select the tool. 
Data mining software simply relieves some of the burden from the analyst."

1.4 Operational Data Stores (ODS)
An operational data store (ODS) is a type of database that's often used as an interim logical area for a data warehouse. While in the ODS, data can be scrubbed, resolved for redundancy and checked for compliance with the corresponding business rules. An ODS can be used for integrating disparate data from multiple sources so that business operations, analysis and reporting can be carried out while business operations are occurring. This is the place where most of the data used in current operations is housed before it's transferred to the data warehouse for longer-term storage or archiving. An ODS is designed for relatively simple queries on small amounts of data (such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse. An ODS is similar to your short-term memory in that it stores only very recent information; in comparison, the data warehouse is more like long-term memory in that it stores relatively permanent information.

Operational data store (ODS) fact build
During the ETL process, the builds extract data from the operational system and map the data to the operational data store area in the data warehouse.

Extracting data
The source data is extracted through the XML ODBC driver from data services or XML data files. In most cases, data is loaded directly from the data sources into the operational data store area of the data warehouse. In some cases, however, data is extracted through staging: small ETL builds extract the data and store it into temporary tables. Other ETL builds retrieve the data, transform it, and map it to the operational data store area of the data warehouse. For products that support delta loads, extraction from data services is through delta loads. The structure of source data is specific to the data source. The attributes are extracted according to the measurement objectives. Therefore, not all attributes of the data sources are loaded to the data warehouse.

Transforming data
The transformation models do not contain complex business rules, or aggregations and calculations. The transformation of attributes happens in the following manner: Attributes that describe the entity itself are loaded directly to the data warehouse with the Attribute element. Attributes that describe a relationship between one entity and another are transformed, using lookup dimensions and derivations, into the surrogate key of the associated entity. For example, in the case of the dbid attribute of a defect in a ClearQuest® project, the lookup dimension takes the natural key (dbid of the project) and searches the PROJECT table in the operational data store area of the data warehouse to find a matching record. The derivation checks the result of the lookup dimension. If a match is found, the derivation returns the surrogate key of the project record. If a match is not found, which indicates that no project is associated with this defect, the derivation returns a value of -1. The result of the derivation is delivered to the data warehouse.

Delivering data
Similar data from different data sources is mapped to the same table in the data warehouse. The data is stored according to the subject or business domain. For example, a defect from Rational® ClearQuest and a defect from Rational Team Concert™ are mapped to the same REQUEST table in the operational data store. A small sketch of the lookup-and-derivation step appears below, followed by the most common mappings.
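The following minimal Python sketch mirrors the lookup-and-derivation behaviour just described: a natural key from the source record is looked up in the already-loaded PROJECT dimension, and its surrogate key is returned, or -1 when no match exists. The table contents and record layouts are invented for the example; the real builds use the vendor's Data Manager tooling rather than hand-written code.

# Simulated PROJECT table in the operational data store area:
# natural key (dbid from the source system) -> surrogate key assigned by the warehouse.
project_lookup = {
    "proj-1001": 1,
    "proj-1002": 2,
    "proj-1003": 3,
}

def derive_project_key(natural_key):
    # Derivation step: surrogate key if the lookup finds a match,
    # -1 if no project is associated with the defect.
    return project_lookup.get(natural_key, -1)

# Incoming defect records carrying the natural key of their project.
defects = [
    {"defect_id": "D-1", "project_dbid": "proj-1002"},
    {"defect_id": "D-2", "project_dbid": "proj-9999"},  # no matching project
]

for defect in defects:
    surrogate = derive_project_key(defect["project_dbid"])
    print(defect["defect_id"], "-> PROJECT surrogate key", surrogate)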
The most common mappings are: Record identity: This control attribute provided by Data Manager is for a unique number for each row and must be mapped to the surrogate key column in the data warehouse table. Last update date This control attribute provided by Data Manager is for the date on which an existing row was updated and must be mapped to the REC_TIMESTAMP column in the data warehouse table. SOURCE_ID This column in the data warehouse must be used to store the GUID of the data source, which can be used for differentiating data of different sources. For data sources where the data is extracted through the XML ODBC driver, a GUID is automatically assigned to each resource group and the value is put in each table in the column DATASOURCE_ID, which must be mapped to the SOURCE_ID column in the data warehouse table. For other data sources where the XML ODBC driver is not used, the value needs to be supplied manually. EXTERNAL_KEY1/EXTERNAL_KEY2 An attribute to store the integer or character type of the natural key from the data source. REFERENCE_ID An attribute to store a user-visible identifier, if the data source has one. URL An attribute to store the URL of an XML resource of a data source Classification ID An attribute for some commonly used artifacts such as projects, requests, requirements, tasks, activities, and components. This attribute is used for further classifying the data in these tables. For each artifact, a table with _CLASSIFICATION in the name is defined in the data warehouse and the IDs and values are predefined when the data warehouse is created. The ETL builds that deliver these artifacts into 20 the data warehouse must specify the value of the classification ID and map it to the corresponding column with _CLASS_ID in the name. 1.5 Extraction Transformation Loading (ETL) You must load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems must be extracted and copied into the data warehouse. The challenge in data warehouse environments is to integrate, rearrange and consolidate large volumes of data over many systems, thereby providing a new unified information base for business intelligence. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. Note that ETL refers to a broad process, and not three well-defined steps. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. Nevertheless, the entire process is known as ETL. The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems are the IT backbone of any enterprise. Data has to be shared between applications or systems, trying to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what is now called ETL. ETL Basics in Data Warehousing What happens during the ETL process? The following tasks are the main actions in the process. Extraction of Data During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. 
Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the 21 time span may vary between days/hours and minutes to near real-time. Web server log files, for example, can easily grow to hundreds of megabytes in a very short period. Transportation of Data After data is extracted, it has to be physically transported to the target system or to an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement. The emphasis in many of the examples in this section is scalability. Many long-time users of Oracle Database are experts in programming complex data transformation logic using PL/SQL. These chapters suggest alternatives for many such data manipulation operations, with a particular emphasis on implementations that take advantage of Oracle's new SQL functionality, especially for ETL and the parallel query infrastructure. ETL Tools for Data Warehouses Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project. Many data warehousing projects use ETL tools to manage this process. Oracle Warehouse Builder, for example, provides ETL capabilities and takes advantage of inherent database abilities. Other data warehouse builders create their own ETL tools and processes, either inside or outside the database. Besides the support of extraction, transformation, and loading, there are some other tasks that are important for a successful ETL implementation as part of the daily operations of the data warehouse and its support for further enhancements. Besides the support for designing a data warehouse and the data flow, these tasks are typically addressed by ETL tools such as Oracle Warehouse Builder. Oracle is not an ETL tool and does not provide a complete solution for ETL. However, Oracle does provide a rich set of capabilities that can be used by both ETL tools and customized ETL solutions. Oracle offers techniques for transporting data between Oracle databases, for transforming large volumes of data, and for quickly loading new data into a data warehouse. Daily Operations in Data Warehouses The successive loads and transformations must be scheduled and processed in a specific order. Depending on the success or failure of the operation or parts of it, the result must be tracked and subsequent, alternative processes might be started. The control of the progress as 22 well as the definition of a business workflow of the operations are typically addressed by ETL tools such as Oracle Warehouse Builder. Evolution of the Data Warehouse As the data warehouse is a living IT system, sources and targets might change. Those changes must be maintained and tracked through the lifespan of the system without overwriting or deleting the old ETL process flow information. 
To build and keep a level of trust about the information in the warehouse, the process flow of each individual record in the warehouse should, in the ideal case, be reconstructable at any point in time in the future.

1.6 Data Warehouse Metadata
Metadata is simply defined as data about data. The data that are used to represent other data are known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows: metadata is a road map to the data warehouse. Metadata in a data warehouse define the warehouse objects. The metadata act as a directory. This directory helps the decision support system to locate the contents of the data warehouse.

Categories of Metadata
The metadata can be broadly categorized into three categories:
Business Metadata - This metadata has the data ownership information, business definitions and changing policies.
Technical Metadata - Technical metadata includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of data migration and the transformations applied on it.

Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in the warehouse is different from that of the warehouse data, yet it is very important. The various roles of metadata are explained below.
The metadata act as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
Metadata help the decision support system in mapping of data when data are transformed from the operational environment to the data warehouse environment.
Metadata help in summarization between current detailed data and highly summarized data.
Metadata also help in summarization between lightly detailed data and highly summarized data.
Metadata are also used by query tools.
Metadata are used in reporting tools.
Metadata are used in extraction and cleansing tools.
Metadata are used in transformation tools.
Metadata also play an important role in loading functions.
Diagram to understand the role of metadata.

Metadata Repository
The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:
Definition of the data warehouse - This includes the description of the structure of the data warehouse. The description is defined by schema, views, hierarchies, derived data definitions, and data mart locations and contents.
Business Metadata - This metadata has the data ownership information, business definitions and changing policies.
Operational Metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of data migration and the transformations applied on it.
Data for mapping from the operational environment to the data warehouse - This metadata includes source databases and their contents, data extraction, data partitioning and cleaning, transformation rules, and data refresh and purging rules.
The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
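Purely as an illustration of the three categories above, the sketch below records business, technical and operational metadata for one hypothetical warehouse table in plain Python. Every name and value is invented; an actual metadata repository is a managed catalogue, not a dictionary.

# Business, technical and operational metadata for a single (invented) table.
sales_fact_metadata = {
    "business": {
        "owner": "Sales Operations",
        "definition": "One row per item sold per branch per day",
        "change_policy": "Schema changes require governance approval",
    },
    "technical": {
        "table": "SALES_FACT",
        "columns": {
            "time_key": "INTEGER",
            "item_key": "INTEGER",
            "dollars_sold": "DECIMAL(12,2)",
            "units_sold": "INTEGER",
        },
        "primary_key": ["time_key", "item_key"],
    },
    "operational": {
        "currency": "active",  # active, archived or purged
        "lineage": ["extracted from POS system", "cleaned", "aggregated to daily level"],
    },
}

# A decision support tool could use such entries as a directory to the warehouse.
for category, entries in sales_fact_metadata.items():
    print(category.upper())
    for key, value in entries.items():
        print("  ", key, "=", value)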
Challenges for Metadata Management The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformation and ensures the accuracy of calculations. The metadata also enforces the consistent definition of business terms to business end users. With all these uses of Metadata it also has challenges for metadata management. The some of the challenges are discussed below. The Metadata in a big organization is scattered across the organization. This metadata is spreaded in spreadsheets, databases, and applications. The metadata could present in text file or multimedia file. To use this data for information management solution, this data need to be correctly defined. There are no industry wide accepted standards. The data management solution vendors have narrow focus. There are no easy and accepted methods of passing metadata. 1.7 Summary We have presented in this unit about basics of data mining and data warehousing. The following concepts have been presented in brief. The amount of data collected by organizations grows by leaps and bounds. The amount of data is increasing year after year and there may be pay offs in uncovering hidden information behind these data. Data mining is a way to gain market 25 intelligence from this huge amount of data. There are four categories of data mining techniques/tools: Prediction, Classification, Clustering Analysis, and Association Rules Discovery. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data organized in support of management decision making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases. Data warehouse metadata are data defining the warehouse objects. An operational data store (ODS) is a type of database that's often used as an interim logical area for a data warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. Metadata is simply defined as data about data. Metadata has very important role in data warehouse. A metadata repository provides details regarding the warehouse structure, data history, the algorithms used for summarization, mappings from the source data to warehouse form, system performance, and business terms and issues. 1.8 Keywords Data mining, Prediction, Classification, Clustering Analysis, Operational data store (ODS), Extraction Transformation Loading, Data Warehouses, Metadata 1.9 Exercises a) What is data mining? b) What is data warehousing? c) What are data mining techniques? How is it used? d) Explain issues in data mining and data warehousing? e) Define Data warehouse? f) What are the Difficulties in Implementing Data Warehouses? g) Explain process of building Data Warehouses? h) Briefly explain Operational Data Stores (ODL)? 26 i) Briefly explain Extraction Transformation Loading (ETL)? j) What is Data Warehouse Metadata? What are its Categories? k) Explain role of Metadata? l) Write a note on challenges for Metadata Management? 1.10 References 1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publisher, Second Edition, 2006. 2. Research and Trends in Data Mining Technologies and Applications, edited by David Taniar, Idea Group Publications. 3. 
Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009. 27 Unit-2: Data Warehouse and OLAP Technology: An Overview Structure 2.1 Objectives 2.2 Introduction 2.3 Data Warehouse and OLAP Technology 2.4 A Multidimensional Data Model 2.5 Data Warehouse Architecture 2.6 Data Warehouse Implementation 2.7 Data Warehousing to Data Mining 2.8 Summary 2.9 Keywords 2.10 Exercises 2.11 References 2.1 Objectives The objectives covered under this unit include: The introduction to Data Warehouse OLAP Technology A Multidimensional Data Model Data Warehouse Architecture Data Warehouse Implementation Data Warehousing to Data Mining. 28 2.2 Introduction What is a Data Warehouse? Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration and data transformation and can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing and will provide an effective platform for data mining. Therefore, data warehousing and OLAP form an essential step in the knowledge discovery process. Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization‘s operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis. According to William H. Inmon, a leading architect in the construction of data warehouse systems, ―A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management‘s decision making process‖ This short, but comprehensive definition presents the major features of a data warehouse. The four keywords, subject-oriented, integrated, time-variant, and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems. Let‘s take a closer look at each of these key features. Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses 29 typically provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. 
Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.
Based on this information, we view data warehousing as the process of constructing and using data warehouses.

2.3 Data Warehouse and OLAP Technology
The construction of a data warehouse requires data cleaning, data integration, and data consolidation. The utilization of a data warehouse often necessitates a collection of decision support technologies. This allows "knowledge workers" (e.g., managers, analysts, and executives) to use the warehouse to quickly and conveniently obtain an overview of the data, and to make sound decisions based on information in the warehouse. Some authors use the term "data warehousing" to refer only to the process of data warehouse construction, while the term "warehouse DBMS" is used to refer to the management and utilization of data warehouses. Data warehousing is also very useful from the point of view of heterogeneous database integration. Many organizations typically collect diverse kinds of data and maintain large databases from multiple, heterogeneous, autonomous, and distributed information sources. The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors. The results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processes, and competes for resources with processing at local sources. It is inefficient and potentially expensive for frequent queries, especially for queries requiring aggregations. Data warehousing employs an update-driven approach in which information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike on-line transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system because data are copied, preprocessed, integrated, annotated, summarized, and restructured into one semantic data store. Furthermore, query processing in data warehouses does not interfere with the processing at local sources. Data warehouses can also store and integrate historical information and support complex multidimensional queries. As a result, data warehousing has become popular in industry.
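To make the contrast concrete, the following Python sketch is a toy version of the update-driven approach described above: records from two sources with different layouts are cleaned into one common representation ahead of time, so that an analytical query is answered from the consolidated store rather than being translated and shipped to each source at query time. The source record layouts and values are invented.

source_a = [  # e.g. an order-entry system
    {"cust": "C1", "amt_usd": 120.0, "yr": 2013},
    {"cust": "C2", "amt_usd": 80.0, "yr": 2013},
]
source_b = [  # e.g. a web-shop system with a different schema
    {"customer_id": "C1", "total": 45.0, "year": 2014},
    {"customer_id": "C3", "total": 200.0, "year": 2014},
]

# Integrate in advance into one common representation (the "warehouse").
warehouse = []
for rec in source_a:
    warehouse.append({"customer": rec["cust"], "amount": rec["amt_usd"], "year": rec["yr"]})
for rec in source_b:
    warehouse.append({"customer": rec["customer_id"], "amount": rec["total"], "year": rec["year"]})

# Analytical query answered directly from the consolidated store:
# total sales per customer across all years and all sources.
totals = {}
for rec in warehouse:
    totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + rec["amount"]
print(totals)  # {'C1': 165.0, 'C2': 80.0, 'C3': 200.0}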
Differences between Operational Database Systems and Data Warehouses Because most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems. The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems. The major distinguishing features between OLTP and OLAP are summarized as follows: Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and 31 manages information at different levels of granularity. These features make the data easier to use in informed decision making. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model (to be discussed in Section 2.2.2) and a subject oriented database design. View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations Table 2.1 Comparison between OLTP and OLAP systems. 32 But, Why Have a Separate Data Warehouse? Because operational databases store huge amounts of data, you may wonder, ―why not perform on-line analytical processing directly on such databases instead of spending additional time and resources to construct a separate data warehouse?‖ A major reason for such a separation is to help promote the high performance of both systems. An operational database is designed and tuned from known tasks and workloads, such as indexing and hashing using primary keys, searching for particular records, and optimizing ―canned‖ queries. On the other hand, data warehouse queries are often complex. They involve the computation of large groups of data at summarized levels, and may require the use of special data organization, access, and implementation methods based on multidimensional views. 
Processing OLAP queries in operational databases would substantially degrade the performance of operational tasks.
2.4 A Multidimensional Data Model
Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. From Tables and Spreadsheets to Data Cubes "What is a data cube?" A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, AllElectronics may create a sales data warehouse in order to keep records of the store's sales with respect to the dimensions time, item, branch, and location. These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example, a dimension table for item may contain the attributes item name, brand, and type. Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions. A multidimensional data model is typically organized around a central theme, like sales, for instance. This theme is represented by a fact table. Facts are numerical measures. Think of them as the quantities by which we want to analyze relationships between dimensions. Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables. You will soon get a clearer picture of how this works when we look at multidimensional schemas.
Table 2.2 A 2-D view of sales data for AllElectronics according to the dimensions time and item, where the sales are from branches located in the city of Vancouver. The measure displayed is dollars sold (in thousands).
Although we usually think of cubes as 3-D geometric structures, in data warehousing the data cube is n-dimensional. To gain a better understanding of data cubes and the multidimensional data model, let's start by looking at a simple 2-D data cube that is, in fact, a table or spreadsheet for sales data from AllElectronics. In particular, we will look at the AllElectronics sales data for items sold per quarter in the city of Vancouver. These data are shown in Table 2.2. In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars sold (in thousands). Now, suppose that we would like to view the sales data with a third dimension. For instance, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table 2.3. The 3-D data of Table 2.3 are represented as a series of 2-D tables. Conceptually, we may also represent the same data in the form of a 3-D data cube, as in Figure 2.1. In the data warehousing literature, a data cube such as this is often referred to as a cuboid.
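As a toy illustration of how such multidimensional sales data might be held in memory, the sketch below keys a dictionary by (time, item, location) tuples; the figures are invented for illustration and do not reproduce Table 2.2 or Table 2.3.

# A tiny 3-D "cube" for dimensions (time, item, location); measure = dollars sold
# (in thousands). Values are invented for illustration only.
cube = {
    ("Q1", "computer",           "Vancouver"): 1623,
    ("Q1", "home entertainment", "Vancouver"): 605,
    ("Q1", "computer",           "Toronto"):   968,
    ("Q2", "computer",           "Vancouver"): 1462,
}

# Fixing location = "Vancouver" gives back a 2-D view over (time, item),
# i.e. the kind of table or spreadsheet shown for a single city.
vancouver_2d = {(t, i): v for (t, i, loc), v in cube.items() if loc == "Vancouver"}
print(vancouver_2d)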
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. The result would form a lattice of cuboids, each showing the 34 data at a different level of summarization, or group by. The lattice of cuboids is then referred to as a data cube. Figure 2.3 shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier. The cuboid that holds the lowest level of summarization is called the base cuboid. For example, the 4-D cuboid in Figure 2.2 is the base cuboid for the given time, item, location, and supplier dimensions. Figure 2.1 is a 3-D (non base) cuboid for time, item, and location, summarized for all suppliers. The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example, this is the total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is typically denoted by all. Table 2.3 A 3-D view of sales data for AllElectronics, according to the dimensions time, item, and location. The measure displayed is dollars sold (in thousands). Figure 2.1 A 3-D data cube representation of the data in Table 2.3, according to the dimensions time, item, and location. The measure displayed is dollars sold (in thousands). Suppose that we would now like to view our sales data with an additional fourth dimension, such as supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D 35 cube as being a series of 3-D cubes, as shown in Figure 2.2. If we continue in this way, we may display any n-D data as a series of (n-1)-D ―cubes Figure 2.2 A 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier. The measure displayed is dollars sold (in thousands). For improved readability, only some of the cube values are shown. Figure 2.3 Lattice of cuboids, making up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization. Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. 36 The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let‘s look at each of these schema types. Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table Example 2.1 Star schema: A star schema for All Electronics sales is shown in Figure 2.4. Sales are considered along four dimensions, namely, time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. 
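A star schema such as the one described in Example 2.1 can be sketched directly as relational tables. The DDL below is a simplified, hypothetical rendering in SQLite (the attribute lists are abbreviated and the identifier names are illustrative), not the exact schema of the figure.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Central fact table: keys to each dimension plus the two measures.
CREATE TABLE sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    dollars_sold REAL, units_sold INTEGER
);
-- One smaller dimension table per dimension (attributes abbreviated).
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
""")

# A typical star join: aggregate the fact table by attributes of two dimensions.
query = """
SELECT t.quarter, l.country, SUM(s.dollars_sold)
FROM sales s JOIN time t ON s.time_key = t.time_key
             JOIN location l ON s.location_key = l.location_key
GROUP BY t.quarter, l.country
"""
print(con.execute(query).fetchall())   # empty until the tables are loaded

The radial shape of the schema is visible in the join: every dimension table connects to the fact table through its own key, and never to another dimension table.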
To minimize the size of the fact table, dimension identifiers (such as time key and item key) are system-generated identifiers. Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. Entries for such cities in the location dimension table will create redundancy among the attributes province or state and country, that is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
Figure 2.4 Star schema of a data warehouse for sales.
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space. However, this saving of space is negligible in comparison to the typical magnitude of the fact table. Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. Consequently, the system performance may be adversely impacted. Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.
Example 2.2 Snowflake schema: A snowflake schema for AllElectronics sales is given in Figure 2.5. Here, the sales fact table is identical to that of the star schema in Figure 2.4. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension. Notice that further normalization can be performed on province or state and country in the snowflake schema shown in Figure 2.5, when desirable.
Figure 2.5 Snowflake schema of a data warehouse for sales.
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
Example 2.3 Fact constellation: A fact constellation schema is shown in Figure 2.6. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 2.4). The shipping table has five dimensions, or keys: item key, time key, shipper key, from location, and to location, and two measures: dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between both the sales and shipping fact tables. In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For a data warehouse, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schemas are commonly used, since both are geared toward modeling single subjects, although the star schema is more popular and efficient.
Figure 2.6 Fact constellation schema of a data warehouse for sales and shipping.
Measures: Their Categorization and Computation
"How are measures computed?" To answer this question, we first study how measures can be categorized. Note that a multidimensional point in the data cube space can be defined by a set of dimension-value pairs, for example, (time = "Q1", location = "Vancouver", item = "computer"). A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point. We will look at concrete examples of this shortly. Measures can be organized into three categories (i.e., distributive, algebraic, holistic), based on the kind of aggregate functions used.
Distributive: An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. For example, count() can be computed for a data cube by first partitioning the cube into a set of sub cubes, computing count() for each sub cube, and then summing up the counts obtained for each sub cube. Hence, count() is a distributive aggregate function.
Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions.
Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed to describe a sub aggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function.
Example 2.7 Interpreting measures for data cubes. Many measures of a data cube can be computed by relational aggregation operations. In Figure 2.4, we saw a star schema for AllElectronics sales that contains two measures, namely, dollars sold and units sold.
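The three categories of measures just described can be illustrated by aggregating a partitioned data set; this sketch uses invented numbers and plain Python rather than a cube engine.

import statistics

partitions = [[4, 8, 15], [16, 23], [42]]   # the data split into three sub cubes
flat = [x for part in partitions for x in part]

# Distributive: sum()/count() per partition can be combined into the global answer.
total = sum(sum(p) for p in partitions)          # == sum(flat)
count = sum(len(p) for p in partitions)          # == len(flat)

# Algebraic: avg() is computable from a bounded number of distributive results.
avg = total / count                               # == statistics.mean(flat)

# Holistic: median() has no constant-size sub aggregate; it needs all the values.
median = statistics.median(flat)

print(total, count, avg, median)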
In Example 2.4, the sales star data cube corresponding to the schema was defined using DMQL commands. "But how are these commands interpreted in order to generate the specified data cube?" Suppose that the relational database schema of AllElectronics is the following:
time(time key, day, day of week, month, quarter, year)
item(item key, item name, brand, type, supplier type)
branch(branch key, branch name, branch type)
location(location key, street, city, province or state, country)
sales(time key, item key, branch key, location key, number of units sold, price)
The DMQL specification of Example 2.4 is translated into the following SQL query, which generates the required sales star cube. Here, the sum aggregate function is used to compute both dollars sold and units sold:
select s.time key, s.item key, s.branch key, s.location key, sum(s.number of units sold * s.price), sum(s.number of units sold)
from time t, item i, branch b, location l, sales s
where s.time key = t.time key and s.item key = i.item key and s.branch key = b.branch key and s.location key = l.location key
group by s.time key, s.item key, s.branch key, s.location key
The cube created in the above query is the base cuboid of the sales star data cube. It contains all of the dimensions specified in the data cube definition, where the granularity of each dimension is at the join key level. A join key is a key that links a fact table and a dimension table. The fact table associated with a base cuboid is sometimes referred to as the base fact table. Most of the current data cube technology confines the measures of multidimensional databases to numerical data. However, measures can also be applied to other kinds of data, such as spatial, multimedia, or text data.
Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois. The provinces and states can in turn be mapped to the country to which they belong, such as Canada or the USA. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries). The concept hierarchy described above is illustrated in Figure 2.7. Many concept hierarchies are implicit within the database schema. For example, suppose that the dimension location is described by the attributes number, street, city, province or state, zipcode, and country. These attributes are related by a total order, forming a concept hierarchy such as "street < city < province or state < country". This hierarchy is shown in Figure 2.8(a). Alternatively, the attributes of a dimension may be organized in a partial order, forming a lattice. An example of a partial order for the time dimension based on the attributes day, week, month, quarter, and year is "day < {month < quarter; week} < year". This lattice structure is shown in Figure 2.8(b). A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Figure 2.7 A concept hierarchy for the dimension location. Due to space limitations, not all of the nodes of the hierarchy are shown (as indicated by the use of "ellipsis" between nodes).
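A concept hierarchy for location can be represented as simple mappings from each level to the next. The city-to-province and province-to-country tables below are small hypothetical fragments in the spirit of Figure 2.7, and the generalize helper is illustrative only.

# Instance-level mappings for the schema hierarchy street < city < province or state < country.
city_to_province = {"Vancouver": "British Columbia", "Victoria": "British Columbia",
                    "Chicago": "Illinois", "New York": "New York"}
province_to_country = {"British Columbia": "Canada", "Illinois": "USA", "New York": "USA"}

def generalize(city, level):
    """Map a city value up the location hierarchy to the requested level."""
    if level == "city":
        return city
    province = city_to_province[city]
    return province if level == "province_or_state" else province_to_country[province]

print(generalize("Vancouver", "country"))           # Canada
print(generalize("Chicago", "province_or_state"))   # Illinois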
Figure 2.8 Hierarchical and lattice structures of attributes in warehouse dimensions.
OLAP Operations in the Multidimensional Data Model
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Figure 2.10 shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order "street < city < province or state < country." The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country. In other words, rather than grouping the data by city, the resulting cube groups the data by country. When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the two dimensions location and time. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. Figure 2.10 shows the result of a drill-down operation performed on the central cube by stepping down a concept hierarchy for time defined as "day < month < quarter < year." Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The resulting data cube details the total sales per month rather than summarizing them by quarter. Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube. For example, a drill-down on the central cube of Figure 2.10 can occur by introducing an additional dimension, such as customer group.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a sub cube. Figure 2.10 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a sub cube by performing a selection on two or more dimensions. Figure 2.10 shows a dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
Figure 2.10 Examples of typical OLAP operations on multidimensional data.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Figure 2.10 shows a pivot operation where the item and location axes in a 2-D slice are rotated. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube into a series of 2-D planes.
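Roll-up, slice, and dice can be mimicked over the same kind of tuple-keyed dictionary used earlier. The data, the city-to-country mapping, and the helper names below are illustrative only and do not reproduce the values of Figure 2.10.

from collections import defaultdict

# (time, item, city) -> dollars sold; invented sample values.
cube = {("Q1", "computer", "Vancouver"): 1623, ("Q1", "computer", "Toronto"): 968,
        ("Q2", "computer", "Vancouver"): 1462, ("Q1", "phone", "New York"): 389}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "New York": "USA"}

def roll_up_location(cube):
    """Climb the location hierarchy from city to country and re-aggregate."""
    out = defaultdict(int)
    for (time, item, city), v in cube.items():
        out[(time, item, city_to_country[city])] += v
    return dict(out)

def slice_(cube, time):          # selection on one dimension -> a sub cube
    return {k: v for k, v in cube.items() if k[0] == time}

def dice(cube, times, items):    # selection on two (or more) dimensions
    return {k: v for k, v in cube.items() if k[0] in times and k[1] in items}

print(roll_up_location(cube))
print(slice_(cube, "Q1"))
print(dice(cube, {"Q1", "Q2"}, {"computer"}))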
Other OLAP operations: Some OLAP systems offer additional drilling operations. For example, drill-across executes queries involving (i.e., across) more than one fact table. The drill-through operation uses relational SQL facilities to drill through the bottom level of a data cube down to its back-end relational tables. Other OLAP operations may include ranking the top N and/or bottom N items in lists, as well as computing moving averages, growth rates, interests, internal rates of return, depreciation, currency conversions, and statistical functions. OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios, variance, and so on, and for computing measures across multiple dimensions. It can generate summarizations, aggregations, and hierarchies at each granularity level and at every dimension intersection. OLAP also supports functional models for forecasting, trend analysis, and statistical analysis. In this context, an OLAP engine is a powerful data analysis tool.
OLAP Systems versus Statistical Databases
Many of the characteristics of OLAP systems, such as the use of a multidimensional data model and concept hierarchies, the association of measures with dimensions, and the notions of roll-up and drill-down, also exist in earlier work on statistical databases (SDBs). A statistical database is a database system that is designed to support statistical applications. Similarities between the two types of systems are rarely discussed, mainly due to differences in terminology and application domains. OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to focus on socioeconomic applications, OLAP has been targeted for business applications. Privacy issues regarding concept hierarchies are a major concern for SDBs. For example, given summarized socioeconomic data, it is controversial to allow users to view the corresponding low-level data. Finally, unlike SDBs, OLAP systems are designed for handling huge amounts of data efficiently.
A Starnet Query Model for Querying Multidimensional Databases
The querying of multidimensional databases can be based on a starnet model. A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations such as drill-down and roll-up.
Example 2.9 Starnet. A starnet query model for the AllElectronics data warehouse is shown in Figure 2.11. This starnet consists of four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively. Each line consists of footprints representing abstraction levels of the dimension. For example, the time line has four footprints: "day," "month," "quarter," and "year." A concept hierarchy may involve a single attribute (like date for the time hierarchy) or several attributes (e.g., the concept hierarchy for location involves the attributes street, city, province or state, and country).
Figure 2.11 Modeling business queries: a starnet model.
2.5 Data Warehouse Architecture
2.3.1 Steps for the Design and Construction of Data Warehouses
This subsection presents a business analysis framework for data warehouse design. The basic steps involved in the design process are also described.
The Design of a Data Warehouse: A Business Analysis Framework First, having a data warehouse may provide a competitive advantage by presenting relevant information from which to measure performance and make critical adjustments in order to help win over competitors. Second, a data warehouse can enhance business productivity because it is able to quickly and efficiently gather information that accurately describes the 47 organization. Third, a data warehouse facilitates customer relationship management because it provides a consistent view of customers and items across all lines of business, all departments, and all markets. Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and exceptions over long periods in a consistent and reliable manner. Four different views regarding the design of a data warehouse must be considered: the topdown view, the data source view, the data warehouse view, and the business query view. The top-down view allows the selection of the relevant information necessary for the data warehouse. This information matches the current and future business needs. The data source view exposes the information being captured, stored, and managed by operational systems. This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables. Data sources are often modeled by traditional data modeling techniques, such as the entityrelationship model or CASE (computer-aided software engineering) tools. The data warehouse view includes fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including pre calculated totals and counts, as well as information regarding the source, date, and time of origin, added to provide historical context. Finally, the business query view is the perspective of data in the data warehouse from the viewpoint of the end user. Building and using a data warehouse is a complex task because it requires business skills, technology skills, and program management skills. Regarding business skills, building a data warehouse involves understanding how such systems store and manages their data, how to build extractors that transfer data from the operational system to the data warehouse, and how to build warehouse refresh software that keeps the data warehouse reasonably up-to-date with the operational system‘s data. Regarding technology skills, data analysts are required to understand how to make assessments from quantitative information and derive facts based on conclusions from historical information in the data warehouse. These skills include the ability to discover patterns and trends, to extrapolate trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on such analysis. Finally, program management skills involve the need to interface with many technologies, vendors, and end users in order to deliver results in a timely and cost-effective manner. 48 The Process of Data Warehouse Design A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both. The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The bottom-up approach starts with experiments and prototypes. 
This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments. In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach. From the software engineering point of view, the design and construction of a data warehouse may consist of the following steps: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse. In general, the warehouse design process consists of the following steps: 1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen. 2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on. 3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status. 4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold Because data warehouse construction is a difficult and long-term task, its implementation scope should be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable, and measurable. This involves determining the time and budget allocations, the subset of the organization that is to be modeled, the number of data sources selected, and the number and types of departments to be served. 49 Once a data warehouse is designed and constructed, the initial deployment of the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades and maintenance must also be considered. Data warehouse administration includes data refreshment, data source synchronization, planning for disaster recovery, managing access control and security, managing data growth, managing database performance, and data warehouse enhancement and extension. Scope management includes controlling the number and range of queries, dimensions, and reports; limiting the size of the data warehouse; or limiting the schedule, budget, or resources. Various kinds of data warehouse design tools are available. Data warehouse development tools provide functions to define and edit metadata repository contents (such as schemas, scripts, or rules), answer queries, output reports, and ship metadata to and from relational database system catalogues. Planning and analysis tools study the impact of schema changes and of refresh performance when changing refresh rates or time windows. 2.3.2. Three-Tier Data Warehouse Architecture Data warehouses often adopt three-tier architecture, as presented in Figure 2.12. 1. The bottom tier is a warehouse database server that is almost always a relational database system. 
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different Sources into a unified format), as well as load and refresh functions to update the data warehouse (Section 2.3.3). The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connection) and OLEDB (Open Linking and Embedding for Databases) by Microsoft and JDBC (Java Database Connection). This tier also contains a metadata repository, which stores information about the data warehouse and its contents. The metadata repository is further described in Section 2.3.4. 2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a 50 multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations. OLAP servers are discussed in Section 2.3.5. 3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on). From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse. Figure 2.12 A three-tier data warehousing architecture. Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is crossfunctional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer super 51 servers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build. Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses. Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers. The top-down development of an enterprise warehouse serves as a systematic solution and minimizes integration problems. 
However, it is expensive, takes a long time to develop, and lacks flexibility due to the difficulty in achieving consistency and consensus for a common data model for the entire organization. The bottom-up approach to the design, development, and deployment of independent data marts provides flexibility, low cost, and rapid return of investment. It, however, can lead to problems when integrating various disparate data marts into a consistent enterprise data warehouse. 52 Figure 2.13 A recommended approach for data warehouse development. A recommended method for the development of data warehouse systems is to implement the warehouse in an incremental and evolutionary manner, as shown in Figure 2.13. First, a highlevel corporate data model is defined within a reasonably short period (such as one or two months) that provides a corporate-wide, consistent, integrated view of data among different subjects and potential usages. 2.3.3 Data Warehouse Back-End Tools and Utilities Data warehouse systems use back-end tools and utilities to populate and refresh their data (Figure 2.12). These tools and utilities include the following functions: Data extraction, which typically gathers data from multiple, heterogeneous, and external sources Data cleaning, which detects errors in the data and rectifies them when possible Data transformation, which converts data from legacy or host format to warehouse format Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions Refresh, which propagates the updates from the data sources to the warehouse Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a good set of data warehouse management tools. 53 Data cleaning and data transformation are important steps in improving the quality of the data and, subsequently, of the data mining results. 2.3.4 Metadata Repository Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Figure 2.12 showed a metadata repository within the bottom tier of the data warehousing architecture. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields. That have been added by data cleaning or integration processes. 
A metadata repository should contain the following: A description of the structure of the data warehouse, which includes the warehouse schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails) The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control) Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles Business metadata, which include business terms and definitions, data ownership information, and charging policies A data warehouse contains different levels of summarization, of which metadata is one type. Other types include current detailed data (which are almost always on disk), older detailed data (which are usually on tertiary storage), lightly summarized data and highly summarized data (which may or may not be physically housed). 54 Metadata play a very different role than other data warehouse data and are important for many reasons. For example, metadata are used as a directory to help the decision support system analyst locate the contents of the data warehouse, as a guide to the mapping of data when the data are transformed from the operational environment to the data warehouse environment, and as a guide to the algorithms used for summarization between the current detailed data and the lightly summarized data, and between the lightly summarized data and the highly summarized data. Metadata should be stored and managed persistently (i.e., on disk). Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP Logically, OLAP servers present business users with multidimensional data from data warehouses or data marts, without concerns regarding how or where the data are stored. However, the physical architecture and implementation of OLAP servers must consider data storage issues. Implementations of a warehouse server for OLAP processing include the following: Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools. They use a relational or extended relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of Micro strategy, for example, adopts the ROLAP approach. Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. 
The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be explored. Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser sub cubes are identified and stored as array structures, whereas sparse sub cubes employ compression technology for efficient storage utilization.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases, some database system vendors implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
Example 2.10 A ROLAP data store. Table 2.4 shows a summary fact table that contains both base fact data and aggregated data. The schema of the table is (record identifier (RID), item, day, month, quarter, year, dollars sold), where day, month, quarter, and year define the date of sales, and dollars sold is the sales amount. Consider the tuples with an RID of 1001 and 1002, respectively. The data of these tuples are at the base fact level, where the date of sales is October 15, 2003, and October 23, 2003, respectively. Consider the tuple with an RID of 5001. This tuple is at a more general level of abstraction than the tuples 1001 and 1002. The day value has been generalized to all, so that the corresponding time value is October 2003. That is, the dollars sold amount shown is an aggregation representing the entire month of October 2003, rather than just October 15 or 23, 2003. The special value all is used to represent subtotals in summarized data. MOLAP uses multidimensional array structures to store data for on-line analytical processing. This structure is discussed in the following section on data warehouse implementation. Most data warehouse systems adopt a client-server architecture. A relational data store always resides at the data warehouse/data mart server site. A multidimensional data store can reside at either the database server site or the client site.
2.6 Data Warehouse Implementation
At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions. In SQL terms, these aggregations are referred to as group-by's. Each group-by can be represented by a cuboid, where the set of group-by's forms a lattice of cuboids defining a data cube. In this section, we explore issues relating to the efficient computation of data cubes.
The compute cube Operator and the Curse of Dimensionality
One approach to cube computation extends SQL so as to include a compute cube operator. The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation. This can require excessive storage space, especially for large numbers of dimensions. We start with an intuitive look at what is involved in the efficient computation of data cubes.
Example 2.11 A data cube is a lattice of cuboids.
Suppose that you would like to create a data cube for AllElectronics sales that contains the following: city, item, year, and sales in dollars. You would like to be able to analyze the data, with queries such as the following: "Compute the sum of sales, grouping by city and item." "Compute the sum of sales, grouping by city." "Compute the sum of sales, grouping by item." What is the total number of cuboids, or group-by's, that can be computed for this data cube? Taking the three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as the measure, the total number of cuboids, or group-by's, that can be computed for this data cube is 2^3 = 8. The possible group-by's are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by's form a lattice of cuboids for the data cube, as shown in Figure 2.14. The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any combination of the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty.
Figure 2.14 Lattice of cuboids, making up a 3-D data cube. Each cuboid represents a different group-by. The base cuboid contains the three dimensions city, item, and year.
Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the "nonbase" cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids is referred to as the full cube. This choice typically requires huge amounts of memory space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids. Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion, such as where the tuple count of each cell is above some threshold. We will use the term sub cube to refer to the latter case, where only some of the cells may be precomputed for various cuboids. Partial materialization represents an interesting trade-off between storage space and response time.
2.4.2 Indexing OLAP Data
To facilitate efficient data accessing, most data warehouse systems support index structures and materialized views (using cuboids). The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. If the domain of a given attribute consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.
Example 2.12 Bitmap indexing.
In the AllElectronics data warehouse, suppose the dimension item at the top level has four values (representing item types): ―home entertainment,‖ ―computer,‖ ―phone,‖ and ―security.‖ Each value (e.g., ―computer‖) is represented by a bit vector in the bitmap index table for item. Suppose that the cube is stored as a relation table with 100,000 rows. Because the domain of item consists of four values, the bitmap index table requires four bit vectors (or lists), each with 100,000 bits. Figure 2.15 shows a base (data) table containing the dimensions item and city, and its mapping to bitmap index tables for each of the dimensions. Figure 2.15 Indexing OLAP data using bitmap indices. The join indexing method gained popularity from its use in relational database query processing. Traditional indexing maps the value in a given column to a list of rows having that value. In contrast, join indexing registers the joinable rows of two relations from a relational database. For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively. Hence, the join index 59 records can identify joinable tuples without performing costly join operations. Join indexing is especially useful for maintaining the relationship between a foreign key3 and its matching primary keys, from the joinable relation. Example 2.13 Join indexing. In Example 2.4, we defined a star schema for AllElectronics of the form ―sales star [time, item, branch, location]: dollars sold = sum (sales in dollars)‖. An example of a join index relationship between the sales fact table and the dimension tables for location and item is shown in Figure 2.16. For example, the ―Main Street‖ value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact table. Similarly, the ―Sony-TV‖ value in the item dimension table joins with tuples T57 and T459 of the sales fact table. The corresponding join index tables are shown in Figure 2.17. Figure 2.16 Linkages between a sales fact table and dimension tables for location and item. Suppose that there are 360 time values, 100 items, 50 branches, 30 locations, and 10 million sales tuples in the sales star data cube. If the sales fact table has recorded sales for only 30 items, the remaining 70 items will obviously not participate in joins. If join indices are not used, additional I/Os have to be performed to bring the joining portions of the fact table and dimension tables together. 60 2.4.3 Efficient Processing of OLAP Queries The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows: 1. Determine which operations should be performed on the available cuboids: This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations. For example, slicing and dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid. 2. 
Determine to which materialized cuboid(s) the relevant operations should be applied: This involves identifying all of the materialized cuboids that may potentially be used to answer the query, pruning the above set using knowledge of "dominance" relationships among the cuboids, estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.
Example 2.14 OLAP query processing. Suppose that we define a data cube for AllElectronics of the form "sales cube [time, item, location]: sum(sales in dollars)". The dimension hierarchies used are "day < month < quarter < year" for time, "item name < brand < type" for item, and "street < city < province or state < country" for location. Suppose that the query to be processed is on {brand, province or state}, with the selection constant "year = 2004". Also, suppose that there are four materialized cuboids available, as follows:
cuboid 1: {year, item name, city}
cuboid 2: {year, brand, country}
cuboid 3: {year, brand, province or state}
cuboid 4: {item name, province or state} where year = 2004
2.5 Data Warehousing to Data Mining
2.5.1 Data Warehouse Usage
Data warehouses and data marts are used in a wide range of applications. Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production. The data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools. In this context, the tools for data warehousing can be categorized into access and retrieval tools, database reporting tools, data analysis tools, and data mining tools. Business users need to have the means to know what exists in the data warehouse (through metadata), how to access the contents of the data warehouse, how to examine the contents using analysis tools, and how to present the results of such analysis. There are three kinds of data warehouse applications: information processing, analytical processing, and data mining:
Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost Web-based accessing tools that are then integrated with Web browsers.
Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized and detailed forms. The major strength of on-line analytical processing over information processing is the multidimensional data analysis of data warehouse data.
Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. On-line analytical processing comes a step closer to data mining because it can derive information summarized at multiple granularities from user-specified subsets of a data warehouse.
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data. OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process. In this sense, data mining goes one step beyond traditional on-line analytical processing. Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia data that are difficult to model with current multidimensional database technology. In this context, data mining covers a broader spectrum than OLAP with respect to data mining functionality and the complexity of the data handled. 2.5.2. From On-Line Analytical Processing to On-Line Analytical Mining In the field of data mining, substantial research has been performed for data mining on various platforms, including transaction databases, relational databases, spatial databases, text databases, time-series databases, flat files, data warehouses, and so on. On-line analytical mining (OLAM) (also called OLAP mining) integrates on-line analytical processing (OLAP) with data mining and mining knowledge in multidimensional databases. Among the many different paradigms and architectures of data mining systems, OLAM is particularly important for the following reasons: High quality of data in data warehouses: Most data mining tools need to work on integrated, consistent, and cleaned data, which requires costly data cleaning, data integration and data transformation as preprocessing steps. A data warehouse 63 constructed by such preprocessing serves as a valuable source of high quality data for OLAP as well as for data mining. Notice that data mining may also serve as a valuable tool for data cleaning and data integration as well. Available information processing infrastructure surrounding data warehouses: Comprehensive information processing and data analysis infrastructures have been or will be systematically constructed surrounding data warehouses, which include accessing, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLE DB connections, Web-accessing and service facilities, and reporting and OLAP analysis tools. It is prudent to make the best use of the available infrastructures rather than constructing everything from scratch. OLAP-based exploratory data analysis: Effective data mining needs exploratory data analysis. A user will often want to traverse through a database, select portions of relevant data, and analyze them at different granularities, and present knowledge / results in different forms. On-line analytical mining provides facilities for data mining on different subsets of data and at different levels of abstraction, by drilling, pivoting, filtering, dicing, and slicing on a data cube and on some intermediate data mining results. This, together with data/knowledge visualization tools, will greatly enhance the power and flexibility of exploratory data mining. On-line selection of data mining functions: Often a user may not know what kinds of knowledge she would like to mine. 
By integrating OLAP with multiple data mining functions, on-line analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

Architecture for On-Line Analytical Mining

An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. An integrated OLAM and OLAP architecture is shown in Figure 2.18, where the OLAM and OLAP servers both accept user on-line queries (or commands) via a graphical user interface API and work with the data cube in the data analysis via a cube API. A metadata directory is used to guide the access of the data cube.

Figure 2.18 An integrated OLAM and OLAP architecture.

2.8 Summary

In this unit, we have given a detailed discussion of data warehousing concepts. The following points have been presented.

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data organized in support of management decision making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases.

A multidimensional data model is typically used for the design of corporate data warehouses and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema. The core of the multidimensional model is the data cube, which consists of a large set of facts (or measures) and a number of dimensions. Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are useful in mining at multiple levels of abstraction.

On-line analytical processing (OLAP) can be performed in data warehouses/marts using the multidimensional data model. Typical OLAP operations include roll-up, drill-(down, across, through), slice-and-dice, and pivot (rotate), as well as statistical operations such as ranking and computing moving averages and growth rates. OLAP operations can be implemented efficiently using the data cube structure. Data warehouses often adopt a three-tier architecture. Data warehouse metadata are data defining the warehouse objects.

2.9 Keywords

Data Warehouse, OLAP, Star schema, Snowflake schema, Fact constellation, Distributive, Algebraic, Holistic, Hierarchies, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Design, Enterprise warehouse, ROLAP versus MOLAP versus HOLAP, Partial Materialization.

2.10 Exercises

1. What is a data warehouse? Explain.
2. What are the key features of a data warehouse? Explain.
3. Differentiate between operational database systems and data warehouses.
4. Write a note on the multidimensional data model.
5. What are the schemas of the multidimensional data model? Explain.
6. Explain OLAP operations.
7. Explain the steps for the design and construction of data warehouses.
8. Explain the three-tier data warehouse architecture.
9. What are the types of OLAP servers? Explain.
10. What are the three choices for data cube materialization?
11. Explain the architecture for on-line analytical mining.

2.11 References

1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.
4. Gartner: Data Warehouses, Operational Data Stores, Data Marts and Data Outhouses, Dec 2005.
5. Data Warehousing Fundamentals for IT Professionals by Paulraj Ponniah, Second Edition.

UNIT-3: DATA CUBES AND IMPLEMENTATION

Structure
3.1 Objectives
3.2 Introduction
3.3 Data Cube Implementations
3.4 Data Cube Operations
3.5 Implementation of OLAP
3.6 Overview of OLAP Software
3.7 Summary
3.8 Keywords
3.9 Exercises
3.10 References

3.1 Objectives

The objectives covered under this unit include:
An introduction to data cubes
Data cube implementations
Conceptual modeling of data warehousing
OLAP operations on multidimensional data
OLAP implementations

3.2 Introduction

The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database, and the cells in the data cube represent the measure of interest. For example, they could contain a count of the number of times that attribute combination occurs in the database, or the minimum, maximum, sum, or average value of some attribute. Queries are performed on the cube to retrieve decision support information.

3.3 Data Cube Implementations

Implementation of the data cube is one of the most important, albeit "expensive", processes in On-Line Analytical Processing (OLAP). It involves the computation and storage of the results of aggregate queries grouping on all possible dimension-attribute combinations over a fact table in a data warehouse. Such precomputation and materialization of (parts of) the cube is critical for improving the response time of OLAP queries and of operators such as roll-up, drill-down, slice-and-dice, and pivot, which use aggregation extensively [Chaudhuri and Dayal 1997; Gray et al. 1996]. Materializing the entire cube is ideal for fast access to aggregated data but may pose considerable costs both in computation time and in storage space. To balance the tradeoff between query response times and cube resource requirements, several efficient methods have been proposed, whose study is the main purpose of this article.

As a running example, consider a fact table R consisting of three dimensions (A, B, C) and one measure M (Figure 1a). Figure 1b presents the corresponding cube. Each view that belongs to the data cube (also called a cube node hereafter) materializes a specific group-by query, as illustrated in Figure 1b. Clearly, if D is the number of dimensions of a fact table, the number of all possible group-by queries is 2^D, which implies that the data cube is, in the worst case, exponentially larger with respect to D than the original data. In typical applications this is in the order of gigabytes, so the development of efficient data-cube implementation algorithms is extremely critical.

The data-cube implementation algorithms that have been proposed in the literature can be partitioned into four main categories, depending on the format they use to compute and store a data cube: Relational-OLAP (ROLAP) methods use traditional materialized views; Multidimensional-OLAP (MOLAP) methods use multidimensional arrays; Graph-Based methods take advantage of specialized graphs that usually take the form of tree-like data structures; finally, approximation methods exploit various in-memory representations (such as histograms), borrowed mainly from statistics.
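To make the ROLAP view of these 2^D group-bys concrete, here is a minimal sketch in standard SQL; the CUBE grouping construct is part of SQL:1999 and is available in most relational engines, the table and column names R, A, B, C, and M come from the running example, and summing M is an assumption made only for illustration.

-- A single statement requesting all 2^3 = 8 cuboids of R(A, B, C):
-- (A,B,C), (A,B), (A,C), (B,C), (A), (B), (C), and the empty grouping ().
SELECT A, B, C, SUM(M) AS M_total
FROM R
GROUP BY CUBE (A, B, C);

In a ROLAP setting, each of these groupings corresponds to one relational view; computing them efficiently and deciding which of them to store are exactly the computation and selection subproblems discussed next.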
Our focus in this article is on algorithms for ROLAP environments, for several reasons: (a) most existing publications share this focus; (b) ROLAP methods can be easily incorporated into existing relational servers, turning them into powerful OLAP tools with little effort, whereas MOLAP and Graph-Based methods construct and store specialized data structures, making them incompatible, in any direct sense, with conventional database engines; (c) ROLAP methods generate and store precise results, which are much easier to manage at run time than approximations.

Implementation of the data cube consists of two subproblems: one concerning the actual computation of the cube, and one concerning the particulars of storing parts of the results of that computation. The set of algorithms applicable to each subproblem depends strongly on the particular approach that has been chosen: ROLAP, MOLAP, Graph-Based, or Approximate. Specifically for ROLAP, which is the focus of this article, the two subproblems take on the following specialized forms:

Data cube computation is the problem of scanning the original data, applying the required aggregate function to all groupings, and generating relational views with the corresponding cube contents.

Data cube selection is the problem of determining the subset of the data cube views that will actually be stored. Selection methods avoid storing some pieces of the data cube according to certain criteria, so that what is finally materialized balances the tradeoff between query response time and cube resource requirements.

Definitions

A data warehouse is based on a multidimensional data model which views data in the form of a data cube. This is not necessarily a 3-dimensional cube: it is an n-dimensional cube. Dimensions of the cube are the equivalent of entities in a database, that is, the aspects by which the organization wants to keep records. Examples: product, date, location.

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions:
o Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
o A fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.

In the data warehousing literature, an n-D base cube is called the base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Conceptual Modeling of Data Warehousing

Star schema: A fact table in the middle connected to a set of dimension tables. The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables.

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables, and measures, the columns that contain numeric facts. A fact table can contain fact data at a detailed or an aggregated level.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary keys of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values; they are normally descriptive, textual values. Dimension tables are generally much smaller than the fact table.
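As a minimal sketch of how such a star schema could be declared in SQL DDL, the following mirrors the relational schema used later in this unit (time, item, branch, location, and sales); the data types and key constraints are illustrative assumptions, not a prescribed design.

-- Dimension tables: one row per dimension member, with descriptive attributes
CREATE TABLE time     (time_key INT PRIMARY KEY, day INT, day_of_week VARCHAR(10),
                       month INT, quarter INT, year INT);
CREATE TABLE item     (item_key INT PRIMARY KEY, item_name VARCHAR(50),
                       brand VARCHAR(30), type VARCHAR(30));
CREATE TABLE branch   (branch_key INT PRIMARY KEY, branch_name VARCHAR(30), branch_type VARCHAR(30));
CREATE TABLE location (location_key INT PRIMARY KEY, street VARCHAR(50), city VARCHAR(30),
                       province_or_state VARCHAR(30), country VARCHAR(30));

-- Fact table: foreign keys to every dimension plus the numeric measures
CREATE TABLE sales (
  time_key             INT REFERENCES time(time_key),
  item_key             INT REFERENCES item(item_key),
  branch_key           INT REFERENCES branch(branch_key),
  location_key         INT REFERENCES location(location_key),
  number_of_units_sold INT,
  price                DECIMAL(10,2),
  PRIMARY KEY (time_key, item_key, branch_key, location_key)
);

Note how the composite primary key of the fact table is made up of the dimension foreign keys, exactly as described above.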
Snowflake schema: A refinement of the star schema in which some dimension hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. The snowflake schema architecture is a more complex variation of the star schema used in a data warehouse, because the tables which describe the dimensions are normalized.

Fact constellation: Multiple fact tables share dimension tables. Viewed as a collection of stars, this is therefore also called a galaxy schema or fact constellation. For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each of which describes facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.

Data Measures: Three Categories

A data cube measure is a numerical function that can be evaluated at each point in the data cube space. Given a data point in the data cube space, entry (v1, v2, ..., vn), where vi is the value corresponding to dimension di, we apply the aggregate measure to the dimension values v1, v2, ..., vn. Measures fall into three categories:

Distributive: The result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning. Examples: count(), sum(), min(), max().

Algebraic: The measure can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. Examples: avg(), min_N(), standard_deviation().

Holistic: There is no constant bound on the storage size needed to describe a subaggregate. Examples: median(), mode(), rank().

How do we compute data cube measures? How do we evaluate dollars_sold and units_sold in the star schema of the previous example? Assume that the relational database schema corresponding to our example is the following:

time (time_key, day, day_of_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_state, country)
sales (time_key, item_key, branch_key, location_key, number_of_units_sold, price)

Let us then compute the two measures we have in our data cube, dollars_sold and units_sold:

select s.time_key, s.item_key, s.branch_key, s.location_key,
       sum(s.number_of_units_sold * s.price), sum(s.number_of_units_sold)
from time t, item i, branch b, location l, sales s
where s.time_key = t.time_key and s.item_key = i.item_key
  and s.branch_key = b.branch_key and s.location_key = l.location_key
group by s.time_key, s.item_key, s.branch_key, s.location_key

What is the relationship between "data cube" and "group by"? The above query corresponds to the base cuboid. By changing the group by clause in our query, we may generate other cuboids.

A Concept Hierarchy

A concept hierarchy is an order relation on a set of attributes of a concept or dimension. It can be generated manually (by users or experts) or automatically (by statistical analysis). Multidimensional data are usually organized into dimensions, and each dimension is further defined at lower levels of abstraction by concept hierarchies.
Example: For the dimension location, the order can be either partial or total:
Location dimension: street < city < state < country
Time dimension: day < {month < quarter; week} < year
Set-grouping hierarchy: a concept hierarchy among groups of values. Example: {1..10} < inexpensive.

3.4 Data Cube Operations

OLAP Operations on Multidimensional Data

Consider sales volume as a function of product, time, and region. Each dimension has a concept hierarchy: industry < category < product for product, region < country < city < office for location, and year < quarter < month < day (with week as an alternative path) for time. The accompanying figures show a sample data cube of sales volume by product, month, and region, together with the lattice of cuboids of that sample cube.

Querying a data cube: OLAP Operations

OLAP is a powerful analysis tool for forecasting, statistical computations, aggregations, and so on. The main operations are:

Roll up (drill-up): performed by climbing up the concept hierarchy of a dimension or by dimension reduction (reducing the cube by one or more dimensions). In the example, a roll-up on location is equivalent to grouping the data by country rather than by city.

Drill down (roll down): the reverse of roll-up. It is performed by stepping down a concept hierarchy for a dimension or by introducing new dimensions.

Slice: the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension.

Dice: produces a subcube by allowing the analyst to pick specific values of multiple dimensions.

Pivot (rotate): re-orients the cube for an alternative presentation of the data, transforming a 3-D view into a series of 2-D planes. Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter; pivoting could then replace products with time periods to see data across time for a single product.

Other operations:
Drill across: involves (across) more than one fact table.
Drill through: goes through the bottom level of the cube to its back-end relational tables (using SQL).

Starnet Query Model for Multidimensional Databases

In the starnet query model, each radial line represents a dimension, and each abstraction level in a concept hierarchy is called a footprint.

3.5 Implementation of OLAP

Multidimensional OLAP and Relational OLAP

In this section, we compare OLAP implementations using traditional relational star schemas and multidimensional databases.

Multidimensional OLAP

MOLAP (Multidimensional OLAP) is OLAP done using an MDBMS (the traditional approach). Multidimensional Database Management Systems (MDBMS) are used to define and store data cubes in special data structures: dimensions and cubes. MDBMS have special storage structures and access routines (typically array-based) to efficiently store and retrieve data cubes.

Advantages: powerful, efficient database engines for manipulating data cubes (including indexes, access routines, etc.).

Disadvantages: MDBMS use proprietary database structures and DML (i.e., not SQL), which requires a different skill set and different modeling tools, and requires another vendor's DBMS (or feature subset) to be maintained. They are not designed for transaction processing; for example, updating existing data is inefficient. Many commercial MOLAP systems are tightly integrated with reporting and analysis tools (BI tool sets).
Some commercial MOLAP servers also support Relational OLAP. Example proprietary MOLAP databases and BI tool sets:
o Oracle Sybase (also supports Relational OLAP)
o IBM Cognos TM1
o MicroStrategy (also supports Relational OLAP)

Relational OLAP

ROLAP (Relational OLAP) is OLAP done on a relational DBMS.

Advantages: uses familiar RDBMS technologies and products; uses familiar SQL; builds on an existing skill base and tools; and (possibly) deals with only one vendor's DBMS for both OLTP and OLAP.

Disadvantages: historically inefficient implementations (although these have improved considerably over time).

Example ROLAP databases and BI tool sets:
o Oracle OLAP
o Microsoft Analysis Services
o Oracle Essbase (also supports MOLAP)
o Mondrian (open source), offered by Pentaho
o MicroStrategy (also supports MOLAP)

Oracle OLAP Implementations

The Oracle RDBMS has supported OLAP structures both in a proprietary MOLAP implementation (within the relational database system) and as relational OLAP cubes.

Implementation Techniques for OLAP

Data warehouse implementation involves:
Monitoring: sending data from sources.
Integrating: loading, cleansing, and so on.
Processing: efficient cube computation, query processing in general, indexing, and so on.

Cube Computation: One approach extends SQL with a compute cube operator. The cube operator is the n-dimensional generalization of the group-by SQL clause. OLAP needs to compute the cuboid corresponding to each input query.

Pre-computation: For fast response time, it is a good idea to pre-compute data for all cuboids, or at least a subset of them, since the number of cuboids for an n-dimensional cube in which dimension i has Li hierarchy levels is T = (L1 + 1) x (L2 + 1) x ... x (Ln + 1), which grows very rapidly with n.

Materialization of the data cube: store in the warehouse results useful for common queries. Pre-computing some cuboids is equivalent to defining new warehouse relations using SQL expressions. One may materialize every cuboid (full materialization), none (no materialization), or some (partial materialization); the selection of which cuboids to materialize is based on size, sharing, access frequency, and so on.

Cube Operation

Cube definition and computation in DMQL:

define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales

This can be transformed into an SQL-like language (with a new operator, cube by, introduced by Gray et al. 1996):

SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year

This requires computing the following group-bys: (item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), and ().

Cube Computation Methods

ROLAP-based cubing: Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples. Grouping is performed on some subaggregates as a "partial grouping step". Aggregates may be computed from previously computed aggregates, rather than from the base fact table.

MOLAP approach: Uses an array-based algorithm. The base cuboid is stored as a multidimensional array, and a number of cells are read in to compute partial cuboids.

Indexing OLAP Data: Bitmap Index

A bitmap index is built on a particular column. Each value in the column has a bit vector, and bit operations are fast. The length of each bit vector equals the number of records in the base table, and the i-th bit is set if the i-th row of the base table has that value for the indexed column. Bitmap indexes are not suitable for high-cardinality domains.

Indexing OLAP Data: Join Indices

A join index takes the form JI(R-id, S-id), where R(R-id, ...) joins S(S-id, ...). Traditional indices map values to a list of record ids.
It materializes the relational join in the JI file and speeds up relational joins, which are rather costly operations. In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table. For example, for a fact table Sales and two dimensions city and product, a join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city. Join indices can span multiple dimensions.

3.6 Overview of OLAP Software

What is Online Analytical Processing (OLAP)?

Online Analytical Processing (OLAP) databases facilitate business-intelligence queries. OLAP is a database technology that has been optimized for querying and reporting instead of processing transactions. The source data for OLAP are Online Transactional Processing (OLTP) databases that are commonly stored in data warehouses. OLAP data are derived from this historical data and aggregated into structures that permit sophisticated analysis. OLAP data are also organized hierarchically and stored in cubes instead of tables. It is a sophisticated technology that uses multidimensional structures to provide rapid access to data for analysis. This organization makes it easy for a PivotTable report or PivotChart report to display high-level summaries, such as sales totals across an entire country or region, and also to display the details for sites where sales are particularly strong or weak.

OLAP databases are designed to speed up the retrieval of data. Because the OLAP server, rather than Microsoft Office Excel, computes the summarized values, less data needs to be sent to Excel when you create or change a report. This approach enables you to work with much larger amounts of source data than you could if the data were organized in a traditional database, where Excel retrieves all of the individual records and then calculates the summarized values.

OLAP databases contain two basic types of data: measures, which are numeric data, the quantities and averages that you use to make informed business decisions; and dimensions, which are the categories that you use to organize these measures. OLAP databases help organize data by many levels of detail, using the same categories that you are familiar with to analyze the data. The following sections describe each of these components in more detail.

Cube: A data structure that aggregates the measures by the levels and hierarchies of each of the dimensions that you want to analyze. Cubes combine several dimensions, such as time, geography, and product lines, with summarized data, such as sales or inventory figures. Cubes are not "cubes" in the strictly mathematical sense because they do not necessarily have equal sides; however, they are an apt metaphor for a complex concept.

Measure: A set of values in a cube that are based on a column in the cube's fact table and that are usually numeric. Measures are the central values in the cube that are preprocessed, aggregated, and analyzed. Common examples include sales, profits, revenues, and costs.

Member: An item in a hierarchy representing one or more occurrences of data. A member can be either unique or non-unique. For example, 2007 and 2008 represent unique members in the year level of a time dimension, whereas January represents a non-unique member in the month level, because there can be more than one January in the time dimension if it contains data for more than one year.

Calculated member: A member of a dimension whose value is calculated at run time by using an expression.
Calculated member values may be derived from other members' values. For example, a calculated member Profit can be determined by subtracting the value of the member Costs from the value of the member Sales.

Dimension: A set of one or more organized hierarchies of levels in a cube that a user understands and uses as the basis for data analysis. For example, a geography dimension might include levels for Country/Region, State/Province, and City, and a time dimension might include a hierarchy with levels for year, quarter, month, and day. In a PivotTable report or PivotChart report, each hierarchy becomes a set of fields that you can expand and collapse to reveal lower or higher levels.

Hierarchy: A logical tree structure that organizes the members of a dimension such that each member has one parent member and zero or more child members. A child is a member in the next lower level in a hierarchy that is directly related to the current member; for example, in a Time hierarchy containing the levels Quarter, Month, and Day, January is a child of Qtr1. A parent is a member in the next higher level in a hierarchy that is directly related to the current member, and the parent value is usually a consolidation of the values of all of its children; for example, in the same Time hierarchy, Qtr1 is the parent of January.

Level: Within a hierarchy, data can be organized into lower and higher levels of detail, such as the Year, Quarter, Month, and Day levels in a Time hierarchy.

OLAP features in Excel

Retrieving OLAP data: You can connect to OLAP data sources just as you do to other external data sources. You can work with databases created with Microsoft SQL Server OLAP Services version 7.0, Microsoft SQL Server Analysis Services version 2000, and Microsoft SQL Server Analysis Services version 2005, the Microsoft OLAP server products. Excel can also work with third-party OLAP products that are compatible with OLE DB for OLAP. You can display OLAP data only as a PivotTable report or PivotChart report, or in a worksheet function converted from a PivotTable report, but not as an external data range. You can save OLAP PivotTable reports and PivotChart reports in report templates, and you can create Office Data Connection (ODC) files (.odc) to connect to OLAP databases for OLAP queries. When you open an ODC file, Excel displays a blank PivotTable report, which is ready for you to lay out.

Creating cube files for offline use: You can create an offline cube file (.cub) with a subset of the data from an OLAP server database. Use offline cube files to work with OLAP data when you are not connected to your network. A cube enables you to work with larger amounts of data in a PivotTable report or PivotChart report than you could otherwise, and it speeds up retrieval of the data. You can create cube files only if you use an OLAP provider, such as Microsoft SQL Server Analysis Services version 2005, that supports this feature.

Server Actions: A server action is an optional but useful feature that an OLAP cube administrator can define on a server; it uses a cube member or measure as a parameter in a query to obtain details in the cube, or to start another application, such as a browser. Excel supports URL, Report, Rowset, Drill Through, and Expand to Detail server actions, but it does not support Proprietary, Statement, and Dataset. For more information, see Perform an OLAP server action in a PivotTable report.
KPIs: A KPI is a special calculated measure that is defined on the server and that allows you to track key performance indicators, including status (does the current value meet a specific target?) and trend (what is the value over time?). When these are displayed, the server can send related icons, similar to the Excel icon set, to indicate above- or below-target status levels (such as a stop-light icon) or whether a value is trending up or down (such as a directional arrow icon).

Server Formatting: Cube administrators can create measures and calculated members with color formatting, font formatting, and conditional formatting rules that may be designated as corporate standard business rules. For example, a server format for profit might be a currency number format, a cell color of green if the value is greater than or equal to 30,000 and red if it is less than 30,000, and a bold font style if the value is less than 30,000 and regular otherwise. For more information, see Design the layout and format of a PivotTable report.

Office display language: A cube administrator can define translations for data and errors on the server for users who need to see PivotTable information in another language. This feature is defined as a file connection property, and the user's computer country/regional setting must correspond to the display language.

Software components that you need to access OLAP data sources

An OLAP provider: To set up OLAP data sources for Excel, you need one of the following OLAP providers.

Microsoft OLAP provider: Excel includes the data source driver and client software that you need to access databases created with Microsoft SQL Server OLAP Services version 7.0, Microsoft SQL Server OLAP Services version 2000 (8.0), and Microsoft SQL Server Analysis Services version 2005 (9.0).

Third-party OLAP providers: For other OLAP products, you need to install additional drivers and client software. To use the Excel features for working with OLAP data, the third-party product must conform to the OLE DB for OLAP standard and be Microsoft Office compatible. For information about installing and using a third-party OLAP provider, consult your system administrator or the vendor for your OLAP product.

Server databases and cube files: The Excel OLAP client software supports connections to two types of OLAP databases. If a database on an OLAP server is available on your network, you can retrieve source data from it directly. If you have an offline cube file that contains OLAP data, or a cube definition file, you can connect to that file and retrieve source data from it.

Data sources: A data source gives you access to all of the data in the OLAP database or offline cube file. After you create an OLAP data source, you can base reports on it and return the OLAP data to Excel in the form of a PivotTable report or PivotChart report, or in a worksheet function converted from a PivotTable report.

Microsoft Query: You can use Query to retrieve data from an external database such as Microsoft SQL Server or Microsoft Access. You do not need to use Query to retrieve data from an OLAP PivotTable that is connected to a cube file.

IBM OLAP

The online analytical processing (OLAP) capabilities in IBM® Cognos® Enterprise software make data available for users to explore, query, and analyze on their own in interactive workspaces.
The Cognos platform, the foundation for Cognos Enterprise, offers different OLAP options to meet different needs while providing a consistent user experience and accelerated performance. Cognos Enterprise software provides OLAP capabilities for:

Write-back, what-if analysis, planning and budgeting, and other specialized applications. IBM Cognos TM1® is a 64-bit, in-memory OLAP engine designed to meet these needs.

Querying a data warehouse that is structured as a star or snowflake schema. A Cognos dynamic cube is an OLAP component designed to accelerate performance over terabytes of data in relational databases.

3.7 Summary

The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database, and the cells in the data cube represent the measure of interest. Implementation of the data cube is one of the most important, albeit "expensive", processes in On-Line Analytical Processing (OLAP). A data warehouse is based on a multidimensional data model which views data in the form of a data cube. Online Analytical Processing (OLAP) databases facilitate business-intelligence queries; OLAP is a database technology that has been optimized for querying and reporting instead of processing transactions. The source data for OLAP are Online Transactional Processing (OLTP) databases that are commonly stored in data warehouses. OLAP data are derived from this historical data and aggregated into structures that permit sophisticated analysis.

3.8 Keywords

Data Cube, Star schema, Snowflake schema, Multidimensional OLAP, Relational OLAP, Cube Operation, ROLAP, MOLAP, IBM OLAP.

3.9 Exercises

1. Explain the conceptual modeling of data warehousing.
2. What are data measures? Explain.
3. How are data cube measures computed?
4. Explain OLAP operations.
5. Explain the implementation of OLAP.
6. What are the advantages and disadvantages of Relational OLAP?
7. Write a note on OLAP software.

3.10 References

1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann, Second Edition, 2006.
2. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.

UNIT-4: BASICS OF DATA MINING

Structure
4.1 Objectives
4.2 Introduction
4.3 Challenges of Data Mining
4.4 Data Mining Tasks
4.5 Types of Data
4.6 Data Pre-processing
4.7 Measures of Similarity and Dissimilarity
4.8 Data Mining Applications
4.9 Summary
4.10 Keywords
4.11 Exercises
4.12 References

4.1 Objectives

The objectives covered under this unit include:
An introduction to data mining
Challenges of data mining
Data mining tasks
Types of data
Data pre-processing
Measures of similarity and dissimilarity
Data mining applications

4.2 Introduction

Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. There are other definitions: data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large databases.
Data mining is predicted to be "one of the most revolutionary developments of the next decade," according to the online technology magazine ZDNET. In fact, the MIT Technology Review chose data mining as one of ten emerging technologies that will change the world. "Data mining expertise is the most sought after..." among information technology professionals, according to the 1999 Information Week National Salary Survey. The survey reports: "Data mining skills are in high demand this year, as organizations increasingly put data repositories online. Effectively analyzing information from customers, partners, and suppliers has become important to more companies. Many companies have implemented a data warehouse strategy and are now starting to look at what they can do with all that data."

4.3 Challenges of Data Mining

Developing a Unifying Theory of Data Mining: Several respondents feel that the current state of the art of data mining research is too "ad hoc." Many techniques are designed for individual problems, such as classification or clustering, but there is no unifying theory. A theoretical framework that unifies different data mining tasks, including clustering, classification, and association rules, as well as different data mining approaches (such as statistics, machine learning, and database systems), would help the field and provide a basis for future research.

Scaling Up for High-Dimensional Data and High-Speed Data Streams: One challenge is how to design classifiers to handle ultra-high-dimensional classification problems. There is a strong need to build useful classifiers with hundreds of millions or billions of features, for applications such as text mining and drug safety analysis. Such problems often begin with tens of thousands of features plus interactions between the features, so the number of implied features grows huge quickly. Another important problem is mining data streams in extremely large databases (e.g., 100 TB); satellite and computer network data can easily be of this scale. However, today's data mining technology is still too slow to handle data of this scale. In addition, data mining should be a continuous, online process, rather than an occasional one-shot process. Organizations that can do this will have a decisive advantage over ones that do not. Data streams present a new challenge for data mining researchers.

Mining Sequence Data and Time Series Data: Sequential and time series data mining remains an important problem. Despite progress in other related fields, how to efficiently cluster, classify, and predict the trends of these data is still an important open topic. A particularly challenging problem is noise in time series data, and it is an important open issue to tackle. Many time series used for prediction are contaminated by noise, making it difficult to do accurate short-term and long-term predictions; examples include the prediction of financial time series and seismic time series. Although signal processing techniques, such as wavelet analysis and filtering, can be applied to remove the noise, they often introduce lags in the filtered data. Such lags reduce the accuracy of predictions, because the predictor must overcome the lags before it can predict into the future. Existing data mining methods also have difficulty handling noisy data and learning meaningful information from such data.

Mining Complex Knowledge from Complex Data: One important type of complex knowledge is in the form of graphs.
Recent research has touched on the topic of discovering graphs and structured patterns from large data, but clearly more needs to be done. Another form of complexity comes from data that are non-i.i.d. (not independent and identically distributed). This problem can occur when mining data from multiple relations. In most domains, the objects of interest are not independent of each other and are not of a single type. We need data mining systems that can soundly mine the rich structure of relations among objects, such as interlinked Web pages, social networks, metabolic networks in the cell, and so on. Yet another important problem is how to mine non-relational data. A great majority of most organizations' data is in text form, not databases, and in more complex data formats including image, multimedia, and Web data. Thus, there is a need to study data mining methods that go beyond classification and clustering. Some interesting questions include how to perform better automatic summarization of text and how to recognize the movement of objects and people from Web and wireless data logs in order to discover useful spatial and temporal knowledge.

Distributed Data Mining and Mining Multi-Agent Data: The problem of distributed data mining is very important in network settings. In a distributed environment (such as a sensor or IP network), one has distributed probes placed at strategic locations within the network. The problem is to correlate the data seen at the various probes and discover patterns in the global data seen at all the different probes. There could be different models of distributed data mining here: one could involve a network operations center (NOC) that collects data from the distributed sites, and another in which all sites are treated equally. The goal would obviously be to minimize the amount of data shipped between the various sites, essentially to reduce the communication overhead.

Security, Privacy, and Data Integrity: Several researchers considered privacy protection in data mining an important topic, that is, how to ensure users' privacy while their data are being mined. Related to this topic is data mining for the protection of security and privacy. One respondent states that if we do not solve the privacy issue, data mining will become a derogatory term to the general public. Some respondents consider the problem of knowledge integrity assessment to be important. We quote their observations: "Data mining algorithms are frequently applied to data that have been intentionally modified from their original version, in order to misinform the recipients of the data or to counter privacy and security threats. Such modifications can distort, to an unknown extent, the knowledge contained in the original data. As a result, one of the challenges facing researchers is the development of measures not only to evaluate the knowledge integrity of a collection of data, but also to evaluate the knowledge integrity of individual patterns. Additionally, the problem of knowledge integrity assessment presents several challenges."

4.4 Data Mining Tasks

The following list shows the most common data mining tasks:
Description
Estimation
Prediction
Classification
Clustering
Association

Description: Sometimes, researchers and analysts are simply trying to find ways to describe the patterns and trends lying within data. For example, a pollster may uncover evidence that those who have been laid off are less likely to support the present incumbent in the presidential election.
Descriptions of patterns and trends often suggest possible explanations for them. For example, those who are laid off are now less well off financially than before the incumbent was elected, and so would tend to prefer an alternative. Data mining models should be as transparent as possible; that is, the results of the data mining model should describe clear patterns that are amenable to intuitive interpretation and explanation. Some data mining methods are more suited than others to transparent interpretation. For example, decision trees provide an intuitive and human-friendly explanation of their results. On the other hand, neural networks are comparatively opaque to nonspecialists, due to the nonlinearity and complexity of the model.

Estimation: Estimation is similar to classification except that the target variable is numerical rather than categorical. Models are built using "complete" records, which provide the value of the target variable as well as the predictors. Then, for new observations, estimates of the value of the target variable are made based on the values of the predictors. For example, we might be interested in estimating the systolic blood pressure reading of a hospital patient based on the patient's age, gender, body-mass index, and blood sodium levels. The relationship between systolic blood pressure and the predictor variables in the training set would provide us with an estimation model, which we can then apply to new cases. Examples of estimation tasks in business and research include:
Estimating the amount of money a randomly chosen family of four will spend for back-to-school shopping this fall.
Estimating the percentage decrease in rotary movement sustained by a National Football League running back with a knee injury.
Estimating the number of points per game that Patrick Ewing will score when double-teamed in the playoffs.

Prediction: Prediction is similar to classification and estimation, except that for prediction the results lie in the future. Examples of prediction tasks in business and research include:
Predicting the price of a stock three months into the future.
Predicting the percentage increase in traffic deaths next year if the speed limit is increased.
Predicting the winner of this fall's baseball World Series, based on a comparison of team statistics.
Predicting whether a particular molecule in drug discovery will lead to a profitable new drug for a pharmaceutical company.
Any of the methods and techniques used for classification and estimation may also be used, under appropriate circumstances, for prediction. These include the traditional statistical methods of point estimation and confidence interval estimation, simple linear regression and correlation, and multiple regression.

Classification: In classification, there is a target categorical variable, such as income bracket, which, for example, could be partitioned into three classes or categories: high income, middle income, and low income. The data mining model examines a large set of records, each record containing information on the target variable as well as a set of input or predictor variables. For example, consider a data set of customer records containing income bracket along with attributes such as age, gender, and occupation. Suppose that the researcher would like to be able to classify the income brackets of persons not currently in the database, based on the other characteristics associated with each person. This is a classification task, very nicely suited to data mining methods and techniques.
The algorithm would proceed roughly as follows. First, examine the data set containing both the predictor variables and the (already classified) target variable, income bracket. In this way, the algorithm (software) "learns about" which combinations of variables are associated with which income brackets; for example, older females may be associated with the high-income bracket. This data set is called the training set. Then the algorithm would look at new records, for which no information about income bracket is available. Based on the classifications in the training set, the algorithm would assign classifications to the new records; for example, a 63-year-old female professor might be classified in the high-income bracket. Examples of classification tasks in business and research include:
Determining whether a particular credit card transaction is fraudulent.
Placing a new student into a particular track with regard to special needs.
Assessing whether a mortgage application is a good or bad credit risk.
Diagnosing whether a particular disease is present.
Determining whether a will was written by the actual deceased, or fraudulently by someone else.
Identifying whether or not certain financial or personal behaviour indicates a possible terrorist threat.

Clustering: Clustering refers to the grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering; the clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within a cluster is maximized and the similarity to records outside the cluster is minimized. Examples of clustering tasks in business and research include:
Target marketing of a niche product for a small-capitalization business that does not have a large marketing budget.
For accounting auditing purposes, segmenting financial behaviour into benign and suspicious categories.
As a dimension-reduction tool when the data set has hundreds of attributes.
For gene expression clustering, where very large quantities of genes may exhibit similar behaviour.

Association: The association task for data mining is the job of finding which attributes "go together." Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules are of the form "If antecedent, then consequent," together with a measure of the support and confidence associated with the rule. For example, a particular supermarket may find that of the 1,000 customers shopping on a Thursday night, 200 bought diapers, and of those 200 who bought diapers, 50 bought beer. Thus, the association rule would be "If buy diapers, then buy beer," with a confidence of 50/200 = 25%; the antecedent (diapers) has a support of 200/1000 = 20%, while the rule itself, under the usual definition, has a support of 50/1000 = 5%, the fraction of all transactions containing both diapers and beer.
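As a sketch of how these figures could be computed directly in SQL, assume a hypothetical table basket(transaction_id, item) with one row per item purchased in a transaction; the table and item names are illustrative only, and the query follows the diapers-and-beer example above.

WITH totals AS (
  SELECT COUNT(DISTINCT transaction_id) AS n_trans FROM basket
),
diaper_trans AS (
  -- transactions containing the antecedent (diapers)
  SELECT DISTINCT transaction_id FROM basket WHERE item = 'diapers'
),
both_trans AS (
  -- transactions containing both diapers and beer
  SELECT DISTINCT b.transaction_id
  FROM basket b
  JOIN diaper_trans d ON b.transaction_id = d.transaction_id
  WHERE b.item = 'beer'
)
SELECT
  (SELECT COUNT(*) FROM diaper_trans) * 1.0 / t.n_trans AS antecedent_support,  -- 200/1000 = 20%
  (SELECT COUNT(*) FROM both_trans)   * 1.0 / t.n_trans AS rule_support,        -- 50/1000  = 5%
  (SELECT COUNT(*) FROM both_trans)   * 1.0 /
      (SELECT COUNT(*) FROM diaper_trans)               AS confidence           -- 50/200   = 25%
FROM totals t;

Frequent-pattern mining algorithms automate this kind of counting over all candidate rules rather than over a single hand-picked rule.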
Examples of association tasks in business and research include:
Investigating the proportion of subscribers to a company's cell phone plan that respond positively to an offer of a service upgrade.
Examining the proportion of children whose parents read to them who are themselves good readers.
Predicting degradation in telecommunications networks.
Finding out which items in a supermarket are purchased together and which items are never purchased together.
Determining the proportion of cases in which a new drug will exhibit dangerous side effects.

4.5 Types of Data

As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application.

Database Data: A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs provide mechanisms for defining database structures and data storage; for specifying and managing concurrent, shared, or distributed data access; and for ensuring consistency and security of the information stored despite system crashes or attempts at unauthorized access. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

Data Warehouses: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sum(sales_amount). A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data. Although data warehouse tools help support data analysis, additional tools for data mining are often needed for in-depth analysis. Multidimensional data mining (also called exploratory multidimensional data mining) performs data mining in multidimensional space in an OLAP style; that is, it allows the exploration of multiple combinations of dimensions at varying levels of granularity, and thus has greater potential for discovering interesting patterns representing knowledge.
Transactional Data: In general, each record in a transactional database captures a transaction, such as a customer's purchase, a flight booking, or a user's clicks on a web page. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction, such as the items purchased in the transaction. A transactional database may have additional tables, which contain other information related to the transactions, such as item descriptions, information about the salesperson or the branch, and so on.

Other Kinds of Data: Besides relational database data, data warehouse data, and transactional data, there are many other kinds of data that have versatile forms and structures and rather different semantic meanings. Such kinds of data can be seen in many applications: time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence data), data streams (e.g., video surveillance and sensor data, which are continuously transmitted), spatial data (e.g., maps), engineering design data (e.g., the design of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), graph and networked data (e.g., social and information networks), and the Web (a huge, widely distributed information repository made available by the Internet). These applications bring about new challenges, such as how to handle data carrying special structures (e.g., sequences, trees, graphs, and networks) and specific semantics (such as ordering, image, audio and video contents, and connectivity), and how to mine patterns that carry rich structures and semantics.

4.6 Data Pre-processing

The major steps involved in data pre-processing are data cleaning, data integration, data reduction, and data transformation.

1. Data Cleaning: Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. In this section, you will study basic methods for data cleaning.

1.1 Missing Values: Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods.

Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple, and such data could have been useful to the task at hand.

Fill in the missing value manually: In general, this approach is time consuming and may not be feasible for a large data set with many missing values.

Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or minus infinity. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely "Unknown."

1.2 Noisy Data: "What is noise?" Noise is a random error or variance in a measured variable.
Basic statistical description techniques (e.g., boxplots and scatter plots) and methods of data visualization can be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Let's look at the following data smoothing technique.

Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

2. Data Integration: Data mining often requires data integration, the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set, which in turn helps improve the accuracy and speed of the subsequent data mining process.

Entity Identification Problem: It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema integration. The metadata may also be used to help transform the data (e.g., where data codes for pay_type in one database may be "H" and "S" but 1 and 2 in another). Hence, this step also relates to data cleaning, as described earlier.

Chi-square Correlation Test for Nominal Data: For nominal data, a correlation relationship between two attributes, A and B, can be discovered by a chi-square test. Suppose A has c distinct values, a1, a2, ..., ac, and B has r distinct values, b1, b2, ..., br. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. Let (Ai, Bj) denote the joint event that attribute A takes on value ai and attribute B takes on value bj, that is, (A = ai, B = bj). Each possible (Ai, Bj) joint event has its own cell (or slot) in the table. The chi-square value (also known as the Pearson chi-square statistic) is computed as the sum over all cells of (oij - eij)^2 / eij, where oij is the observed frequency (actual count) of the joint event (Ai, Bj) and eij is its expected frequency under independence.

Tuple Duplication: In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case). The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy. Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.
For example, if a purchase order database contains attributes for the purchaser's name and address instead of a key to this information in a purchaser database, discrepancies can occur, such as the same purchaser's name appearing with different addresses within the purchase order database. 102 3. Data Reduction Imagine that you have selected data from the AllElectronics data warehouse for analysis. The data set will likely be huge! Complex data analysis and mining on huge amounts of data can take a long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. In this section, we first present an overview of data reduction strategies, followed by a closer look at individual techniques. 3.1 Wavelet Transforms: The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, …, xn), depicting n measurements made on the tuple from n database attributes. Histograms: Histograms use binning to approximate data distributions and are a popular form of data reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins. If each bucket represents only a single attribute–value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute. 4. Data Transformation Strategies Overview In data transformation, the data are transformed or consolidated into forms appropriate for mining. Strategies for data transformation include the following: Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual 103 total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. 4.7 Measures of Similarity and Dissimilarity Measuring Data Similarity and Dissimilarity In data mining applications, such as clustering, outlier analysis, and nearest-neighbor classification, we need ways to assess how alike or unalike objects are in comparison to one another. 
For example, a store may want to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing. A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others. Knowledge of object similarities can also be used in nearest-neighbor classification schemes, where a given object (e.g., a patient) is assigned a class label (relating to, say, a diagnosis) based on its similarity to other objects in the model.

Data Matrix versus Dissimilarity Matrix: In Section 2.2, we looked at ways of studying the central tendency, dispersion, and spread of observed values for some attribute X. Our objects there were one-dimensional, that is, described by a single attribute. In this section, we talk about objects described by multiple attributes. Therefore, we need a change in notation. Suppose that we have n objects (e.g., persons, items, or courses) described by p attributes (also called measurements or features, such as age, height, weight, or gender). The objects are x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value for object xi of the jth attribute. For brevity, we hereafter refer to object xi as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.

Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).

Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that are available for all pairs of the n objects. It is often represented by an n-by-n table.

Dissimilarity of Numeric Data: Minkowski Distance
In this section, we describe distance measures that are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan, and Minkowski distances. The most popular distance measure is Euclidean distance (i.e., straight line or "as the crow flies"). Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p numeric attributes. The Euclidean distance between objects i and j is defined as

d(i, j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xip - xjp)^2 )

Another well-known measure is the Manhattan (or city block) distance, named so because it is the distance in blocks between any two points in a city (such as 2 blocks down and 3 blocks over for a total of 5 blocks). It is defined as

d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|

Both are special cases of the Minkowski distance, d(i, j) = ( |xi1 - xj1|^h + |xi2 - xj2|^h + ... + |xip - xjp|^h )^(1/h), where h >= 1; h = 2 gives the Euclidean distance and h = 1 the Manhattan distance.

Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
Non-negativity: Distance is a non-negative number.
Identity of indiscernibles: The distance of an object to itself is 0.
Symmetry: Distance is a symmetric function, d(i, j) = d(j, i).
Triangle inequality: Going directly from object i to object j in space is no more than making a detour over any other object k, that is, d(i, j) <= d(i, k) + d(k, j).

Cosine Similarity: A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector.
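Before continuing with the document-vector example, the following minimal Python sketch (standard library only) computes the Euclidean, Manhattan, and general Minkowski distances defined above, together with the cosine similarity that the next paragraphs define for term-frequency vectors. The example vectors are made up for illustration.

    import math

    def minkowski(x, y, h):
        # General Minkowski distance; h=1 gives Manhattan, h=2 gives Euclidean.
        return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

    def euclidean(x, y):
        return minkowski(x, y, 2)

    def manhattan(x, y):
        return minkowski(x, y, 1)

    def cosine_similarity(x, y):
        # sim(x, y) = (x . y) / (||x|| ||y||)
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    x = (1, 2)   # object i, e.g. (xi1, xi2)
    y = (3, 5)   # object j

    print(euclidean(x, y))           # sqrt((1-3)^2 + (2-5)^2) = sqrt(13), about 3.606
    print(manhattan(x, y))           # |1-3| + |2-5| = 5
    print(cosine_similarity(x, y))   # (1*3 + 2*5) / (sqrt(5) * sqrt(34)), about 0.997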
For example, in Table 2.5, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric. Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison. Using the cosine measure as a similarity function, we have

sim(x, y) = (x . y) / (||x|| ||y||)

where ||x|| is the Euclidean norm of vector x and x . y is the inner (dot) product of the two vectors.

4.8 Data Mining Applications
In this section, we highlight a few representative applications of data mining and briefly discuss how its techniques are used in each.

o Data Mining Applications in Healthcare: Data mining applications in healthcare have tremendous potential and usefulness. However, the success of healthcare data mining hinges on the availability of clean healthcare data. In this respect, it is critical that the healthcare industry look into how data can be better captured, stored, prepared, and mined. Possible directions include the standardization of clinical vocabulary and the sharing of data across organizations to enhance the benefits of healthcare data mining applications.

o Data mining is used for market basket analysis: Data mining techniques are central to market basket analysis (MBA). When customers buy products, MBA helps find associations between the different items that they place in their shopping baskets. The discovery of such associations supports business decisions: retailers can identify customers' buying patterns, promote the purchase of related items together, and thereby increase profits.

o Data mining is an emerging trend in education systems around the world: In India, for instance, many parents are not formally educated, and the government's main aim is quality of education rather than quantity alone. Education systems keep changing, and in the 21st century a large number of universities have been established under the UGC, with ever more students enrolling across the country each year. With this huge number of higher-education aspirants, data mining technology can help bridge the knowledge gap in higher educational systems. The hidden patterns, associations, and anomalies that data mining techniques discover from educational data can improve decision-making processes in higher educational systems. Such improvements can bring advantages such as maximizing educational system efficiency, decreasing students' drop-out rate, increasing promotion, retention, transition, and success rates, improving learning outcomes, and reducing the cost of system processes. In the current era, KDD and data mining tools are used to extract this knowledge, which can then be used to improve the quality of education; decision tree classification is commonly used in this type of application.
o Data mining is now used in many different areas of manufacturing engineering: Data retrieved from a manufacturing system can be used for several purposes, such as finding errors in the data, enhancing the design methodology, improving data quality, and supporting decision making. In most cases the data are first analyzed to find hidden patterns, which can then be used to control the manufacturing process and further enhance product quality. Since the importance of data mining in manufacturing has clearly increased over the last 20 years, it is now appropriate to critically review its history and applications.

o Data mining applications can be generic or domain specific: A data mining system can be applied in a generic or a domain-specific way. Some generic data mining applications cannot take these decisions on their own but instead guide users in selecting the data, selecting the data mining method, and interpreting the results. A multi-agent-based data mining application, in contrast, is capable of automatically selecting the data mining technique to be applied. The Multi Agent System is used at different levels [8]: first at the level of concept hierarchy definition, and then at the result level to present the best-adapted decision to the user. This decision is stored in a knowledge base for use in later decision making. A Multi Agent System tool used for generic data mining system development employs different agents to perform different tasks.

4.9 Summary
Database technology has evolved from primitive file processing to the development of database management systems with query and transaction processing. Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Data pre-processing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. Data pre-processing includes data cleaning, data integration, data transformation, and data reduction. Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data integration combines data from multiple sources to form a coherent data store. Data transformation routines convert the data into appropriate forms for mining.

4.10 Keywords
Data mining, Prediction, Clustering, Data cleaning, Data integration, Data transformation, Data reduction, Missing values, Noisy data, Wavelet transforms, Histograms, Smoothing, Aggregation, Normalization, Discretization, Similarity, Dissimilarity, Applications.

4.11 Exercises
1. Define data mining.
2. Explain the challenges in data mining.
3. Explain various data mining tasks.
4. What are the types of data? Explain.
5. What is data pre-processing?
6. Describe the various stages of data pre-processing.
7. Explain measures of similarity and dissimilarity.
8. Explain any four data mining applications.

4.12 References
1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy Edition (PHI, New Delhi), Third Edition, 2009.
3.
Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009. 4. Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurasamy, "Advances in Knowledge Discovery and Data Mining", AAAI Press/ The MIT Press, 1996. 5. Michael Berry and Gordon Linoff, "Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997. 109 UNIT-5: FREQUENT PATTERNS FOR DATA MINING Structure 5.1 Objectives 5.2 Introduction 5.3 Basic Concepts and Algorithms of Mining Frequent Patterns 5.4 Associations, and Correlations, 5.5 Frequent Item set Generation 5.6 Rule Generation 5.7 Compact Representation of Frequent Item sets. 5.8 Summary 5.9 Keywords 5.10 Exercises 5.11 References 5.1 Objectives In this unit, you will study about the following concepts: The basic concepts of Mining frequent Patters and associated algorithms which we can employ An example :Market basket analysis will be discussed Several Algorithms related to Mining of frequent patterns are covered Some association rules are described Correlations are mentioned Frequent Item set Generation, Rule Generation Compact Representation of Frequent Item sets. 110 5.2 Introduction The topic of frequent pattern mining is indeed rich. This unit is dedicated to methods of frequent item set mining. We delve into the following questions: How can we find frequent item sets from large amounts of data, where the data are either transactional or relational? How can we mine association rules in multilevel and multidimensional space? Which association rules are the most interesting? How can we help or guide the mining procedure to discover interesting associations or correlations? How can we take advantage of user preferences or constraints to speed up the mining process? The techniques learned in this unit may also be extended for more advanced forms of frequent pattern mining, such as from sequential and structured data sets, as we will study in later units. In this section, you will learn methods for mining the simplest form of frequent pat-Terns such as those discussed for market basket analysis in Section We begin by Presenting Apriori, the basic algorithm for finding frequent item sets we look at how to generate strong association rules from frequent item-sets. This Unit describes several variations to the Apriori algorithm for improved efficiency and scalability. It presents pattern-growth methods for mining frequent item sets that confine the subsequent search space to only the data sets containing the current frequent item sets. It presents methods for mining frequent item sets that take advantage of the vertical data format. Finally, we discuss how the results of sequence mining can be applied in a real application domain. The sequence mining task is to discover a set of attributes, shared across time among a large number of objects in a given database. For example, consider the sales database of a bookstore, where the objects represent customers and the attributes represent authors or books. Let‘s say that the database records the books bought by each customer over a period of time. The discovered patterns are the sequences of books most frequently bought by the customers. Consider another example of a web access database at a popular site, where an object is a web user and an attribute is a web page. The discovered patterns are the sequences of most frequently accessed pages at that site. 
This kind of information can be used to restructure the web-site, or to dynamically insert relevant links in web pages based on user access patterns. 111 5.3 Basic concepts and algorithms of mining frequent patterns Frequent patterns are patterns (such as item sets, subsequences, or substructures) that appear in a data set frequently. For example, a set of items, such as milk and bread that appear frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as sub-graphs, sub-trees, or sub-lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks as well. Thus, frequent pattern mining has become an important data mining task and a focused theme in data mining research. In this unit, we introduce the concepts of frequent patterns, associations, and correlations, study how they can be mined efficiently. Frequent pattern mining searches for recurring relationships in a given data set. This section introduces the basic concepts of frequent pattern mining for the discovery of databases. By presenting an example of market basket analysis, the earliest form of frequent pattern mining for association rules. The basic concepts of mining frequent patterns and associations are presents a road map to the different kinds of frequent patterns, association rules, and correlation rules that can be mined. Interesting associations and correlations between itemsets in transactional and relational Market Basket Analysis: A Motivating Example Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining such patterns from their databases. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis. 112 A typical example of frequent itemset mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their ―shopping baskets‖. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space. Market basket analysis: Suppose, as manager of an All Electronics branch, you would like to learn more about the buying habits of your customers. 
Specifically, you wonder, "Which groups or sets of items are customers likely to purchase on a given trip to the store?" To answer your question, market basket analysis may be performed on the retail data of customer transactions at your store. You can then use the results to plan marketing or advertising strategies, or in the design of a new catalog. For instance, market basket analysis may help you design different store layouts. In one strategy, items that are frequently purchased together can be placed in proximity in order to further encourage the sale of such items together. If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display close to the software display may help increase the sales of both items. In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way. For instance, after deciding on an expensive computer, a customer may observe security systems for sale while heading toward the software display to purchase antivirus software, and may decide to purchase a home security system as well. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules. For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in Association Rule (5.1) below:

computer => antivirus_software [support = 2%, confidence = 60%]     (5.1)

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. A support of 2% for Association Rule (5.1) means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts. Additional analysis can be performed to uncover interesting statistical correlations between associated items.

Mining Frequent Patterns Without Candidate Generation
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied extensively in data mining research. Most of the previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there exist prolific patterns and/or long patterns.
In this study, we discuss a novel frequent pattern tree (FP-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient FP-tree-based mining method, FP-growth, for mining the complete set of frequent patterns by pattern fragment growth. Efficiency of mining is achieved with three techniques: (1) a large database is compressed into a highly condensed, much smaller data structure, which avoids costly, repeated database scans, (2) our FP-treebased mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. Our performance study shows that the FP-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the Apriori algorithm and also faster than some recently reported new frequent pattern mining methods. 114 Algorithms: Data Mining Algorithms in R/Frequent Pattern Mining/ FP-Growth Algorithm: In Data Mining, the task of finding frequent pattern in large databases is very important and has been studied in large scale in the past few years. Unfortunately, this task is computationally expensive, especially when a large number of patterns exist. The FP-Growth Algorithm, proposed by Han, is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefixtree structure for storing compressed and crucial information about frequent patterns named frequent-pattern tree (FP-tree). In his study, Han proved that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm and the Tree Projection. Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. Motivation: Finding inherent regularities in data What products were often purchased together? — Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalogue design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis Why is Frequent Pattern mining important? Discloses an intrinsic and important property of data sets Forms the foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: associative classification Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles 115 Association Rules Mining: A Recent Overview Association rule mining, one of the most important and well researched techniques of data mining. It aims to extract interesting correlations, frequent patterns, associations or casual structures among sets of items in the transaction databases or other data repositories. Association rules are widely used in various areas such as telecommunication networks, market and risk management, inventory control etc. 
Various association mining techniques and algorithms will be briefly introduced and compared later. Association rule mining aims to find the association rules that satisfy a predefined minimum support and confidence for a given database. The problem is usually decomposed into two sub-problems. One is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets under the constraint of minimal confidence. Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}. Association rules with this itemset are generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}; by checking its confidence, this rule can be determined to be interesting or not. Then other rules are generated by deleting the last item of the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. This process is iterated until the antecedent becomes empty. Since the second sub-problem is quite straightforward, most of the research focuses on the first sub-problem. The first sub-problem can be further divided into two sub-problems: candidate large itemset generation and frequent itemset generation. We call those itemsets whose support exceeds the support threshold large or frequent itemsets.

In many cases, the algorithms generate an extremely large number of association rules, often in the thousands or even millions. Further, the individual association rules are sometimes very long. It is nearly impossible for end users to comprehend or validate such a large number of complex association rules, thereby limiting the usefulness of the data mining results. Several strategies have been proposed to reduce the number of association rules, such as generating only "interesting" rules, generating only "non-redundant" rules, or generating only those rules satisfying certain other criteria such as coverage, leverage, lift, or strength.

Association Rules
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no time stamps (DNA sequencing). Each transaction is seen as a set of items (an itemset). Given a support threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k - 1.
Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine the frequent item sets among the candidates.

Method:
Initially, scan DB once to get the frequent 1-itemsets.
Generate length (k+1) candidate itemsets from length k frequent itemsets.
Test the candidates against DB.
Terminate when no frequent or candidate set can be generated.

Pseudo-code:
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != empty; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return the union of all Lk;

Important details of Apriori:
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?

Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3 * L3 gives:
    abcd from abc and abd
    acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}
(A runnable sketch of this generate-and-test procedure is given at the end of this section, after the discussion of SPADE.)

The SPADE Algorithm
Furthermore, most approaches use very complicated internal data structures which have poor locality and add additional space and computational overheads. SPADE (Sequential PAttern Discovery using Equivalence classes) is a new algorithm for the fast discovery of sequential patterns. SPADE utilizes combinatorial properties to decompose the original problem into smaller sub-problems that can be independently solved in main memory using efficient lattice search techniques and simple join operations. All sequences are discovered in only three database scans. Experiments show that SPADE outperforms the best previous algorithm by a factor of two, and by an order of magnitude with some pre-processed data. It also has linear scalability with respect to the number of input sequences and a number of other database parameters.

The task of discovering all frequent sequences in large databases is quite challenging. The search space is extremely large: for example, with m attributes there are O(m^k) potentially frequent sequences of length k. With millions of objects in the database, the problem of I/O minimization becomes paramount. However, most current algorithms are iterative in nature, requiring as many full database scans as the longest frequent sequence, which is clearly a very expensive process. Some of the methods, especially those using some form of sampling, can be sensitive to data skew, which can adversely affect performance. SPADE not only minimizes I/O costs by reducing database scans, but also minimizes computational costs by using efficient search schemes. The vertical id-list based approach is also insensitive to data skew. An extensive set of experiments shows that SPADE outperforms previous approaches by a factor of two, and by an order of magnitude if we have some additional off-line information. Furthermore, SPADE scales linearly in the database size and in a number of other database parameters.
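Returning to Apriori, here is the compact, self-contained Python sketch promised above, tying together the pseudo-code and the candidate-generation example (self-join followed by pruning and support counting). The transaction data and the minimum support count are made up for the example; this is a didactic sketch, not an optimized implementation (no hash tree is used for support counting).

    from itertools import combinations

    def apriori(transactions, min_support_count):
        """Return a dict mapping each frequent itemset (a frozenset) to its support count."""
        transactions = [frozenset(t) for t in transactions]

        # L1: frequent 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent = dict(current)

        k = 1
        while current:
            # Candidate generation by self-join: merge frequent k-itemsets sharing k-1 items.
            items = list(current)
            candidates = set()
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    union = items[i] | items[j]
                    if len(union) == k + 1:
                        # Prune: every k-subset of the candidate must itself be frequent.
                        if all(frozenset(sub) in current for sub in combinations(union, k)):
                            candidates.add(union)
            # Support counting with one pass over the database.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {s: c for s, c in counts.items() if c >= min_support_count}
            frequent.update(current)
            k += 1
        return frequent

    db = [{"bread", "milk"}, {"bread", "diapers", "beer"}, {"milk", "diapers", "beer"},
          {"bread", "milk", "diapers"}, {"bread", "milk", "beer"}]
    for itemset, count in sorted(apriori(db, 3).items(), key=lambda kv: (len(kv[0]), -kv[1])):
        print(set(itemset), count)

With a minimum support count of 3, this toy database yields the four frequent single items plus the frequent 2-itemset {bread, milk}, after which no further candidates survive and the loop terminates, exactly as the pseudo-code describes.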
Challenges of Frequent Pattern Mining Challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates Basic Concepts & Basic Association Rules Algorithms Let I=I1, I2, … , Im be a set of m distinct attributes, T be transaction that contains a set of items such that T ⊆ I, D be a database with different transaction records Ts. Anassociation rule is an implication in the form of X⇒Y, where X, Y ⊂ I are sets of items called itemsets, and X ∩ Y =∅. X is called antecedent while Y is called consequent, the rule means X implies Y. There are two important basic measures for association rules, support(s) and confidence(c). Since the database is large and users concern about only those frequently purchased items, usually thresholds of support and confidence are predefined by users to drop those rules that are not so interesting or useful. The two thresholds are called minimal support and minimal confidence respectively. Support(s) of an association rule is defined as the percentage/fraction of records that contain X ∪ Y to the total number of records in the database. Suppose the support of an item is 0.1%, it means only 0.1 percent of the transaction contain purchasing of this item. Confidence of an association rule is defined as the percentage/fraction of the number of transactions that contain X ∪ Y to the total number of records that contain X. 120 Confidence is a measure of strength of the association rules, suppose the confidence of the association rule X⇒Y is 80%, it means that 80% of the transactions that contain X also contain Y together. In general, a set of items (such as the antecedent or the consequent of a rule) is called an itemset. The number of items in an itemset is called the length of an itemset. Itemsets of some length k are referred to as k-itemsets. Generally, an association rules mining algorithm contains the following steps: • The set of candidate k-itemsets is generated by 1-extensions of the large (k -1)itemsets generated in the previous iteration. • Supports for the candidate k-itemsets are generated by a pass over the database. • Itemsets that do not have the minimum support are discarded and the remaining itemsets are called large k-itemsets. This process is repeated until no more large itemsets are found. The AIS algorithm was the first algorithm proposed for mining association rule. In this algorithm only one item consequent association rules are generated, which means that the consequent of those rules only contain one item, for example we only generate rules like X ∩ Y⇒Z but not those rules as X⇒Y∩ Z. The main drawback of the AIS algorithm is too many candidate itemsets that finally turned out to be small are generated, which requires more space and wastes much effort that turned out to be useless. At the same time this algorithm requires too many passes over the whole database. Apriori is more efficient during the candidate generation process. Apriori uses pruning techniques to avoid measuring certain itemsets, while guaranteeing completeness. These are the itemsets that the algorithm can prove will not turn out to be large. However there are two bottlenecks of the Apriori algorithm. One is the complex candidate generation process that uses most of the time, space and memory. Another bottleneck is the multiple scan of the database. 
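To make the two measures just defined concrete, the short sketch below computes support and confidence for a rule X => Y over a toy transaction database, following the definitions above. The transactions and the example rule ({bread} => {milk}) are made up for illustration.

    def support(transactions, itemset):
        """Fraction of transactions containing every item of `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

    def confidence(transactions, antecedent, consequent):
        """support(X union Y) / support(X) for the rule X => Y."""
        return support(transactions, set(antecedent) | set(consequent)) / support(transactions, antecedent)

    db = [{"bread", "milk"}, {"bread", "diapers"}, {"bread", "milk", "beer"}, {"milk", "beer"}]

    print(support(db, {"bread", "milk"}))        # 2 of 4 transactions contain both: 0.5
    print(confidence(db, {"bread"}, {"milk"}))   # 0.5 / 0.75, about 0.667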
Based on Apriori algorithm, many new algorithms were designed with some modifications or improvements. 121 Increasing the Efficiency of Association Rules Algorithms The computational cost of association rules mining can be reduced in four ways: • by reducing the number of passes over the database • by sampling the database • by adding extra constraints on the structure of patterns • through parallelization. In recent years much progress has been made in all these directions. Reducing the number of passes over the database FP-Tree, frequent pattern mining, is another milestone in the development ofassociation rule mining, which breaks the main bottlenecks of the Apriori. The frequent itemsets are generated with only two passes over the database and without any candidate generation process. FP-tree is an extended prefix-tree structure storing crucial, quantitative information about frequent patterns. Only frequent length-1 items will have nodes in the tree, and the tree nodes are arranged in such a way that more frequently occurring nodes will have better chances of sharing nodes than less frequently occurring ones. FP-Tree scales much better than Apriori because as the support threshold goes down, the number as well as the length of frequent itemsets increase dramatically. The candidate sets that Apriori must handle become extremely large, and the pattern matching with a lot of candidates by searching through the transactions becomes very expensive. The frequent patterns generation process includes two sub processes: constructing the FT-Tree, and generating frequent patterns from the FP-Tree. The mining result is the same with Apriori series algorithms. To sum up, the efficiency of FP-Tree algorithm account for three reasons. First, the FP-Tree is a compressed representation of the original database because only those frequent items are used to construct the tree, other irrelevant information are pruned. Secondly this algorithm only scans the database twice. Thirdly, FP-Tree uses a divide and conquer method that considerably reduced the size of the subsequent conditional FP-Tree. Every algorithm has his limitations, for FP-Tree it is difficult to be used in an interactive mining system. During the interactive mining process, users may change the threshold of support according to the rules. However for FP-Tree the changing of support may lead to repetition of the whole mining process. Another limitation is that 122 FP-Tree is that it is not suitable for incremental mining. Since as time goes on databases keep changing, new datasets may be inserted into the database, those insertions may also lead to a repetition of the whole process if we employ FP-Tree algorithm. Tree Projection is another efficient algorithm recently proposed in. The general idea of Tree Projection is that it constructs a lexicographical tree and projects a large database into a set of reduced, itembased sub-databases based on the frequent patterns mined so far. The number of nodes in its lexicographic tree is exactly that of the frequent itemsets. The efficiency of Tree Projection can be explained by two main factors: (1) the transaction projection limits the support counting in a relatively small space; and (2) the lexicographical tree facilitates the management and counting of candidates and provides the flexibility of picking efficient strategy during the tree generation and transaction projection phrases. 
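The practical payoff of these efficiency improvements is easy to see with an off-the-shelf implementation. The sketch below is a hypothetical usage example: it assumes the third-party Python library mlxtend is installed and that its frequent-pattern interface (TransactionEncoder, apriori, fpgrowth) behaves as described in its documentation, so treat the exact names and signatures as an assumption to verify. Both functions should return the same frequent itemsets; FP-growth simply avoids the repeated candidate generation that Apriori performs.

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, fpgrowth

    # Toy transaction database (made up for the example).
    dataset = [["bread", "milk"],
               ["bread", "diapers", "beer"],
               ["milk", "diapers", "beer"],
               ["bread", "milk", "diapers"],
               ["bread", "milk", "beer"]]

    # One-hot encode the transactions into a boolean DataFrame.
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

    # Same task, two algorithms: candidate generate-and-test versus FP-tree based mining.
    freq_apriori = apriori(onehot, min_support=0.6, use_colnames=True)
    freq_fpgrowth = fpgrowth(onehot, min_support=0.6, use_colnames=True)

    print(freq_apriori)
    print(freq_fpgrowth)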
5.8 Summary In this unit, we learnt about the following concepts: Frequent patterns which are patterns (such as itemsets, subsequences, or substructures) that appears in a data set frequently. Market basket analysis is just one form of frequent pattern mining. In fact, there are many kinds of frequent patterns, association rules, and correlation relationships. We begin by presenting Apriori, the basic algorithm for finding frequent itemsets we look at how to generate strong association rules from frequent itemsets. Describes several variations to the Apriori algorithm for improved efficiency and scalability. 5.9 Keywords Apriori, Itemset, Pattern, Market basket analysis, Frequent Pattern mining 5.10 Exercises 1. Explain the Market Basket Analysis. 2. Explain Association and Correlation. 3. Explain apriori algorithm. 123 4. Discuss frequent pattern mining concept and algorithm. 5. Write short notes on association rules. 6. Why is frequent pattern mining important? 7. How do you mine frequent patterns without candidate generation? 8. Explain SPADE algorithm. 5.11 References 1. Data Mining Concepts and techniques by Jiawei Han and Micheline Kamber. 2. Fast Algorithms for Mining Association Rules by Rakesh Agrawal, Ramakrishnan Srikant 3. Discovering Frequent Closed Itemsets for Association Rules Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal 4. Introduction to Data Mining by Tan, Steinbach, Kumar 124 UNIT-6: FP GROWTH ALGORITHMS Structure 6.1 Objectives 6.2 Introduction 6.3 Alternative methods for generating Frequent Itemsets 6.4 FP Growth Algorithm 6.5 Evaluation of Association Patterns 6.6 Summary 6.7. Keywords 6.8 Exercises 6.9 References 6.1 Objectives In this unit, we will learn alternative methods for generating frequent itemsets, description of FP growth algorithm. We also learn about how the evaluation of association patterns is made. 6.2 Introduction This section highlights the alternative methods that are available for generating frequent itemsets. The Apriori Algorithm that was explained is one of the most widely used algorithms in association mining but it is not without its limitations. When there is a dense data set the Apriori Algorithm performance lessens and another disadvantage is that it has a high overhead. Butter →Bread, Chocolate → Teddy, Bear, Beer → Diapers, which of these three seem interesting to you? Which of these three might affect the way you do business? We all can already assume that most people who buy bread will buy butter and if I were to tell you that I have analysis to show that most customers who buy chocolate also buy a teddy bear you wouldn‘t be 125 surprised. But what if we told you about a link between Beer and Diapers, would not that spark your interest? 6.3 Alternative methods for generating frequent item sets The way that the transaction data set is represented can also affect the performance of the algorithm. The more popular representation is the vertical data layout which has been shown in previous section examples; an alternative representation is the horizontal data layout. The horizontal data layout can only be used for smaller data sets because the initial layout transaction data set might not be able to fit into main memory. The traversal of the itemset lattice is another crucial area that has improved over the last couple of years. 
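For reference, the two transaction-data representations mentioned above can be sketched as follows (a minimal illustration with made-up transaction IDs). The horizontal layout stores, for each transaction, the items it contains, while the vertical layout stores, for each item, the list of transaction IDs (a TID-list) in which it appears.

    # Horizontal data layout: one record per transaction (TID -> items).
    horizontal = {
        1: {"bread", "milk"},
        2: {"bread", "diapers", "beer"},
        3: {"milk", "diapers", "beer"},
    }

    # Vertical data layout: one record per item (item -> TID-list).
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)

    # In the vertical layout, the support count of an itemset is the size of the
    # intersection of its items' TID-lists, e.g. support count of {diapers, beer}:
    print(len(vertical["diapers"] & vertical["beer"]))   # 2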
The rest of this section introduces alternative methods for generating frequent itemsets that take the aforementioned limitations into account and try to improve the efficiency of the Apriori algorithm.

Mining Closed Frequent Itemsets
We have seen that frequent itemset mining may generate a huge number of frequent itemsets, especially when the min_sup threshold is set low or when there exist long patterns in the data set. Closed frequent itemsets can substantially reduce the number of patterns generated in frequent itemset mining while preserving the complete information regarding the set of frequent itemsets. That is, from the set of closed frequent itemsets, we can easily derive the set of frequent itemsets and their support. Thus in practice, it is usually more desirable to mine the set of closed frequent itemsets rather than the set of all frequent itemsets.

"How can we mine closed frequent itemsets?" A naive approach would be to first mine the complete set of frequent itemsets and then remove every frequent itemset that is a proper subset of, and carries the same support as, an existing frequent itemset. However, this is quite costly: to obtain a length-100 frequent itemset, this method would have to first derive 2^100 - 1 frequent itemsets, all before it could begin to eliminate redundant itemsets. This is prohibitively expensive, even though there may exist only a very small number of closed frequent itemsets in such a data set. A recommended methodology is to search for closed frequent itemsets directly during the mining process. This requires us to prune the search space as soon as we can identify closed itemsets during mining. Pruning strategies include the following:

Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X union Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y. For example, suppose the projected conditional database for the prefix itemset {I5: 2} is {{I2, I1}, {I2, I1, I3}}; each of its transactions contains itemset {I2, I1} but no proper superset of {I2, I1}. Itemset {I2, I1} can therefore be merged with {I5} to form the closed itemset {I5, I2, I1: 2}, and we do not need to mine for closed itemsets that contain I5 but not {I2, I1}.

Traversal of Itemset Lattice Methods:
General-to-Specific: This is the strategy employed by the Apriori algorithm. It uses prior frequent itemsets to generate future itemsets; by this we mean that it uses frequent (k-1)-itemsets to generate candidate k-itemsets. This strategy is effective when the maximum length of a frequent itemset is not too long.
Specific-to-General: This strategy is the reverse of general-to-specific and does as the name suggests: it starts with more specific frequent itemsets before finding general frequent itemsets. This strategy helps us discover maximal frequent itemsets in dense transactions where the frequent itemset border is located near the bottom of the lattice.
Bi-directional: This strategy is a combination of the previous two, and it is useful because it can rapidly identify the frequent itemset border. It is highly efficient when the frequent itemset border isn't located at either extreme (each extreme being conveniently handled by one of the previous strategies). The only limitation is that it requires more space to store the candidate itemsets.
Equivalence Classes This strategy breaks the lattice equivalence classes as shown in the figure and it then employs a frequent itemset generation algorithm that searches through each class thoroughly before moving to the next class. This is beneficial when a certain itemset is known to be a frequent itemset. For instance, if we knew that the first two items are predominant in most transactions, partitioning the lattice based on the prefix as shown in the diagram might prove advantageous, we can also partition the lattice based on its suffix. 128 Breadth-First As the name suggests this strategy applies a breadth-first approach to the lattice. This method searches through the 1-itemset row first, discovers all the frequent 1-itemsets before going to the next level to find frequent 2-itemsets. Depth-First This strategy traverses the lattice in a depth-first manner. The algorithm picks a certain 1itemset and follows it through all the way down until an infrequent itemset is found with the 1-itemset as its prefix. This method is useful because it helps us determine the border between frequent and infrequent itemsets more quickly than use the breadth-first approach. 6.4 FP Growth Algorithm The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate generations, thus improving performance. For so much it uses a divide-and129 conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the itemset association information. In simple words, this algorithm works as follows: first it compresses the input database creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. Using this strategy, the FPGrowth reduces the search costs looking for short patterns recursively and then concatenating them in the long frequent patterns, offering good selectivity. In large databases, it‘s not possible to hold the FP-tree in the main memory. A strategy to cope with this problem is to firstly partition the database into a set of smaller databases (called projected databases), and then construct an FP-tree from each of these smaller databases. The next subsections describe the FP-tree structure and FP-Growth Algorithm, finally an example is presented to make it easier to understand these concepts. FP-Tree structure The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database. Han defines the FP-tree as the tree structure defined below: • One root labelled as ―null‖ with a set of item-prefix subtrees as children, and a frequent-item-header table (presented in the left side of Figure 1); • Each node in the item-prefix subtree consists of three fields: • Item-name: registers which item is represented by the node; • Count: the number of transactions represented by the portion of the path reaching the node; • Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none. • Each entry in the frequent-item-header table consists of two fields: • Item-name: as the same to the node; • Head of node-link: a pointer to the first node in the FP-tree carrying the itemname. Additionally the frequent-item-header table can have the count support for an item. 
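As a concrete companion to this structure definition (and to Algorithm 1 presented below), here is a minimal Python sketch of an FP-tree node and of the insertion step. The field names mirror the description above (item-name, count, node-link); the header-table handling is deliberately simplified for illustration, and the sample transactions are assumed to be already restricted to frequent items and sorted in descending support order.

    class FPNode:
        def __init__(self, item, parent):
            self.item = item        # item-name (None for the "null" root)
            self.count = 0          # number of transactions sharing this path prefix
            self.parent = parent
            self.children = {}      # item-name -> child FPNode
            self.node_link = None   # next node in the tree carrying the same item-name

    def insert_tree(sorted_items, node, header_table):
        """Insert one frequency-ordered transaction into the FP-tree rooted at `node`."""
        if not sorted_items:
            return
        first, rest = sorted_items[0], sorted_items[1:]
        child = node.children.get(first)
        if child is None:
            child = FPNode(first, node)
            node.children[first] = child
            # Thread the new node onto the header table's node-link chain (simplified).
            child.node_link = header_table.get(first)
            header_table[first] = child
        child.count += 1
        insert_tree(rest, child, header_table)

    root = FPNode(None, None)
    header = {}
    for trans in [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I1", "I3"]]:
        insert_tree(trans, root, header)
    print(root.children["I2"].count)   # 3: all three transactions share the I2 prefix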
Figure 6.1 below shows an example of an FP-tree.

Figure 6.1: An example of an FP-tree

The original algorithm to construct the FP-tree, as defined by Han, is presented below.

Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item. Sort F in support-descending order as FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as "null". For each transaction Trans in DB do the following: select the frequent items in Trans and sort them according to the order of FList. Let the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T).

The function insert_tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, with its count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.

By using this algorithm, the FP-tree is constructed in two scans of the database. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.

Algorithm 2: FP-Growth
Input: A database DB, represented by the FP-tree constructed according to Algorithm 1, and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: Call FP-growth(FP-tree, null).

Procedure FP-growth(Tree, α) {
(01) if Tree contains a single prefix path then {   // Mining a single prefix-path FP-tree
(02)   let P be the single prefix-path part of Tree;
(03)   let Q be the multipath part with the top branching node replaced by a null root;
(04)   for each combination (denoted as β) of the nodes in the path P do
(05)     generate pattern β ∪ α with support = minimum support of nodes in β;
(06)   let freq_pattern_set(P) be the set of patterns so generated; }
(07) else let Q be Tree;
(08) for each item ai in Q do {                      // Mining a multipath FP-tree
(09)   generate pattern β = ai ∪ α with support = ai.support;
(10)   construct β's conditional pattern base and then β's conditional FP-tree Tree_β;
(11)   if Tree_β ≠ Ø then
(12)     call FP-growth(Tree_β, β);
(13)   let freq_pattern_set(Q) be the set of patterns so generated; }
(14) return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
}

When the FP-tree contains a single prefix path, the complete set of frequent patterns can be generated in three parts: the single prefix-path P, the multipath Q, and their combinations (lines 01 to 03 and 14). The resulting patterns for a single prefix path are the enumerations of its subpaths that have the minimum support (lines 04 to 06). Thereafter, the multipath Q is defined (line 03 or 07) and the resulting patterns from it are processed (lines 08 to 13). Finally, in line 14 the combined results are returned as the frequent patterns found.

6.5 Evaluation of Association Patterns
After the creation of association rules we must decide which rules are actually interesting and of use to us. Even a market basket data set with about 10 transactions over 5 items can give rise to a hundred or more association rules, so we need to be able to sift through all these patterns and identify the most interesting ones.
Interestingness is the term coined for the degree to which we consider a pattern to be of interest; it can be assessed by subjective and objective measures. Subjective measures are those that depend on the class of users who examine the pattern. For instance, of the two examples in the introduction, Teddy Bear → Chocolate and Beer → Diapers, the pattern Teddy Bear → Chocolate can be considered subjectively uninteresting because it doesn't reveal any information that isn't expected. Incorporating subjective knowledge into pattern evaluation is a complex task and is beyond the scope of this introductory course. An objective measure, on the other hand, uses statistical information that can be derived from the data to determine whether a particular pattern is interesting; support and confidence are both examples of objective measures of interestingness. These measures can be applied independently of a particular application. However, there are limitations when we try to use just the numerical support and confidence to determine the usefulness of a particular rule, and because of these limitations other measures have been used to evaluate the quality of an association pattern. The rest of this section covers the details of objective measures of interestingness.

Objective Measures of Interestingness

Lift: This is the most popular objective measure of interestingness. It computes the ratio between the rule's confidence and the support of the itemset in the rule's consequent:

Lift(A → B) = c(A → B) / s(B)

Interest Factor: This is the binary-variable equivalent of the lift. Basically it compares the frequency of a pattern against a baseline frequency computed under the assumption of statistical independence:

I(A, B) = s(A, B) / (s(A) × s(B)) = N f11 / (f1+ f+1)

The interest factor tells you whether the itemsets are independent of each other, positively correlated, or negatively correlated:
I(A, B) = 1 if A and B are independent;
I(A, B) > 1 if A and B are positively correlated;
I(A, B) < 1 if A and B are negatively correlated.

This measure is not without its own limitations: when dealing with association rules in which the itemsets have high support, the interest factor ends up being close to 1, which suggests that the itemsets are independent. This can be a false conclusion, and in such situations using the confidence measure is a better choice.

Correlation Analysis: This is another objective measure used to analyze relationships between a pair of variables. For binary variables, correlation can be measured using the φ-coefficient:

φ = (f11 f00 - f01 f10) / sqrt(f1+ f+1 f0+ f+0)

where
φ = 0 if the variables are independent;
φ = +1 for a perfect positive correlation;
φ = -1 for a perfect negative correlation.

Limitations: The correlation measure does not remain invariant when there are proportional changes to the sample size. Another limitation is that it gives equal importance to both the co-presence and the co-absence of items in the transactions, and so it is more suitable for the analysis of symmetric binary variables.

IS Measure: This is an objective measure of interestingness that was proposed to help deal with the limitations of the correlation measure. It is defined as

IS(A, B) = s(A, B) / sqrt(s(A) × s(B))

that is, the geometric mean of the interest factor and the support of the pattern. Its limitation is that its value can be large even for uncorrelated and negatively correlated patterns.

6.6 Summary
In this unit, we have studied the FP growth algorithm and alternative methods for generating frequent item sets. We have also discussed the evaluation of association patterns and the measures used for that purpose.

6.7 Keywords
FP-Growth algorithm, Apriori algorithm, Association patterns.

6.8 Exercises
1.
Discuss in brief alternative methods of generating frequent item sets. 2. Explain traversal of itemsets based on lattice methods. 3. Explain FP-growth algorithm. 4. Write short notes on FP tree structure. 5. Devise an algorithm to construct an FP-tree. 6. How do you evaluate association patterns? 6.9 References 1. Introduction to Data Mining with Case Studies, by Gupta G. K 2. Data & Text Mining - Business Applications Approach by Thomas W Miller 135 UNIT-7: CLASSIFICATION AND PREDICTION Structure 7.1 Objectives 7.2 Introduction 7.3 Basics of Classification 7.4 General approach to solve classification problem 7.5 Prediction 7.6 Issues Regarding Classification and Prediction 7.7 Summary 7.8 Keywords 7.9 Exercises 7.10 References 7.1 OBJECTIVES In this unit we will learn about basics of classification followed by a brief introduction of general approach to solve classification problem, description of predictions and description of issues regarding classification and prediction. 7.2 Introduction Databases are rich with hidden information that can be used for making intelligent business decisions. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Whereas classification predicts categorical labels, prediction models continuous-valued functions. For example, a classification model may be built to predict the expenditures of potential customers on computer equipment given their income and occupation. Many classification and prediction methods have been proposed by researches in machine learning, expert systems, statistics, and neurobiology. Most algorithms are memory resident, typically assuming a small data size. Recent database mining research has built on such work, 136 developing scalable classification and prediction techniques capable of handling large diskresident data. These techniques often consider parallel and distributed processing. Classification may refer to categorization, the process in which ideas and objects are recognized, differentiated, and understood. It is the processes of assigning the data to the predefined classes. Modern systems analysis, which is a tool for complex analysis of objects, is based on the technology of data mining as a tool for identification of structures and laws under not only adequate but also incomplete information. Data mining algorithms primarily include methods for the reconstruction of dependences in identification, classification, and clusterization of objects. What is classification? Following are the examples of cases where the data analysis task is called as Classification: A bank loan officer wants to analyze the data in order to know which customers (loan applicant) are risky or which are safe. A marketing manager at a company needs to analyze to guess a customer with a given profile will buy a new computer. In both of the above examples a model or classifier is constructed to predict categorical labels. These labels are risky or safe for loan application data and yes or no for marketing data. What is prediction? Following are the examples of cases where the data analysis task is called as Prediction: Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are bother to predict a numeric value. Therefore the data analysis task is example of numeric prediction. 
In this case a model or predictor will be constructed that predicts a continuous-valuedfunction or ordered value. 7.3 Basics of Classification Definition: Given a collection of records (training set) each record contains a set of attributes; one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. 137 Goal: Previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. When the class is numerical, the problem is a regression problem where the model constructed predicts a continuous valued function, or ordered value, as opposed to a class label. This model is prediction. Regression analysis is a statistical methodology that is most used for numeric prediction. Classification is a two-Step Process: 1. Model construction: Describing a set of predetermined classes. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. The set of tuples used for model construction is training set. The model is represented as classification rules, decision trees, or mathematical formulae. 2. Model usage: For classifying future or unknown objects, estimate the accuracy of the model. The known label of test sample is compared with the classified result from the model. Accuracy rate is the percentage of test set samples that are correctly classified by the model. Test set is independent of training set, otherwise over-fitting will occur Classification V/s Predictio: A bank loans officer needs analysis of her data in order to learn which loan applicants are ―safe‖ and which are ―risky‖ for the bank. A marketing manager at All Electronics needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as ―safe‖ or ―risky‖ for the loan application data; ―yes‖ or ―no‖ for the marketing data; or ―treatment A,‖ ―treatment B,‖ or ―treatment C‖ for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes. Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at All Electronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor. Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the two terms 138 are often used synonymously. We do not treat the two terms as synonyms, however, because several other methods can be used for numeric prediction, as we shall see later in this chapter. Classification and numeric prediction are the two major types of prediction problems. For simplicity, when there is no ambiguity, we will use the shortened term of prediction to refer to numeric prediction. 
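The distinction between classification and numeric prediction can be made concrete with a few lines of code. The sketch below trains a classifier for a categorical label and a regression model for a numeric target; scikit-learn is assumed to be available, and the tiny data sets (ages, incomes, labels, spend amounts) are invented purely for illustration.

# Assumes scikit-learn is installed; all data values below are illustrative only.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a categorical label ("safe"/"risky") from age and income.
X_cls = [[25, 30000], [40, 80000], [35, 20000], [50, 90000]]   # [age, income]
y_cls = ["risky", "safe", "risky", "safe"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[30, 75000]]))        # -> a class label such as 'safe'

# Numeric prediction: predict a continuous value (expected spend) from the same attributes.
X_reg = X_cls
y_reg = [120.0, 900.0, 80.0, 1100.0]     # amount spent during a sale
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[30, 75000]]))        # -> a continuous-valued estimate

The first model returns one of a fixed set of class labels, while the second returns an ordered, continuous value; this is exactly the classifier-versus-predictor distinction described above.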
How does classification work? Data classification is a two-step process. In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or ―learning from‖ a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, : : : , xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, : : : , An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are selected from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects. Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is ―supervised‖ in that it is told 7.4 General approach to solve classification problem ―How does classification work?‖ Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data). The process is shown for the loan application data of Figure 7.1 . (The data are simplified for illustrative purposes. In reality, we may expect many more attributes to be considered. 139 Figure 7.1. Data Classification Process The data classification process involves: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples. 140 In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or ―learning from‖ a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, depicting n measurements made on the tuple from n database attributes respectively. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects. Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is ―supervised‖ in that it is told to which class each training tuple belongs). 
It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance. For example, if we did not have the loan_decision data available for the training set, we could use clustering to try to determine ―groups of like tuples,‖ which may correspond to risk groups within the loan application data. This first step of the classification process can also be viewed as the learning of a mapping or function, , that can predict the associated class label y of a given tuple X. In this view, we wish to learn a mapping or function that separates the data classes. Typically, this mapping is represented in the form of classification rules, decision trees, or mathematical formulae. The rules can be used to categorize future data tuples, as well as provide deeper insight into the data contents. They also provide a compressed data representation. ―What about classification accuracy?‖ Firstly, the predictive accuracy of the classifier is estimated. If we were to use the training set to measure the classifier's accuracy, this estimate would likely be optimistic, because the classifier tends to over fit the data (i.e., during learning it may incorporate some particular anomalies of the training data that are not present in the general data set overall). Therefore, a test set is used, made up of test tuples and their associated class labels. They are independent of the training tuples, meaning that they were not used to construct the classifier. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. The associated class label of each test tuple is compared with the learned classifier's class prediction for that tuple. 141 If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known. (Such data are also referred to in the machine learning literature as ―unknown‖ or ―previously unseen‖ data.). 7.5 Prediction Many forms of data mining are predictive. For example, a model might predict income based on education and other demographic factors. Predictions have an associated probability (How likely is this prediction to be true?). Prediction probabilities are also known as confidence (How confident can I be of this prediction?). Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might specify that a person who has a bachelor's degree and lives in a certain neighbourhood is likely to have an income greater than the regional average. Rules have an associated support (What percentage of the population satisfies the rule?). Prediction is nothing but models continuous-valued functions i.e., predicts unknown or missing values. 7.6 Issues Regarding Classification and Prediction This section describes issues regarding pre-processing the data for classification and prediction. Criteria for the comparison and evaluation of classification methods are also described. Preparing the Data for Classification and Prediction The following pre-processing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process. 
Data cleaning: This refers to the pre-processing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning. Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest 142 that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task. Including such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the time spent on relevance analysis, when added to the time spent on learning from the resulting ―reduced‖ attribute (or feature) subset, should be less than the time that would have been spent on learning from the original set of attributes. Hence, such analysis can help improve classification efficiency and scalability. Data transformation and reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as �1:0 to 1:0, or 0:0 to 1:0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from out weighing attributes with initially smaller ranges (such as binary attributes). The data can also be transformed by generalizing it to higher-level concepts. Concept hierarchies may be used for this purpose. This is particularly useful for continuous valued attributes. For example, numeric values for the attribute income can be generalized to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can be generalized to higher-level concepts, like city. Because generalization compresses the original training data, fewer input/output operations may be involved during learning. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principle components analysis to discretization techniques, such as binning, histogram analysis, and clustering. Comparing Classification and Prediction Methods Here are the criteria for comparing methods of Classification and Prediction: 143 Accuracy - Accuracy of classifier refers to ability of classifier predict the class label correctly and the accuracy of predictor refers to how well a given predictor can guess the value of predicted attribute for a new data. Speed - This refers to the computational cost in generating and using the classifier or predictor. 
Robustness - It refers to the ability of classifier or predictor to make correct predictions from given noisy data. Scalability - Scalability refers to ability to construct the classifier or predictor efficiently given large amount of data. Interpretability - This refers to what extent the classifier or predictor understand. 7.7 Summary In this unit, we studied about classification and prediction methods. The pre-processing issues for classification/prediction are also discussed in brief. Comparative analogy is provided to understand classification and prediction processes. 7.8 Key words Prediction, Classification, Data cleaning, Data pre-processing. 7.9 Exercises 1. Explain the basic concepts of Classification. 2. Discuss the issues regarding data preparation in classification process. 3. What are the criteria used for comparing classification to prediction? 4. Discuss the general strategy of classification process. 7.10 References 1. Data Mining Techniques, Arun K Pujari, 1st Edition. 2. Data Mining Concepts and Techniques, Jiawei Han and Micheline Kamber. 3. Data Mining: Introductory and Advanced topics, Margaret H Dunham PEA. 4. The Data Warehouse lifecycle toolkit, Ralph Kimball Wiley Student Edition. 144 UNIT-8: APPROACHES FOR CLASSIFICATION Structure 8.1 Objectives 8.2 Introduction 8.3 Basics of Probability Theory 8.4 Statement and Interpretation 8.5 Examples and Applications 8.6 Advantages and disadvantages of Bayesian methods 8.7 Bayesian Classifier 8.8 Classification by decision tree induction 8.9 Rule based classification 8.10 Summary 8.11 Keywords 8.12 Exercises 8.13 References 8.1 Objectives In this unit we will learn about The basics of probability theory The statement and examples of Bayes theorem The applications of Bayes theorem Bayesian classification Decision tree and its application for classification. Rule based classification Pruning Tree 145 8.2 Introduction Bayesian classifiers: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes‘ theorem, described below. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naive Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases. Rule-Based Classification: Here we look at rule-based classifiers, where the learned model is represented as a set of IF-THEN rules. We first examine how such rules are used for classification. We then study ways in which they can be generated, either from a decision tree or directly from the training data using a sequential covering algorithm. 8.3 Basics of Probability Theory In the logic based approaches we have still assumed that everything is either believed false or believed true. However, it is often useful to represent the fact that we believe that something is probably true, or true with probability (say) 0.65. This is useful for dealing with problems where there is genuine randomness and unpredictability in the world (such as in games of chance) and also for dealing with problems where we could, if we had sufficient information, work out exactly what is true in the world, but where this is impractical. It is possible to have solution through the concept of probability. 
Probability theory is the branch of mathematics concerned with probability, the analysis of random phenomena. The central objects of probability theory are random variables, stochastic processes, and events: mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion. If an individual coin toss or the roll of dice is considered to be a random event, then if repeated many times the sequence of random events will exhibit certain patterns, which can be studied and predicted. Two representative mathematical results describing such patterns are the law of large numbers and the central limit theorem. As a mathematical foundation for statistics, probability theory is essential to many human activities that involve quantitative analysis of large sets of data. Methods of probability 146 theory also apply to descriptions of complex systems given only partial knowledge of their state, as in statistical mechanics. A great discovery of twentieth century physics was the probabilistic nature of physical phenomena at atomic scales, described in quantum mechanics. In probability theory and statistics, Bayes's theorem (alternatively Bayes's law or Bayes's rule) is a theorem with two distinct interpretations. In the Bayesian interpretation, it expresses how a subjective degree of belief should rationally change to account for evidence. In the frequentist interpretation, it relates inverse representations of the probabilities concerning two events. In the Bayesian interpretation, Bayes' theorem is fundamental to Bayesian statistics, and has applications in fields including science, engineering, medicine and law. The application of Bayes' theorem to update beliefs is called Bayesian inference. Introductory example: If someone told you they had a nice conversation in the train, the probability it was a woman they spoke with is 50%. If they told you the person they spoke to was going to visit a quilt exhibition, it is far more likely than 50% it is a woman. Call W the event "they spoke to a woman", and Q, the event "a visitor of the quilt exhibition". Then: P(W) = 0.50; but with the knowledge of Q, the updated value is P(W|Q) that may be calculated with Bayes' formula as: in which M (man) is the complement of W. As P(M) = P(W) = 0.5 and P(Q|W) >> P(Q|M), the updated value will be quite close to 1. 8.4 Statement and Interpretation Mathematically, Bayes' theorem gives the relationship between the probabilities of A and B, P(A) and P(B), and the conditional probabilities of A given B and B given A, P(A|B) and P(B|A). In its most common form, it is: The meaning of this statement depends on the interpretation of probability ascribed to the terms: 147 Bayesian interpretation: In the Bayesian (or epistemological) interpretation, probability measures a degree of belief. Bayes' theorem then links the degree of belief in a proposition before and after accounting for evidence. For example, suppose somebody proposes that a biased coin is twice as likely to land heads as tails. Degree of belief in this might initially be 50%. The coin is then flipped a number of times to collect evidence. Belief may rise to 70% if the evidence supports the proposition. For proposition A and evidence B, P(A), the prior, is the initial degree of belief in A. P(A|B), the posterior, is the degree of belief having accounted for B. P(B|A)/P(B) represents the support B provides for A. 
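The quilt-exhibition example above can be worked through numerically. In the sketch below, only P(W) = P(M) = 0.5 comes from the text; the conditional probabilities P(Q|W) and P(Q|M) are invented purely to illustrate how the belief is updated.

# Bayes' theorem for the train conversation example:
# P(W|Q) = P(Q|W) P(W) / P(Q), with P(Q) expanded by the law of total
# probability over W (woman) and M (man).
p_w, p_m = 0.5, 0.5          # priors from the text
p_q_given_w = 0.10           # assumed: 10% of women visit the quilt exhibition
p_q_given_m = 0.001          # assumed: far fewer men do, so P(Q|W) >> P(Q|M)

p_q = p_q_given_w * p_w + p_q_given_m * p_m
p_w_given_q = p_q_given_w * p_w / p_q
print(round(p_w_given_q, 3))  # about 0.99 -- the updated belief is close to 1

As the text notes, because P(Q|W) is much larger than P(Q|M), the posterior P(W|Q) ends up close to 1 even though the prior was only 0.5.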
Frequentist interpretation: In the frequentist interpretation, probability is defined with respect to a large number of trials, each producing one outcome from a set of possible outcomes, . An event is a subset of . The probability of event A, P(A), is the proportion of trials producing an outcome in A. Similarly for the probability of B, P(B). If we consider only trials in which A occurs, the proportion in which B also occurs is P(B|A). If we consider only trials in which B occurs, the proportion in which A also occurs is P(A|B). Bayes' theorem is a fixed relationship between these quantities. This situation may be more fully visualized with tree diagrams, shown to the right. The two diagrams represent the same information in different ways. For example, suppose that A is having a risk factor for a medical condition, and B is having the condition. In a population, the proportion with the condition depends whether those with or without the risk factor are examined. The proportion having the risk factor depends whether those with or without the condition is examined. Bayes' theorem links these inverse representations. Bayesian forms: Simple form For events A and B, provided that P(B) ≠ 0. 148 In a Bayesian inference step, the probability of evidence B is constant for all models An. The posterior may then be expressed as proportional to the numerator: Extended form Often, for some partition of the event space {Ai}, the event space is given or conceptualized in terms of P(Ai) and P(B|Ai). It is then useful to eliminate P(B) using the law of total probability: In the special case of a binary partition, Three or more events Extensions to Bayes' theorem may be found for three or more events. For example, for three events, two possible tree diagrams branch in the order BCA and ABC. By repeatedly applying the definition of conditional probability: As previously, the law of total probability may be substituted for unknown marginal probabilities. 149 For random variables Figure 13.1 Diagram illustrating the meaning of Bayes' theorem as applied to an event space generated by continuous random variables X and Y. Note that there exists an instance of Bayes' theorem for each point in the domain. In practise, these instances might be parameterised by writing the specified probability densities as a function of x and y. Consider a sample space Ω generated by two random variables X and Y. In principle, Bayes' theorem applies to the events A = {X=x} and B = {Y=y}. However, terms become 0 at points where either variable has finite probability density. To remain useful, Bayes' theorem may be formulated in terms of the relevant densities (see Derivation). Simple form If X is continuous and Y is discrete, If X is discrete and Y is continuous, If both X and Y are continuous, 150 Extended form A continuous event space is often conceptualized in terms of the numerator terms. It is then useful to eliminate the denominator using the law of total probability. For fy(Y), this becomes an integral: Bayes' rule Under the Bayesian interpretation of probability, Bayes' rule may be thought of as Bayes' theorem in odds form. Where Derivation of Bayes Theorem: For general events: Bayes' theorem may be derived from the definition of conditional probability: For random variables: For two continuous random variables X and Y, Bayes' theorem may be analogously derived from the definition of conditional density: 151 8.5. 
Examples and Applications (a) Frequentist example An entomologist spots what might be a rare subspecies of beetle, due to the pattern on its back. In the rare subspecies, 98% have the pattern. In the common subspecies, 5% have the pattern. The rare subspecies accounts for only 0.1% of the population. How likely is the beetle to be rare? From the extended form of Bayes' theorem, (b) Drug testing Suppose a drug test is 99% sensitive and 99% specific. That is, the test will produce 99% true positive results for drug users and 99% true negative results for non-drug users. Suppose that 0.5% of people are users of the drug. If a randomly selected individual tests positive, what is the probability he or she is a user? 152 Despite the apparent accuracy of the test, if an individual tests positive, it is more likely that they do not use the drug than that they do. This surprising result arises because the number of non-users is very large compared to the number of users, such that the number of false positives (0.995%) outweighs the number of true positives (0.495%). To use concrete numbers, if 1000 individuals are tested, there are expected to be 995 non-users and 5 users. From the 995 non-users, positives are expected. From the 5 users, false true positives are expected. Out of 15 positive results, only 5, about 33%, are genuine. Applications: Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever growing connection between Bayesian methods and simulation-based Monte Carlo techniques since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like the Gibbs sampling and other Metropolis–Hastings algorithm schemes. Recently Bayesian inference has gained popularity amongst the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously. In the areas of population genetics and dynamical systems theory, approximate Bayesian computation (ABC) is also becoming increasingly popular. As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include DSPAM, Bogofilter, SpamAssassin, SpamBayes, and Mozilla. Spam classification is treated in more detail in the article on the naive Bayes classifier. Solomonoff's Inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It combines two well-studied principles of inductive inference: Bayesian statistics and Occam‘s razor. Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all 153 programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. 
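Returning to the drug-testing example earlier in this section, the calculation can be checked directly. The sketch below uses only the numbers given in the text (99% sensitivity, 99% specificity, 0.5% prevalence).

# P(user | positive) via Bayes' theorem for the drug-testing example.
sensitivity = 0.99   # P(positive | user)
specificity = 0.99   # P(negative | non-user), so P(positive | non-user) = 0.01
prevalence  = 0.005  # P(user)

p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_user_given_pos = sensitivity * prevalence / p_pos
print(round(p_user_given_pos, 3))   # ~0.332: only about a third of positives are users

# Per 1000 people tested: roughly 5 users (about 5 true positives) and 995 non-users
# (about 10 false positives), which matches the "about 33% genuine" figure in the text.
print(1000 * prevalence * sensitivity, 1000 * (1 - prevalence) * (1 - specificity))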
8.6 Advantages and Disadvantages of Bayesian Methods The Bayesian methods have a number of advantages that indicates their suitability in uncertainty management. Most significant is their sound theoretical foundation in probability theory. Thus, they are currently the most mature of all of the uncertainty reasoning methods. While Bayesian methods are more developed than the other uncertainty methods, they are not without faults. 1. They require a significant amount of probability data to construct a knowledge base. Furthermore, human experts are normally uncertain and uncomfortable about the probabilities they are providing. 2. What are the relevant prior and conditional probabilities based on? If they are statistically based, the sample sizes must be sufficient so the probabilities obtained are accurate. If human experts have provided the values, are the values consistent and comprehensive? 3. Often the type of relationship between the hypothesis and evidence is important in determining how the uncertainty will be managed. Reducing these associations to simple numbers removes relevant information that might be needed for successful reasoning about the uncertainties. For example, Bayesian-based medical diagnostic systems have failed to gain acceptance because physicians distrust systems that cannot provide explanations describing how a conclusion was reached (a feature difficult to provide in a Bayesian-based system). 4. The reduction of the associations to numbers also eliminated using this knowledge within other tasks. For example, the associations that would enable the system to explain its reasoning to a user are lost, as is the ability to browse through the hierarchy of evidences to hypotheses. 154 8.7 BAYESIAN CLASSIFIER A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix. Probabilistic model Abstractly, the probability model for a classifier is a conditional model. 
p(C | F1, ..., Fn), over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. The problem is that if the number of features is large, or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, this can be written as

p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)

In plain English, the above equation can be read as

posterior = (prior × likelihood) / evidence

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C and the values of the features Fi are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, ..., Fn), which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C, F1) p(F3 | C, F1, F2) ... p(Fn | C, F1, ..., Fn-1)

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j ≠ i, given the category C. This means that

p(Fi | C, Fj) = p(Fi | C), p(Fi | C, Fj, Fk) = p(Fi | C), p(Fi | C, Fj, Fk, Fl) = p(Fi | C), and so on,

for i ≠ j, k, l, and so the joint model can be expressed as

p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... p(Fn | C)

This means that under the above independence assumptions, the conditional distribution over the class variable C is

p(C | F1, ..., Fn) = (1/Z) p(C) p(F1 | C) p(F2 | C) ... p(Fn | C)

where Z (the evidence) is a scaling factor dependent only on F1, ..., Fn, that is, a constant if the values of the feature variables are known.

Bayes' Theorem

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of Rs. 40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns Rs. 40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns Rs. 40,000. "How are these probabilities estimated?" P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below.
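As a small illustration of how P(H|X) can be estimated from counted data, the sketch below builds a toy naive Bayes classifier for the buys-computer example. The training tuples, the attribute values, and the Laplace-style smoothing are invented for this example only; this is not the textbook's data set, just a sketch of the technique under those assumptions.

from collections import Counter, defaultdict

# Toy training tuples: (age, income, buys_computer). All values are invented.
data = [
    ("youth", "high", "no"), ("youth", "medium", "no"), ("middle", "high", "yes"),
    ("senior", "medium", "yes"), ("senior", "low", "yes"), ("senior", "low", "no"),
    ("middle", "low", "yes"), ("youth", "medium", "yes"), ("middle", "medium", "yes"),
]

class_counts = Counter(row[-1] for row in data)            # for the priors P(H)
attr_values = [{row[i] for row in data} for i in range(2)]  # distinct values per attribute
cond_counts = defaultdict(Counter)                          # (attribute index, class) -> value counts
for age, income, label in data:
    cond_counts[(0, label)][age] += 1
    cond_counts[(1, label)][income] += 1

def posterior(x):
    """P(H|X) for each class H, using Bayes' theorem with the naive independence
    assumption P(X|H) = product of P(xi|H), and Laplace smoothing of the estimates."""
    scores = {}
    for label, n in class_counts.items():
        p = n / len(data)                                   # prior P(H)
        for i, value in enumerate(x):
            p *= (cond_counts[(i, label)][value] + 1) / (n + len(attr_values[i]))
        scores[label] = p                                   # numerator P(X|H) P(H)
    z = sum(scores.values())                                # the evidence P(X)
    return {label: s / z for label, s in scores.items()}

print(posterior(("youth", "medium")))   # e.g. a young, medium-income customer

The priors and conditional probabilities are simply relative frequencies read off the training tuples, which is exactly the sense in which P(H), P(X|H), and P(X) are "estimated from the given data."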
Bayes‘ theorem is useful in that it provides a way of calculating the posterior probability, P(HjX), from P(H), P(XjH), and P(X). 8.8 CLASSIFICATION BY DECISION TREE INDUCTION Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. More descriptive 157 names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. For example, we might have a decision tree to help a financial institution decide whether a person should be offered a loan: We wish to be able to induce a decision tree from a set of data about instances together with the decisions or classifications for those instances. Example Instance Data size: small medium large colour: red blue green shape: brick wedge sphere pillar %% yes medium blue brick small red sphere large green pillar large green sphere small red wedge large red wedge large red pillar %% no • In this example, there are 7 instances, described in terms of three features or attributes (size, colour, and shape), and the instances are classified into two classes %% yes and %% no. 158 • We shall now describe an algorithm for inducing a decision tree from such a collection of classified instances. • Originally termed CLS (Concept Learning System) it has been successively enhanced. Tree Induction Algorithm • The algorithm operates over a set of training instances, C. • If all instances in C are in class P, create a node P and stop, otherwise select a feature or attribute F and create a decision node. • Partition the training instances in C into subsets according to the values of V. • Apply the algorithm recursively to each of the subsets C. Output of Tree Induction Algorithm This can easily be expressed as a nested if-statement if (shape == wedge) return no; if (shape == brick) return yes; if (shape == pillar) { if (colour == red) return no; if (colour == green) return yes; } if (shape == sphere) return yes; 159 Classification by Decision Tree Induction Decision tree induction is the learning of decision trees from class-labelled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf Node (or terminal node) holds a class label. The topmost node in a tree is the root node. A typical decision tree is shown above. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce non binary trees. ―How are decision trees used for classification?‖ Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules. ―Why are decision tree classifiers so popular?‖ The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle high dimensional data. 
Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy. However, successful use may depend on the data at hand. Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology. Decision trees are the basis of several commercial rule induction systems.

Decision Tree Algorithm:

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3)   return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5)   return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the "best" splitting criterion;
(7) label node N with the splitting criterion;
(8) if the splitting attribute is discrete-valued and multiway splits are allowed then // not restricted to binary trees
(9)   attribute list = attribute list - splitting attribute; // remove the splitting attribute
(10) for each outcome j of the splitting criterion // partition the tuples and grow subtrees for each partition
(11)   let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12)   if Dj is empty then
(13)     attach a leaf labeled with the majority class in D to node N;
(14)   else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
     endfor
(15) return N;

The algorithm starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built. A basic decision tree algorithm is summarized above. The strategy is as follows. The algorithm is called with three parameters: D, attribute list, and Attribute selection method. We refer to D as a data partition. Initially, it is the complete set of training tuples and their associated class labels. The parameter attribute list is a list of attributes describing the tuples. Attribute selection method specifies a heuristic procedure for selecting the attribute that "best" discriminates the given tuples according to class. This procedure employs an attribute selection measure, such as information gain or the Gini index. Whether the tree is strictly binary is generally driven by the attribute selection measure. Some attribute selection measures, such as the Gini index, force the resulting tree to be binary. Others, like information gain, do not, thereby allowing multiway splits (i.e., two or more branches to be grown from a node). The tree starts as a single node, N, representing the training tuples in D (step 1). If the tuples in D are all of the same class, then node N becomes a leaf and is labelled with that class (steps 2 and 3). Note that steps 4 and 5 are terminating conditions; a compact illustrative sketch of the recursive procedure is given below, after which the remaining steps are examined.
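The following Python sketch mirrors the recursive Generate decision tree procedure using information gain and multiway splits on discrete attributes. The helper names (entropy, best_attribute, generate_decision_tree) and the dictionary representation of tuples are choices made for this example rather than part of the textbook algorithm; it reuses the small size/colour/shape data set from the CLS example above.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (the basis of information gain)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attributes):
    """Pick the attribute with the highest information gain (the 'best' split)."""
    base = entropy([r["class"] for r in rows])
    def gain(a):
        split = 0.0
        for v in {r[a] for r in rows}:
            part = [r["class"] for r in rows if r[a] == v]
            split += len(part) / len(rows) * entropy(part)
        return base - split
    return max(attributes, key=gain)

def generate_decision_tree(rows, attributes):
    """Recursive procedure in the spirit of steps (1)-(15): a leaf on a pure class or an
    empty attribute list (majority voting), otherwise a multiway split on the best attribute."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1:                       # steps (2)-(3)
        return labels[0]
    if not attributes:                              # steps (4)-(5), majority voting
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, attributes)            # step (6)
    node = {"split_on": a, "branches": {}}          # step (7)
    remaining = [x for x in attributes if x != a]   # steps (8)-(9)
    for v in {r[a] for r in rows}:                  # steps (10)-(14)
        subset = [r for r in rows if r[a] == v]
        node["branches"][v] = generate_decision_tree(subset, remaining)
    return node                                     # step (15)

# The size/colour/shape instances from the CLS example above.
rows = [
    {"size": "medium", "colour": "blue",  "shape": "brick",  "class": "yes"},
    {"size": "small",  "colour": "red",   "shape": "sphere", "class": "yes"},
    {"size": "large",  "colour": "green", "shape": "pillar", "class": "yes"},
    {"size": "large",  "colour": "green", "shape": "sphere", "class": "yes"},
    {"size": "small",  "colour": "red",   "shape": "wedge",  "class": "no"},
    {"size": "large",  "colour": "red",   "shape": "wedge",  "class": "no"},
    {"size": "large",  "colour": "red",   "shape": "pillar", "class": "no"},
]
print(generate_decision_tree(rows, ["size", "colour", "shape"]))

Run on that data, the sketch recovers the same tree as the nested if-statement shown earlier: it splits on shape first and then on colour for the pillar branch.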
All of the terminating conditions are explained at the end of the algorithm. When the tuples in D are not all of the same class, the algorithm calls Attribute selection method to determine the splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes (step 6). The splitting criterion also tells us which branches to grow from node N with respect to the outcomes of the chosen test. More specifically, the splitting criterion indicates the splitting attribute and may also indicate either a split-point or a splitting subset. The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as "pure" as possible. A partition is pure if all of the tuples in it belong to the same class. In other words, if we were to split up the tuples in D according to the mutually exclusive outcomes of the splitting criterion, we hope for the resulting partitions to be as pure as possible. The node N is labelled with the splitting criterion, which serves as a test at the node (step 7). A branch is grown from node N for each of the outcomes of the splitting criterion. The tuples in D are partitioned accordingly (steps 10 to 11). There are three possible scenarios. Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av}, based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value, aj, of A and labelled with that value. Partition Dj is the subset of class-labelled tuples in D having value aj of A. Because all of the tuples in a given partition have the same value for A, A need not be considered in any future partitioning of the tuples. Therefore, it is removed from attribute list (steps 8 to 9).
2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, respectively, where split_point is the split-point returned by Attribute selection method as part of the splitting criterion. (In practice, the split-point, a, is often taken as the midpoint of two known adjacent values of A and therefore may not actually be a pre-existing value of A from the training data.) Two branches are grown from N and labelled according to the above outcomes. The tuples are partitioned such that D1 holds the subset of class-labelled tuples in D for which A ≤ split_point, while D2 holds the rest.
3. A is discrete-valued and a binary tree must be produced (as dictated by the attribute selection measure or algorithm being used): The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset of the known values of A. If a given tuple has value aj of A and if aj ∈ SA, then the test at node N is satisfied. Two branches are grown from N. By convention, the left branch out of N is labelled yes so that D1 corresponds to the subset of class-labelled tuples in D that satisfy the test. The right branch out of N is labelled no so that D2 corresponds to the subset of class-labelled tuples from D that do not satisfy the test.
The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D (step 14). The recursive partitioning stops only when any one of the following terminating conditions is true:
1.
All of the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3), or 2. There are no remaining attributes on which the tuples may be further partitioned (step 4). In this case, majority voting is employed (step 5). This involves converting node N into a leaf and labelling it with the most common class in D. Alternatively, the class distribution of the node tuples may be stored. 3. There are no tuples for a given branch, that is, a partition Dj is empty (step12). In this case, a leaf is created with the majority class in D (step 13). The resulting decision tree is returned (step 15). The computational complexity of the algorithm given training set D is O(n_jDj_log(jDj)), where n is the number of attributes describing the tuples in D and jDj is the number of training tuples in D. This means that the computational cost of growing a tree grows at most n_jDj_log(jDj) with jDj tuples. 163 Tree Pruning When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of over fitting the data. Such methods typically use statistical measures to remove the least reliable branches. Pruned trees tend to be smaller and less complex and, thus, easier to comprehend. They are usually faster and better at correctly classifying independent test data (i.e., of previously unseen tuples) than un-pruned trees. ―How does tree pruning work?‖ There are two common approaches to tree pruning: pre-pruning and post-pruning. In the pre-pruning approach, a tree is ―pruned‖ by halting its construction early (e.g.,by deciding not to further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples. When constructing a tree, measures such as statistical significance, information gain, Gini index, and so on can be used to assess the goodness of a split. If partitioning the tuples at a node would result in a split that falls below a pre-specified threshold, then further partitioning of the given subset is halted. There are difficulties, however, in choosing an appropriate threshold. High thresholds could result in oversimplified trees, whereas low thresholds could result in very little simplification. The second and more common approach is post-pruning, which removes sub-trees from a ―fully grown‖ tree. A sub-tree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labelled with the most frequent class among the sub-tree being replaced. In the pruned version of the tree, the sub-tree in question is pruned by replacing it with the leaf ―class B.‖ This approach considers the cost complexity of a tree to be a function of the number of leaves in the tree and the error rate of the tree (where the error rate is the percentage of tuples misclassified by the tree). It starts from the bottom of the tree. For each internal node, N, it computes the cost complexity of the sub-tree at N, and the cost complexity of the sub-tree at N if it were to be pruned (i.e., replaced by a leaf node). The two values are compared. If pruning the sub-tree at node N would result in a smaller cost complexity, then the sub-tree is pruned. Otherwise, it is kept. A pruning set of class-labelled tuples is used to estimate cost complexity. 
This set is independent of the training set used to build the unpruned tree and of any test set used for accuracy estimation.

8.9 Rule-Based Classification

IF-THEN Rules: A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• In the antecedent part, the condition consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of the class prediction.
We can also write rule R1 as follows:
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction: Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree. Points to remember when extracting rules from a decision tree:
• One rule is created for each path from the root to a leaf node.
• To form the rule antecedent, each splitting criterion along the path is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.

Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form IF condition THEN conclusion. An example is rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes.
The "IF" part (or left-hand side) of a rule is known as the rule antecedent or precondition. The "THEN" part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed. The rule's consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data set, D, let ncovers be the number of tuples covered by R, ncorrect be the number of tuples correctly classified by R, and |D| be the number of tuples in D. We can define the coverage and accuracy of R as
coverage(R) = ncovers / |D|
accuracy(R) = ncorrect / ncovers
That is, a rule's coverage is the percentage of tuples that are covered by the rule (i.e., whose attribute values hold true for the rule's antecedent). For a rule's accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.

Rule Extraction from a Decision Tree
We learned how to build a decision tree classifier from a set of training data. Decision tree classifiers are a popular method of classification: it is easy to understand how decision trees work, and they are known for their accuracy. Decision trees can, however, become large and difficult to interpret. In this subsection, we look at how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
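Before turning to rule extraction, the coverage and accuracy definitions above can be made concrete with a small Python sketch that evaluates rule R1 on a handful of tuples. The tuple values and the helper name r1_antecedent are invented for the example only.

# Toy class-labeled tuples (invented) with attributes age, student, and the class label.
D = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "yes", "buys_computer": "no"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "middle", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
]

def r1_antecedent(x):
    """Antecedent of R1: IF age = youth AND student = yes."""
    return x["age"] == "youth" and x["student"] == "yes"

covered = [x for x in D if r1_antecedent(x)]                    # tuples the rule covers
correct = [x for x in covered if x["buys_computer"] == "yes"]   # consequent also holds

coverage = len(covered) / len(D)          # coverage(R1) = ncovers / |D|
accuracy = len(correct) / len(covered)    # accuracy(R1) = ncorrect / ncovers
print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")  # 0.40 and 0.50 here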
In comparison with a decision tree, the IF-THEN rules may be easier for humans to understand, particularly if the decision tree is very large. To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent (―IF‖ part). The leaf node holds the class prediction, forming the rule consequent (―THEN‖ part). Rule Induction Using a Sequential Covering Algorithm IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially (one at a time), where each rule for a given 166 class will ideally cover many of the tuples of that class (and hopefully none of the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules, and form the topic of this subsection. Note that in a newer alternative approach, classification rules can be generated using associative classification algorithms, which search for attribute-value pairs that occur frequently in the data. These pairs may form association rules, which can be analyzed and used in classification. Since this latter approach is based on association rule mining , we prefer to defer its treatment until later, There are many sequential covering algorithms. Popular variations include AQ, CN2, and the more recent, RIPPER. The general strategy is as follows. Rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This sequential learning of rules is in contrast to decision tree induction. Because the path to each leaf in a decision tree corresponds to a rule, we can consider decision tree induction as learning a set of rules simultaneously. A basic sequential covering algorithm Here, rules are learned for one class at a time. Ideally, when learning a rule for a class, Ci, we would like the rule to cover all (or many) of the training tuples of class C and none (or few) of the tuples from other classes. In this way, the rules learned should be of high accuracy. The rules need not necessarily be of high coverage. This is because we can have more than one rule for a class, so that different rules may cover different tuples within the same class. The process continues until the terminating condition is met, such as when there are no more training tuples or the quality of a rule returned is below a user-specified threshold. The Learn One Rule procedure finds the ―best‖ rule for the current class, given the current set of training tuples. Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification. Input: D, a data set class-labeled tuples; Att vals, the set of all attributes and their possible values. Output: A set of IF-THEN rules. Method: (1) Rule set = fg; // initial set of rules learned is empty (2) for each class c do (3) repeat (4) Rule = Learn One Rule(D, Att vals, c); 167 (5) remove tuples covered by Rule from D; (6) until terminating condition; (7) Rule set = Rule set +Rule; // add new rule to rule set (8) endfor (9) return Rule Set; ―How are rules learned?‖ Typically, rules are grown in a general-to-specific manner We can think of this as a beam search, where we start off with an empty rule and then gradually keep appending attribute tests to it. 
We append by adding the attribute test as a logical conjunct to the existing condition of the rule antecedent. Suppose our training set, D, consists of loan application data. Attributes regarding each applicant include their age, income, education level, residence, credit rating, and the term of the loan. The classifying attribute is loan decision, which indicates whether a loan is accepted (considered safe) or rejected (considered risky). To learn a rule for the class ―accept,‖ we start off with the most general rule possible, that is, the condition of the rule antecedent is empty. The rule is: IF THEN loan decision = accept. We then consider each possible attribute test that may be added to the rule. These can be derived from the parameter Att vals, which contains a list of attributes with their associated values. For example, for an attribute-value pair (att, val), we can consider attribute tests such as att = val, att _ val, att > val, and so on. Typically, the training data will contain many attributes, each of which may have several possible values. Finding an optimal rule set becomes computationally explosive. Instead, Learn One Rule adopts a greedy depth-first strategy. Each time it is faced with adding a new attribute test (conjunct) to the current rule, it picks the one that most improves the rule quality, based on the training samples. We will say more about rule quality measures in a minute. For the moment, let‘s say we use rule accuracy as our quality measure., suppose Learn One Rule finds that the attribute test income = high best improves the accuracy of our current (empty) rule. We append it to the condition, so that the current rule becomes IF income = high THEN loan decision = accept. Each time we add an attribute test to a rule, the resulting rule should cover more of the ―accept‖ tuples. During the next iteration, we again consider the possible attribute tests and end up selecting credit rating = excellent. Our current rule grows to become IF income = high AND credit rating = excellent THEN loan decision = accept. 168 The process repeats, where at each step, we continue to greedily grow rules until the resulting rule meets an acceptable quality level. Greedy search does not allow for backtracking. At each step, we heuristically add what appears to be the best choice at the moment. What if we unknowingly made a poor choice along the way? To lessen the chance of this happening, instead of selecting the best attribute test to append to the current rule, we can select the best k attribute tests. In this way, we perform a beam search of width k wherein we maintain the k best candidates overall at each step, rather than a single best candidate. Rule Induction Using Sequential Covering Algorithm Sequential Covering Algorithm can be used to extract IF-THEN rules form the training data. We do not require to generate a decision tree first. In this algorithm each rule for a given class covers many of the tuples of that class. Some of the sequential Covering Algorithms are AQ, CN2, and RIPPER. As per the general strategy the rules are learned one at a time. For each time rules are learned, a tuple covered by the rule is removed and the process continues for rest of the tuples. This is because the path to each leaf in a decision tree corresponds to a rule. Note: The Decision tree induction can be considered as learning a set of rules simultaneously. The following is the sequential learning Algorithm where rules are learned for one class at a time. 
When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and no tuple from any other class.

Algorithm: Sequential Covering

Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
Rule_set = { }; // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
    until terminating condition;
    Rule_set = Rule_set + Rule; // add the new rule to the rule set
end for
return Rule_set;

Rule Pruning

A rule is pruned for the following reasons:
• The assessments of quality are made on the original set of training data. The rule may therefore perform well on the training data but less well on subsequent data. That is why rule pruning is required.
• The rule is pruned by removing a conjunct. A rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively. Note: this value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.

Learn_One_Rule does not employ a test set when evaluating rules. Assessments of rule quality as described above are made with tuples from the original training data. Such assessment is optimistic because the rules will likely overfit the data. That is, the rules may perform well on the training data, but less well on subsequent data. To compensate for this, we can prune the rules. A rule is pruned by removing a conjunct (attribute test). We choose to prune a rule, R, if the pruned version of R has greater quality, as assessed on an independent set of tuples. As in decision tree pruning, we refer to this set as a pruning set. Various pruning strategies can be used, such as the pessimistic pruning approach described in the previous section. FOIL uses a simple yet effective method.

8.10 Summary

The significance of Bayes theorem in classification is discussed in detail with suitable examples. We have also discussed the advantages and disadvantages of Bayesian methods. The basics of decision trees and their application in classification are presented in brief. Rule-based classification is discussed in detail.

8.11 Keywords

Bayesian Classification, Decision Tree Induction, Rule-Based Classification.

8.12 Exercises

1. Give the statement of Bayes theorem.
2. Discuss the significance of Bayes theorem in classification.
3. List and explain the applications of Bayes theorem.
4. What are the advantages and disadvantages of Bayes theorem?
5. Explain classification by decision tree induction.
6. Explain rule-based classification.
7. Write short notes on pruning.
8. Suppose a drug test is 88% sensitive and 88% specific. That is, the test will produce 88% true positive results for drug users and 88% true negative results for non-drug users. Suppose that 0.6% of people are users of the drug. If a randomly selected individual tests positive, what is the probability he or she is a user?

8.13 References

1. Introduction to Data Mining with Case Studies, by Gupta G. K.
2.
Applications Of Data Mining by T Sudha, M Usha Rani 171 Unit 9: CLASSIFICATION TECHNIQUES Structure 9.1 Objectives 9.2 Introduction 9.3 Classification by Back propagation, 9.4 Support Vector Machines, 9.5 Associative Classification, 9.6 Decision Trees, 9.7 Lazy Learners (K-NN). 9.8 Summary 9.9 Keywords 9.10 Exercises 9.11 References 9.1 Objectives The objectives covered under this unit include: Classification by Back propagation Support Vector Machines Associative Classification Decision Trees Lazy Learners 9.2 Introduction Back propagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks. From a desired output, the network learns from many inputs, similar to the way a child learns to identify a dog from examples of dogs. 172 Neurocomputing is computer modeling based, in part, upon simulation of the structure and function of the brain. Neural networks excel in pattern recognition, that is, the ability to recognize a set of previously learned data. Although their use is rapidly growing in engineering, they are new to the pharmaceutical community Although the long-term goal of the neural-network community remains the design of autonomous machine intelligence, the main modern application of artificial neural networks is in the field of pattern recognition. In the sub-field of data classification, neural-network methods have been found to be useful alternatives to statistical techniques such as those which involve regression analysis or probability density estimation. The potential utility of neural networks in the classification of multisource satellite-imagery databases has been recognized for well over a decade, and today neural networks are an established tool in the field of remote sensing. The most widely applied neural network algorithm in image classification remains the feed forward back propagation algorithm. A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyper plane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyper plane which categorizes new examples. A Decision Tree takes as input an object given by a set of properties, output a Boolean value (yes/no decision). Each internal node in the tree corresponds to test of one of the properties. Branches are labelled with the possible values of the test. Lazy learners store training examples and delay the processing (―lazy evaluation‖) until a new instance must be classified. Imagine a contrasting lazy approach, in which the learner instead waits until the last minute before doing any model construction to classify a given test tuple. That is, when given a training tuple, a lazy learner simply stores it(or does only a little minor pro-cessing) and waits until it is given a test tuple. Only when it sees the test tuple does it perform generalization to classify the tuple based on its similarity to the stored training tuples. Unlike eager learning methods, lazy learners do less work when a training tuple is presented and more work when making a classification or numeric prediction. Because lazy learners store the training tuples or ―instances,‖ they are also referred to as instance-based learners, even though all learning is essentially based on instances. 173 9.3 Classification by Back propagation Back propagation is a neural network learning algorithm. 
Psychologists originally kindled the field of neural networks and neurobiologists who sought to develop and test computational analogues of neurons. Roughly speaking, a neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more suitable for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or ―structure.‖ Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining. Advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In Addition, several algorithms have recently have been developed for the extraction of rules from trained neural networks. These factors contribute towards the usefulness of neural networks for classification in data mining. The most popular neural network algorithm is the back propagation algorithm, Proposed in the 1980‘s. A Multilayer Feed-Forward Neural Network The back propagation algorithm performs learning on a multilayer fee-forward neural network. The inputs correspond to the attributes measured for each raining sample. The inputs are fed simultaneously into layer of units making up the input layer. The weighted outputs of these units are, in turn, fed simultaneously to a second layer of neuron like units, 174 known as a hidden layer. The hidden layers weighted outputs can be input to another hidden layer, and so on. The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network‘s prediction for given samples. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis, or as output units. Multilayer feed-forward networks of linear threshold functions, given enough hidden units, can closely approximate any function. Defining Network Topology Before training can begin, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer. Normalizing the input values for each attribute measured in the training samples will help speed up the learning phase. Typically, input values are normalized so as to fall between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one input unit per domain value. For example, if the domain of an attribute A is (a0,a1,a2) then we may assign three input units to represent A. That is, we may have, say, as input units. Each unit is initialized to 0. If A =a0, then it is set to 1. If A==a1 it is set to 1, and so on. One output unit may be used to represent two classes (where the value I represents one class, and the value 0 represents the other). 
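A minimal sketch of the input encoding described above (one input unit per discrete value, numeric attributes min-max normalized into [0, 1]), assuming NumPy is available; the attribute domain and the values below are illustrative assumptions:

import numpy as np

# Illustrative discrete attribute A with domain (a0, a1, a2): one input unit per value.
domain = ["a0", "a1", "a2"]

def encode_discrete(value, domain):
    # One-hot encoding: the unit for the observed value is set to 1, the rest to 0.
    units = np.zeros(len(domain))
    units[domain.index(value)] = 1.0
    return units

def normalize_numeric(values):
    # Min-max normalization of a numeric attribute into [0, 1] to speed up learning.
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

print(encode_discrete("a1", domain))        # [0. 1. 0.]
print(normalize_numeric([10, 20, 30, 40]))  # [0. 0.333 0.667 1.]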
If there are more than two classes, then one output unit per class is used. There are no clear rules as to the "best" number of hidden layer units. Network design is a trial-and-error process and may affect the accuracy of the resulting trained network. The initial values of the weights may also affect the resulting accuracy.

Back propagation

Back propagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The algorithm is summarized below.

Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers. Each training sample X is processed by the following steps.

Propagate the inputs forward: In this step, the net input and output of each unit in the hidden and output layers are computed. First, the training sample is fed to the input layer of the network. Note that for a unit in the input layer, its output is equal to its input. The net input to each unit in the hidden and output layers is computed as a linear combination of its inputs: each input connected to the unit is multiplied by its corresponding weight, and these products are summed. Given a unit j in a hidden or output layer, the net input to unit j is

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit. Each unit in the hidden and output layers takes its net input and then applies an activation function to it. The function symbolizes the activation of the neuron represented by the unit. The logistic, or sigmoid, function is used: the output of unit j is

Oj = 1 / (1 + e^(-Ij))

This function is also referred to as a squashing function, since it maps a large input domain onto the smaller range of 0 to 1. The logistic function is non-linear and differentiable, allowing the back propagation algorithm to model classification problems that are linearly inseparable.

Back propagate the error: The error is propagated backwards by updating the weights and biases to reflect the error of the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 - Oj)(Tj - Oj)

where Tj is the known target value of unit j. To compute the error of a hidden layer unit j, the weighted sum of the errors of the units connected to unit j in the next layer is considered. The error of a hidden layer unit j is

Errj = Oj (1 - Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k. The weights and biases are updated to reflect the propagated errors. Weights are updated by the following equations, where Δwij is the change in weight wij:

Δwij = (L) Errj Oi
wij = wij + Δwij

The variable L is the learning rate, a constant typically having a value between 0.0 and 1.0.
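The forward and backward passes above can be combined into one case update. The following NumPy sketch is illustrative only: the topology (3 input, 4 hidden, 1 output units), the learning rate, and the single training sample are assumptions chosen for the example.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1           # illustrative topology
L = 0.5                                    # learning rate

# Small random initial weights and biases, e.g. in [-0.5, 0.5].
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hidden));  b1 = rng.uniform(-0.5, 0.5, n_hidden)
W2 = rng.uniform(-0.5, 0.5, (n_hidden, n_out)); b2 = rng.uniform(-0.5, 0.5, n_out)

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

x = np.array([0.2, 0.7, 1.0])              # one training sample (inputs in [0, 1])
t = np.array([1.0])                         # its known class label

# Propagate the inputs forward: Ij = sum_i wij*Oi + theta_j, Oj = sigmoid(Ij).
O_hidden = sigmoid(x @ W1 + b1)
O_out    = sigmoid(O_hidden @ W2 + b2)

# Back propagate the error.
Err_out    = O_out * (1 - O_out) * (t - O_out)            # output-layer error
Err_hidden = O_hidden * (1 - O_hidden) * (W2 @ Err_out)   # hidden-layer error

# Case update of the weights and biases: delta_wij = L * Errj * Oi.
W2 += L * np.outer(O_hidden, Err_out);  b2 += L * Err_out
W1 += L * np.outer(x, Err_hidden);      b1 += L * Err_hidden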
Back propagation learns using a method of gradient descent to search for a set of weights that can model the given classification problem so as to minimize the mean squard distance between the networks class prediction and the actual class label of the samples. The learning rate helps to avoid getting stuck at a local minimum in decision space (i.e., where the weights appear to converge, but are not the optimum solution) and encourages finding the global minimum. If the learning rate is too small, then learning will occur at a very slow pace. If the learning rate is too large, then oscillation between. Inadequate solutions may occur. A rule of thumb is to set the learning rate to 1/f, where T is the number of iterations through the training set so far. Biases are updated by the following equations below, where Δ θj, is the change in Bias θj; Δ θj =(L) Errj θj= θj+ Δ θj Note that here we are updating the weights and biases after the presentation of each sample. This is referred to as case updating. Alternatively, the weight and bias Increments could be accumulated in variables, so that the weights and biases are updated after all of the samples in the training set have been presented. This latter strategy is called epoch updating, where one iteration through the training set is an epoch. In theory, the mathematical derivation of back propagation employs epoch updating, yet in practice, case updating is more common since it tends to yield more accurate results. Terminating condition Training stops when • All ΔWij in the previous epoch were so small as to be below some specified threshold, or 177 • The percentage of samples misclassified in the previous epoch is below some threshold, or • A pre-specified number of epochs have expired. In practice, several hundreds of thousands of epochs may be required before the weights will converge. 9.4 Support Vector Machines Support Vector Machines, is a promising new method for the classification of both linear and nonlinear data. In a nutshell, a support vector machine (or SVM) is an algorithm that works as follows. It uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new dimension, it searches for the linear optimal separating hyperplane (that is, a ―decision boundary‖ separating the tuples of one class from another). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors (―essential‖ training tuples) and margins (defined by the support vectors). ―Why SVMs have attracted a great deal of attention lately?‖ Although the training time of even the fastest SVMs can be extremely slow, they are highly accurate, owing to their ability to model complex nonlinear decision boundaries. They are much less prone to overfitting than othermethods. The support vectors found also provide a compact description of the learned model. SVMs can be used for prediction as well as classification. They have been applied to a number of areas, including handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests. The Case When the Data Are Linearly Separable To explain the mystery of SVMs, let‘s first look at the simplest case—a two-class problem where the classes are linearly separable. Let the data set D be given as (X1, y1), (X2, y2)... (X|D|, y|D|), where Xi is the set of training tuples with associated class labels, yi. 
Each yi can take one of two values, either+1 or-1 (i.e., yi є{+1, -1}), corresponding to the classes buys_computer= yes and buys_computer = no, respectively. To aid in visualization, consider an example based on two input attributes, A1 and A2, as shown in Figure. 178 From the graph, we see that the 2-D data are linearly separable (or ―linear,‖ for short) because a straight line can be drawn to separate all of the tuples of class +1 from all of the tuples of class -1. There are an infinite number of separating lines that could be drawn. Next to find the ―best‖ one, that is, one that (we hope) will have the minimum classification error on previously unseen tuples. How to find this best line? Note that if our data were 3-D (i.e.,with three attributes),we would want to find the best separating plane. Generalizing to n dimensions, we want to find the best hyperplane. We will use the term ―hyperplane‖ to refer to the decision boundary that we are seeking, regardless of the number of input attributes. So, in other words, how can we find the best hyperplane? An SVM approaches this problem by searching for the maximum marginal hyperplane. Consider Figure, which shows two possible separating hyperplanes and their associated margins. Before we get into the definition of margins, let‘s take an intuitive look at this figure. 179 Both hyper planes can correctly classify all of the given data tuples. Intuitively, however, the hyper plane with the larger margin to be more accurate at classifying future data tuples than the hyper plane with the smaller margin. This is why (during the learning or training phase), the SVM searches for the hyper plane with the largest margin, that is, the maximum marginal hyper plane (MMH). The associated margin gives the largest separation between classes. Getting to an informal definition of margin, say that the shortest distance from a hyper plane to one side of its margin is equal to the shortest distance from the hyper plane to the other side of its margin, where the ―sides‖ of the margin are parallel to the hyper plane. When dealing with the MMH, this distance is, in fact, the shortest distance from the MMH to the closest training tuple of either class. A separating hyper plane can be written as W.X+b = 0, Where W is a weight vector, namely, W = {w1, w2, ... , wn}, n is the number of attributes; and b is a scalar, often referred to as a bias. To aid in visualization, let‘s consider two input attributes, A1 and A2, as in Figure (b). Training tuples are 2-D, e.g., X = (x1, x2), where x1 and x2 are the values of attributes A1 and A2, respectively, for X. If we think of b as an additional weight, w0, we can rewrite the above separating hyperplane as w0+w1x1+w2x2 = 0, Thus, any point that lies above the separating hyperplane satisfies w0+w1x1+w2x2 > 0: Similarly, any point that lies below the separating hyperplane satisfies w0+w1x1+w2x2 < 0: 180 The weights can be adjusted so that the hyperplanes defining the ―sides‖ of the margin can be written as H1 : w0+w1x1+w2x2 ≥ 1 for yi = +1, and H2 : w0+w1x1+w2x2 ≤-1 for yi = -1: That is, any tuple that falls on or above H1 belongs to class +1, and any tuple that falls on or below H2 belongs to class-1. Combining the two inequalities of Equations in above H1 and H2. yi(w0+w1x1+w2x2) ≥1, ∀i. Any training tuples that fall on hyper planes H1 or H2 (i.e., the ―sides‖ defining the margin) satisfy Equation above and are called support vectors. That is, they are equally close to the (separating) MMH. 
Essentially, the support vectors are the most difficult tuples to classify and give the most information regarding classification. From the above, we can obtain a formula for the size of the maximal margin. The distance from the separating hyperplane to any point on H1 is 1/||W||, where ||W|| is the Euclidean norm of W, that is, √(W·W). (If W = {w1, w2, ..., wn}, then W·W = w1² + w2² + ... + wn².) By definition, this is equal to the distance from any point on H2 to the separating hyperplane. Therefore, the maximal margin is 2/||W||.

"So, how does an SVM find the MMH and the support vectors?" Using some "fancy math tricks," we can rewrite the equation above so that it becomes what is known as a constrained (convex) quadratic optimization problem, solved via a Lagrangian formulation. If the data are small (say, less than 2,000 training tuples), any optimization software package for solving constrained convex quadratic problems can be used to find the support vectors and MMH. For larger data, special and more efficient algorithms for training SVMs can be used instead. Once we have found the support vectors and MMH (note that the support vectors define the MMH!), we have a trained support vector machine. The MMH is a linear class boundary, and so the corresponding SVM can be used to classify linearly separable data. We refer to such a trained SVM as a linear SVM.

"Once we have a trained support vector machine, how do we use it to classify test (i.e., new) tuples?" Based on the Lagrangian formulation mentioned above, the MMH can be rewritten as the decision boundary

d(X^T) = Σ (i = 1 to l) yi αi Xi · X^T + b0

where yi is the class label of support vector Xi; X^T is a test tuple; αi and b0 are numeric parameters that were determined automatically by the optimization or SVM algorithm above; and l is the number of support vectors. Note that the αi are Lagrangian multipliers. For linearly separable data, the support vectors are a subset of the actual training tuples (although there will be a slight twist regarding this when dealing with nonlinearly separable data, as we shall see below).

Given a test tuple X^T, we plug it into the equation above and then check the sign of the result. This tells us on which side of the hyperplane the test tuple falls. If the sign is positive, then X^T falls on or above the MMH, and the SVM predicts that X^T belongs to class +1 (representing buys_computer = yes in our example). If the sign is negative, then X^T falls on or below the MMH and the class prediction is -1 (representing buys_computer = no). Notice that the Lagrangian formulation contains a dot product between support vector Xi and test tuple X^T. This will prove very useful for finding the MMH and support vectors for the case when the given data are nonlinearly separable, as described further below.

Before we move on to the nonlinear case, there are two more important things to note. First, the complexity of the learned classifier is characterized by the number of support vectors rather than the dimensionality of the data. Hence, SVMs tend to be less prone to overfitting than some other methods. The support vectors are the essential or critical training tuples: they lie closest to the decision boundary (MMH). If all other training tuples were removed and training were repeated, the same separating hyperplane would be found. Second, the number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality.
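A linear SVM of exactly this form can be trained with a standard library. Below is a minimal sketch assuming scikit-learn is available; the tiny 2-D training set with labels in {+1, -1} is an illustrative assumption. The support vectors, the weight vector W, and the bias b0 reported by the library correspond to the quantities discussed above.

import numpy as np
from sklearn.svm import SVC

# Illustrative, linearly separable 2-D data with labels yi in {+1, -1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # a large C approximates the hard-margin case
clf.fit(X, y)

print("support vectors:", clf.support_vectors_)   # the tuples that define the MMH
print("W:", clf.coef_, "b0:", clf.intercept_)

# Classify a new tuple by the sign of the decision function d(X^T).
X_test = np.array([[3.0, 3.0]])
print("d(X^T) =", clf.decision_function(X_test), "-> class", clf.predict(X_test))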
An SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high. 182 The Case When the Data Are Linearly Inseparable SVMs for classifying linearly separable data, but what if the data are not linearly separable, as in Figure. In such cases, no straight line can be found that would separate the classes. The linear SVMs we studied would not be able to find a feasible solution here. The approach described for linear SVMs can be extended to create nonlinear SVMs for the classification of linearly inseparable data (also called nonlinearly separable data, or nonlinear data, for short). Such SVMs are capable of finding nonlinear decision boundaries (i.e., nonlinear hypersurfaces) in input space. We obtain a nonlinear SVM by extending the approach for linear SVMs as follows. There are two main steps. In the first step, transform the original input data into a higher dimensional space using a nonlinear mapping. Several common nonlinear mappings can be used in this step, once the data have been transformed into the new higher space, the second step searches for a linear separating hyperplane in the new space. We again end up with a quadratic optimization problem that can be solved using the linear SVM formulation. The maximal marginal hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space. So that the linear decision hyperplane in the new (Z) space corresponds to a nonlinear second-order polynomial in the original 3-D input space, d Z = w1 x1 + w2 x2 + w3 x3 + w4 x1 2 + w5 x1 x2 + w6 x1 x3 + b w1 z1 + w2 z2 + w3 z3 + w4 z4 + w5 z5 + w6 z6 + b First, how do we choose the nonlinear mapping to a higher dimensional space? Second, the computation involved will be costly. Refer back to Equation d X T for the classification of a test tuple, X T . Given the test tuple, compute its dot product with every one of the support vectors. In training, we have to compute a similar dot product several times in order to find 183 the MMH. This is especially expensive. Hence, the dot product computation required is very heavy and costly. It so happens that in solving the quadratic optimization problem of the linear SVM(i.e., when searching for a linear SVM in the new higher dimensional space), the training tuples appear only in the form of dot products, ɸ(Xi). ɸ(Xj), where ɸ (X) is simply the nonlinear mapping function applied to transform the training tuples. Instead of computing the dot product on the transformed data tuples, it turns out that it is mathematically equivalent to instead apply a kernel function, K(Xi, Xj), to the original input data. That is, K(Xi, Xj) = ɸ (Xi) . ɸ (Xj). In other words, everywhere that ɸ(Xi). ɸ(Xj) appears in the training algorithm, we can replace it with K(Xi,Xj). In this way, all calculations are made in the original input space, which is of potentially much lower dimensionality. We can safely avoid the mapping—it turns out that we don‘t even have to know what the mapping is! We will talk more later about what kinds of functions can be used as kernel functions for this problem. Then proceed to find a maximal separating hyperplane. The procedure is similar to that described in Section above in linear separable, although it involves placing a user-specified upper bound, C, on the Lagrange multipliers, ai. This upper bound is best determined experimentally. 
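The identity K(Xi, Xj) = ɸ(Xi) · ɸ(Xj) can be checked numerically. The NumPy sketch below does so for a degree-2 polynomial kernel, (Xi · Xj + 1)^2, whose explicit mapping ɸ into a 6-dimensional feature space is written out by hand; the two sample vectors are illustrative assumptions.

import numpy as np

def poly_kernel(x, z, h=2):
    # Polynomial kernel of degree h, computed entirely in the original input space.
    return (np.dot(x, z) + 1.0) ** h

def phi(x):
    # Explicit degree-2 mapping of a 2-D input into 6-D feature space.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])      # illustrative tuples
z = np.array([3.0, 0.5])

# Same value either way: the kernel never needs phi to be computed explicitly.
print(poly_kernel(x, z))          # (x.z + 1)^2  -> 25.0
print(np.dot(phi(x), phi(z)))     # phi(x).phi(z) -> 25.0

In practice one simply passes the kernel choice and the upper bound C to a library implementation (for example, SVC(kernel="rbf", C=1.0) in scikit-learn) rather than ever constructing ɸ.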
"What are some of the kernel functions that could be used?" Properties of the kinds of kernel functions that could be used to replace the dot product scenario described above have been studied. Three admissible kernel functions are:

Polynomial kernel of degree h: K(Xi, Xj) = (Xi · Xj + 1)^h
Gaussian radial basis function kernel: K(Xi, Xj) = e^(−||Xi − Xj||² / (2σ²))
Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi · Xj − δ)

There are no golden rules for determining which admissible kernel will result in the most accurate SVM. In practice, the kernel chosen does not generally make a large difference in the resulting accuracy. SVM training always finds a global solution.

So far we have described linear and nonlinear SVMs for binary (i.e., two-class) classification. SVM classifiers can be combined for the multiclass case. A simple and effective approach, given m classes, trains m classifiers, one for each class (where classifier j learns to return a positive value for class j and a negative value for the rest). A test tuple is assigned the class corresponding to the largest positive distance.

Aside from classification, SVMs can also be designed for linear and nonlinear regression. Here, instead of learning to predict discrete class labels (like the yi ∈ {+1, -1} above), SVMs for regression attempt to learn the input-output relationship between input training tuples, Xi, and their corresponding continuous-valued outputs, yi ∈ R. An approach similar to SVMs for classification is followed, and additional user-specified parameters are required.

A major research goal regarding SVMs is to improve the speed of training and testing so that SVMs may become a more feasible option for very large data sets (e.g., of millions of support vectors). Other issues include determining the best kernel for a given data set and finding more efficient methods for the multiclass case.

9.5 Associative Classification

Association rule mining is an important and highly active area of data mining research. Recently, data mining techniques have been developed that apply concepts used in association rule mining to the problem of classification. Three methods are described here, in historical order. The first two, ARCS and associative classification, use association rules for classification. The third method, CAEP, mines "emerging patterns" that build on the concept of support used in mining associations.

The first method mines association rules based on clustering and then employs the rules for classification. ARCS, the Association Rule Clustering System, mines association rules of the form Aquan1 ∧ Aquan2 => Acat, where Aquan1 and Aquan2 are tests on quantitative attribute ranges (where the ranges are dynamically determined), and Acat assigns a class label for a categorical attribute from the given training data. Association rules are plotted on a 2-D grid. The algorithm scans the grid, searching for rectangular clusters of rules. In this way, adjacent ranges of the quantitative attributes occurring within a rule cluster may be combined. The clustered association rules generated by ARCS were empirically found to be slightly more accurate than C4.5 when there are outliers in the data. The accuracy of ARCS is related to the degree of discretization used. In terms of scalability, ARCS requires a constant amount of memory, regardless of the database size. C4.5 has exponentially higher execution times than ARCS, requiring the entire database, multiplied by some factor, to fit entirely in main memory.

The second method is referred to as associative classification.
It mines rules of the form condset => y, where condset is a set of items (or attribute-value pairs) and y is a class label. Rules that satisfy a pre-specified minimum support are frequent, where a rule has support s if s% of the samples in the given data set contain condset and belong to class y. A rule satisfying minimum confidence is called accurate, where a rule has confidence c if c% of the samples in the given data set that contain condset belong to class y. If a set of rules has the same condset, then the rule with the highest confidence is selected as the possible rule (PR) to represent the set.

The associative classification method consists of two steps. The first step finds the set of all PRs that are both frequent and accurate. It uses an iterative approach, where prior knowledge is used to prune the rule search. The second step uses a heuristic method to construct the classifier, where the discovered rules are organized according to decreasing precedence based on their confidence and support. The algorithm may require several passes over the data set, depending on the length of the longest rule found. When classifying a new sample, the first rule satisfying the sample is used to classify it. The classifier also contains a default rule, having lowest precedence, which specifies a default class for any new sample that is not satisfied by any other rule in the classifier. In general, the associative classification method was empirically found to be more accurate than C4.5 on several data sets. Each of the above two steps was shown to have linear scale-up.

The third method, CAEP (classification by aggregating emerging patterns), uses the notion of item set support to mine emerging patterns (EPs), which are used to construct a classifier. Roughly speaking, an EP is an item set (or set of items) whose support increases significantly from one class of data to another. The ratio of the two supports is called the growth rate of the EP. For example, suppose that we have a data set of customers with the classes buys_computer = "yes", or C1, and buys_computer = "no", or C2. The item set {age <= 30, student = "no"} is a typical EP, whose support increases from 0.2% in C1 to 57.6% in C2, at a growth rate of 57.6%/0.2% = 288. Note that an item is either a simple equality test on a categorical attribute, or a test of whether a quantitative attribute's value lies in an interval.
Each EP is a multiattribute test and can be very strong at differentiating instances of one class from another. For instance, if a new sample X contains the above EP, then with odds of 99l6% we can claim that X belongs to C2. In general, the differentiating power of an EP is roughly proportional to its growth rate and its support in the target class. For each class C, CAEP find EPs satisfying given support and growth rate thresholds, where growth rate computed with respect to the set of all non-C samples versus the target set of all C samples, ―Border based‖ algorithms can be used for this purpose. Where classifying a new sample, X, for each class C, the differentiating power of the EPs of class C that occur in X are aggregated to derive a score for C that is then normalized. The class with the largest normalized score determines the class label of X. CAEP has been found to be more accurate than C4.5 and association-based classification on several data sets. It also performs well on data sets where the mail class of interest is in the minority. It scales up on data volume and dimensionality. An alternative classifier, called the JEP-classifier, was proposed based on jumping emerging patterns (JEPs). A JEP is a special type of EP, defined as an itemset whose support increases abruptly from zero in one data set to nonzero in another data set. The two classifiers are considered complementary. 9.6 Decision Trees A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called ―root‖ that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attributes values. In the simplest and most frequent case, each test considers a single 187 attribute, such that the instance space is partitioned according to the attribute‘s value. In the case of numeric attributes, the condition refers to a range. Each leaf is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value. Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. Figure below describes a decision tree that reasons whether or not a potential customer will respond to a direct mailing. Internal nodes are represented as circles, whereas leaves are denoted as triangles. Note that this decision tree incorporates both nominal and numeric attributes. Given this classifier, the analyst can predict the response of a potential customer (by sorting it down the tree), and understand the behavioural characteristics of the entire potential customers population regarding direct mailing. Each node is labelled with the attribute it tests, and its branches are labelled with its corresponding values. In case of numeric attributes, decision trees can be geometrically interpreted as a collection of hyper planes, each orthogonal to one of the axes. Naturally, decision-makers prefer less complex decision trees, since they may be considered more comprehensible. 
Furthermore, the tree complexity has a crucial effect on its accuracy. The tree complexity is explicitly controlled by the stopping criteria used and the pruning method employed. Usually the tree complexity is measured by one of the following metrics: the total number of nodes, total number of leaves, tree depth and number of attributes used. Decision tree induction is closely related to rule induction. Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by conjoining the tests along the path to form the antecedent part, and taking the leaf‘s class prediction as the class value. 188 For example, one of the paths in Figure above can be transformed into the rule: ―If customer age is less than or equal to or equal to 30, and the gender of the customer is ―Male‖ – then the customer will respond to the mail‖. The resulting rule set can then be simplified to improve its comprehensibility to a human user, and possibly its accuracy. Algorithmic Framework for Decision Trees Decision tree inducers are algorithms that automatically construct a decision tree from a given dataset. Typically the goal is to find the optimal decision tree by minimizing the generalization error. However, other target functions can be also defined, for instance, minimizing the number of nodes or minimizing the average depth. Induction of an optimal decision tree from a given data is considered to be a hard task. It has been shown that finding a minimal decision tree consistent with the training set is NP–hard. Moreover, it has been shown that constructing a minimal binary tree with respect to the expected number of tests required for classifying an unseen instance is NP–complete. Even finding the minimal equivalent decision tree for a given decision tree or building the optimal decision tree from decision tables is known to be NP–hard. The above results indicate that using optimal decision tree algorithms is feasible only in small problems. Consequently, heuristics methods are required for solving the problem. Roughly speaking, these methods can be divided into two groups: top–down and bottom–up with clear preference in the literature to the first group. There are various top–down decision trees inducers such as ID3 in 1986, C4.5 in 1993, CART in 1984). Some consist of two conceptual phases: growing and pruning (C4.5 and CART). Other inducers perform only the growing phase. The selection of the most appropriate function is made according to some splitting measures. After the selection of an appropriate split, each node further subdivides the training set into smaller subsets, until no split gains sufficient splitting measure or a stopping criteria is satisfied. Uni-variate Splitting Criteria In most of the cases, the discrete splitting functions are univariate. Univariate means that an internal node is split according to the value of a single attribute. Consequently, the inducer searches for the best attribute upon which to split. There are various univariate criteria. These criteria can be characterized in different ways, such as: According to the origin of the 189 measure: information theory, dependence, and distance. According to the measure structure: impurity based criteria, normalized impurity based criteria and Binary criteria. The following section describes the most common criteria used in the problems. 
Impurity-based Criteria

Given a random variable x with k discrete values, distributed according to P = (p1, p2, ..., pk), an impurity measure is a function ɸ: [0, 1]^k → R that satisfies the following conditions:
• ɸ(P) ≥ 0
• ɸ(P) is minimum if ∃i such that component pi = 1.
• ɸ(P) is maximum if ∀i, 1 ≤ i ≤ k, pi = 1/k.
• ɸ(P) is symmetric with respect to the components of P.
• ɸ(P) is smooth (differentiable everywhere) in its range.

Note that if the probability vector has a component equal to 1 (the variable x gets only one value), then the variable is defined as pure. On the other hand, if all components are equal, the level of impurity reaches its maximum. Given a training set S, the probability vector of the target attribute y is defined as:

Py(S) = ( |σy=c1 S| / |S| , ... , |σy=c|dom(y)| S| / |S| )

The goodness-of-split due to a discrete attribute ai is defined as the reduction in impurity of the target attribute after partitioning S according to the values vi,j ∈ dom(ai):

Δɸ(ai, S) = ɸ(Py(S)) − Σ (j = 1 to |dom(ai)|) ( |σai=vi,j S| / |S| ) · ɸ(Py(σai=vi,j S))

Information Gain

Information gain is an impurity-based criterion that uses the entropy measure (originating from information theory) as the impurity measure.

InformationGain(ai, S) = Entropy(y, S) − Σ (vi,j ∈ dom(ai)) ( |σai=vi,j S| / |S| ) · Entropy(y, σai=vi,j S)

where

Entropy(y, S) = − Σ (cj ∈ dom(y)) ( |σy=cj S| / |S| ) · log2( |σy=cj S| / |S| )

Gini Index

The Gini index is an impurity-based criterion that measures the divergence between the probability distributions of the target attribute's values. The Gini index has been used in various works and is defined as:

Gini(y, S) = 1 − Σ (cj ∈ dom(y)) ( |σy=cj S| / |S| )²

Consequently, the evaluation criterion for selecting the attribute ai is defined as:

GiniGain(ai, S) = Gini(y, S) − Σ (vi,j ∈ dom(ai)) ( |σai=vi,j S| / |S| ) · Gini(y, σai=vi,j S)

Gain Ratio

The gain ratio "normalizes" the information gain as follows:

GainRatio(ai, S) = InformationGain(ai, S) / Entropy(ai, S)

Note that this ratio is not defined when the denominator is zero. Also, the ratio may tend to favour attributes for which the denominator is very small. Consequently, it is suggested to carry out the selection in two stages: first, the information gain is calculated for all attributes; then, considering only attributes that have performed at least as well as the average information gain, the attribute that has obtained the best gain ratio is selected. It has been shown that the gain ratio tends to outperform simple information gain criteria, both in terms of accuracy and in terms of classifier complexity.

Multivariate Splitting Criteria

In multivariate splitting criteria, several attributes may participate in a single node split test. Obviously, finding the best multivariate criterion is more complicated than finding the best univariate split. Furthermore, although this type of criterion may dramatically improve the tree's performance, these criteria are much less popular than the univariate criteria. Most of the multivariate splitting criteria are based on a linear combination of the input attributes. Finding the best linear combination can be performed using greedy search, linear programming, or linear discriminant analysis.

Stopping Criteria

The growing phase continues until a stopping criterion is triggered. The following conditions are common stopping rules:
• All instances in the training set belong to a single value of y.
• The maximum tree depth has been reached.
• The number of cases in the terminal node is less than the minimum number of cases for parent nodes.
• If the node were split, the number of cases in one or more child nodes would be less than the minimum number of cases for child nodes. • The best splitting criteria is not greater than a certain threshold. Pruning Methods Employing tightly stopping criteria tends to create small and under–fitted decision trees. On the other hand, using loosely stopping criteria tends to generate large decision trees that are over–fitted to the training set. Pruning methods were developed for solving this dilemma. According to this methodology, a loosely stopping criterion is used, letting the decision tree to over fit the training set. Then the over-fitted tree is cut back into a smaller tree by removing sub–branches that are not contributing to the generalization accuracy. Employing pruning methods can improve the generalization performance of a decision tree, especially in noisy domains. Another key motivation of pruning is ―trading accuracy for simplicity‖. When the goal is to produce a sufficiently accurate compact concept description, pruning is highly useful. Within this process, the initial decision tree is seen as a completely accurate one. Thus the accuracy of a pruned decision tree indicates how close it is to the initial tree. There are various techniques for pruning decision trees. Most of them perform top-down or bottom-up traversal of the nodes. A node is pruned if this operation improves a certain criteria. 192 Reduced Error Pruning A simple procedure for pruning decision trees is known as reduced error pruning. While traversing over the internal nodes from the bottom to the top, the procedure checks for each internal node, whether replacing it with the most frequent class does not reduce the tree‘s accuracy. In this case, the node is pruned. The procedure continues until any further pruning would decrease the accuracy. In order to estimate the accuracy, use a pruning set. It can be shown that this procedure ends with the smallest accurate sub– tree with respect to a given pruning set. Minimum Error Pruning (MEP) The minimum error pruning performs bottom–up traversal of the internal nodes. In each node it compares the l-probability error rate estimation with and without pruning. The l-probability error rate estimation is a correction to the simple probability estimation using frequencies. If St denotes the instances that have reached a leaf t, then the expected error rate in this leaf is: εˊ t = 1 − maxc i ∈dom (y) σy=c i St + l. papr(y = Ci ) St + l where Papr(y = ci) is the a–priori probability of y getting the value ci, and l denotes the weight given to the a–priori probability. The error rate of an internal node is the weighted average of the error rate of its branches. The weight is determined according to the proportion of instances along each branch. The calculation is performed recursively up to the leaves. If an internal node is pruned, then it becomes a leaf and its error rate is calculated directly using the last equation. Consequently, we can compare the error rate before and after pruning a certain internal node. If pruning this node does not increase the error rate, the pruning should be accepted. 9.7 Lazy Learners (or Learning from Your Neighbours) We can think of the learned model as being ready and eager to classify previously unseen tuples. Imagine a contrasting lazy approach, in which the learner instead waits until the last minute before doing any model construction in order to classify a given test tuple. 
That is, when given a training tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until it is given a test tuple. Only when it sees the test tuple does it perform generalization in order to classify the tuple based on its similarity to the stored 193 training tuples. Unlike eager learning methods, lazy learners do less work when a training tuple is presented and more work when making a classification or prediction. Because lazy learners store the training tuples or ―instances,‖ they are also referred to as instance based learners, even though all learning is essentially based on instances. When making a classification or prediction, lazy learners can be computationally expensive. They require efficient storage techniques and are well-suited to implementation on parallel hardware. They offer little explanation or insight into the structure of the data. Lazy learners, however, naturally support incremental learning. They are able to model complex decision spaces having hyperpolygonal shapes that may not be as easily describable by other learning algorithms (such as hyper-rectangular shapes modelled by decision trees). We look at one examples of lazy learners: k-nearestneighbor classifiers. k-Nearest-Neighbour Classifiers The k-nearest-neighbor method was first described in the early 1950s. The method is labor intensive when given large training sets, and did not gain popularity until the 1960s when increased computing power became available. It has since been widely used in the area of pattern recognition. Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k ―nearest neighbours‖ of the unknown tuple. ―Closeness‖ is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ... , x1n) and X2 = (x21, x22, : : : , x2n), is n (x1i − x2i)2 dist(X1, X2) = i=1 In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count. Typically, we normalize the 194 values of each attribute before using above Equation. This helps prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes). Min-max normalization, for example, can be used to transforma value v of a numeric attribute A to v ˊ in the range [0, 1] by computing. vˊ = v − minA maxA − minA where minA and maxA are the minimum and maximum values of attribute A. For k-nearest-neighbor classification, the unknown tuple is assigned the most common class among its k nearest neighbors. When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space. Nearest neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple. 
In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple. ―But how can distance be computed for attributes that not numeric, but categorical, such as color?‖ The above discussion assumes that the attributes used to describe the tuples are all numeric. For categorical attributes, a simple method is to compare the corresponding value of the attribute in tuple X1 with that in tuple X2. If the two are identical (e.g., tuples X1 and X2 both have the color blue), then the difference between the two is taken as 0. If the two are different (e.g., tuple X1 is blue but tuple X2 is red), then the difference is considered to be 1. Other methods may incorporate more sophisticated schemes for differential grading (e.g., where a larger difference score is assigned, say, for blue and white than for blue and black). ―What about missing values?‖ In general, if the value of a given attribute A is missing in tuple X1 and/or in tuple X2, we assume the maximum possible difference. Suppose that each of the attributes have been mapped to the range [0, 1]. For categorical attributes, we take the difference value to be 1 if either one or both of the corresponding values of A are missing. If A is numeric and missing from both tuples X1 and X2, then the difference is also taken to be 1. If only one value is missing and the other (which we‘ll call v ˊ ) is present and normalized, then we can take the difference to be either |1-v ˊ | or |0-v ˊ | (i.e., 1v ˊ or v ˊ ), whichever is greater. ―How can we determine a good value for k, the number of neighbours?‖ This can be determined experimentally. Starting with k = 1, we use a test set to estimate the error rate of the classifier. This process can be repeated each time by incrementing k to allow for one 195 more neighbor. The k value that gives the minimum error rate may be selected. In general, the larger the number of training tuples is, the larger the value of k will be (so that classification and prediction decisions can be based on a larger portion of the stored tuples). As the number of training tuples approaches infinity and k =1, the error rate can be no worse than twice the Bayes error rate (the latter being the theoretical minimum). If k also approaches infinity, the error rate approaches the Bayes error rate. Nearestneighbour classifiers use distance-based comparisons that intrinsically assign equal weight to each attribute. They therefore can suffer from poor accuracy when given noisy or irrelevant attributes. The method, however, has been modified to incorporate attribute weighting and the pruning of noisy data tuples. The choice of a distance metric can be critical. Nearest-neighbor classifiers can be extremely slow when classifying test tuples. If D is a training database of |D| tuples and k = 1, then O(|D|) comparisons are required in order to classify a given test tuple. By presorting and arranging the stored tuples into search trees, the number of comparisons can be reduced to O(log(|D|). Parallel implementation can reduce the running time to a constant, that is O(1), which is independent of |D|. Other techniques to speed up classification time include the use of partial distance calculations and editing the stored tuples. In the partial distance method, we compute the distance based on a subset of the n attributes. If this distance exceeds a threshold, then further computation for the given stored tuple is halted, and the process moves on to the next stored tuple. 
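Putting the above pieces together, a minimal k-nearest-neighbour classifier can be sketched as follows. It assumes NumPy; the tiny (age, income) training set, the class labels, and k = 3 are illustrative assumptions.

import numpy as np
from collections import Counter

def min_max_normalize(X):
    # Map each numeric attribute into [0, 1]: v' = (v - minA) / (maxA - minA).
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query tuple to every stored training tuple.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                   # the k closest training tuples
    votes = Counter(y_train[i] for i in nearest)      # majority vote among them
    return votes.most_common(1)[0][0]

# Illustrative data: columns are (age, income); the last row is the unknown tuple.
raw = np.array([[25, 30000], [30, 42000], [45, 90000], [50, 98000], [40, 80000]])
norm = min_max_normalize(raw)                         # normalize before computing distances
X_train, x_query = norm[:4], norm[4]
y_train = ["no", "no", "yes", "yes"]

print(knn_predict(X_train, y_train, x_query, k=3))    # -> "yes"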
The editing method removes training tuples that prove useless. This method is also referred to as pruning or condensing because it reduces the total number of tuples stored.

9.8 Summary

• Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks. Backpropagation is a neural network learning algorithm.
• Neurocomputing is computer modelling based, in part, upon simulation of the structure and function of the brain. Neural networks excel at pattern recognition.
• The backpropagation algorithm performs learning on a multilayer feed-forward neural network. The inputs correspond to the attributes measured for each training sample. The inputs are fed simultaneously into the units making up the input layer.
• Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class.
• Support Vector Machines constitute a promising method for the classification of both linear and nonlinear data.

9.9 Keywords

Support Vector Machines, Back-propagation, Decision trees.

9.10 Exercises

1. Explain classification by back propagation.
2. Discuss associative classification in brief.
3. Explain decision trees.
4. What are lazy learners? Explain.

9.11 References

1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann, Second Edition, 2006.
2. Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison-Wesley, 2005 (ISBN 0321321367).
3. Data Mining Techniques by Arun K Pujari, Universities Press, Second Edition, 2009.

Unit 10: Genetic Algorithms, Rough Set and Fuzzy Sets

Structure
10.1 Objectives
10.2 Introduction
10.3 Genetic Algorithms
10.4 Rough Set Approach
10.5 Fuzzy Set Approach
10.6 Summary
10.7 Keywords
10.8 Exercises
10.9 References

10.1 Objectives

The objectives covered under this unit include basic concepts about:
Genetic algorithms for data mining
Rough set approach based data mining
Fuzzy set approach based data mining.

10.2 Introduction

In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. This heuristic (also sometimes called a meta-heuristic) is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosome or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible. The evolution usually starts from a population of randomly generated individuals and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved.
The fitter individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.

A rough set, first described by the Polish computer scientist Zdzisław I. Pawlak, is a formal approximation of a crisp set (i.e., a conventional set) in terms of a pair of sets which give the lower and the upper approximation of the original set. In the standard version of rough set theory, the lower- and upper-approximation sets are crisp sets, but in other variations the approximating sets may be fuzzy sets. The following section contains an overview of the basic framework of rough set theory, as originally proposed by Zdzisław I. Pawlak, along with some of the key definitions. The initial and basic theory of rough sets is sometimes referred to as "Pawlak rough sets" or "classical rough sets", as a means to distinguish it from more recent extensions and generalizations.

Information system framework

Let I = (U, A) be an information system (attribute-value system), where U is a non-empty, finite set of objects (the universe) and A is a non-empty, finite set of attributes such that a : U → V_a for every a ∈ A. Here V_a is the set of values that attribute a may take; the information table assigns a value a(x) from V_a to each attribute a and object x in the universe U. With any P ⊆ A there is an associated equivalence relation IND(P):

IND(P) = {(x, y) ∈ U × U | a(x) = a(y) for every a ∈ P}

The relation IND(P) is called a P-indiscernibility relation. The partition of U into the equivalence classes of IND(P) is denoted by U/IND(P), or simply U/P. If (x, y) ∈ IND(P), then x and y are indiscernible (or indistinguishable) by the attributes from P.

Definition of a rough set

Let X ⊆ U be a target set that we wish to represent using an attribute subset P; that is, we are told that an arbitrary set of objects X comprises a single class, and we wish to express this class (i.e., this subset) using the equivalence classes induced by attribute subset P. In general, X cannot be expressed exactly, because the set may include and exclude objects which are indistinguishable on the basis of the attributes in P. For example, if objects O3, O7 and O10 fall into the same equivalence class of U/P, then there is no way to represent any set X which includes O3 but excludes O7 and O10. However, the target set X can be approximated using only the information contained within P by constructing the P-lower and P-upper approximations of X.

Lower approximation and positive region

The P-lower approximation, or positive region, P_*(X) = {x | [x]_P ⊆ X}, is the union of all equivalence classes in U/P which are contained by (i.e., are subsets of) the target set. The lower approximation is the complete set of objects in U/P that can be positively (i.e., unambiguously) classified as belonging to the target set X.

Upper approximation and negative region

The P-upper approximation, P^*(X) = {x | [x]_P ∩ X ≠ ∅}, is the union of all equivalence classes in U/P which have non-empty intersection with the target set X.
The upper approximation is the complete set of objects in U/P that cannot be positively (i.e., unambiguously) classified as belonging to the complement of the target set X. In other words, the upper approximation is the complete set of objects that are possibly members of the target set X. The set U − P^*(X) therefore represents the negative region, containing the set of objects that can be definitely ruled out as members of the target set.

Boundary region

The boundary region, given by the set difference P^*(X) − P_*(X), consists of those objects that can neither be ruled in nor ruled out as members of the target set X. In summary, the lower approximation of a target set is a conservative approximation consisting of only those objects which can positively be identified as members of the set. (These objects have no indiscernible "clones" which are excluded by the target set.) The upper approximation is a liberal approximation which includes all objects that might be members of the target set. (Some objects in the upper approximation may not be members of the target set.) From the perspective of U/P, the lower approximation contains objects that are members of the target set with certainty (probability = 1), while the upper approximation contains objects that are members of the target set with non-zero probability (probability > 0).

The rough set

The tuple ⟨P_*(X), P^*(X)⟩ composed of the lower and upper approximations is called a rough set; thus, a rough set is composed of two crisp sets, one representing a lower boundary of the target set X and the other representing an upper boundary of the target set X. The accuracy of the rough-set representation of the set X can be given (Pawlak 1991) by the following:

α_P(X) = |P_*(X)| / |P^*(X)|

That is, the accuracy of the rough-set representation of X is the ratio of the number of objects which can positively be placed in X to the number of objects that can possibly be placed in X; this provides a measure of how closely the rough set approximates the target set. Clearly, when the upper and lower approximations are equal (i.e., the boundary region is empty), then α_P(X) = 1 and the approximation is perfect; at the other extreme, whenever the lower approximation is empty, the accuracy is zero (regardless of the size of the upper approximation).
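The construction described above can be expressed directly in code. Below is a minimal Python sketch (illustrative names and a made-up toy information table, not material from this unit) that computes the indiscernibility classes, the lower and upper approximations, the boundary region and the accuracy coefficient.

```python
from collections import defaultdict
from fractions import Fraction

def equivalence_classes(objects, attrs):
    # Group objects by their values on the chosen attributes (the relation IND(P)).
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attrs)].add(name)
    return list(classes.values())

def approximations(objects, attrs, target):
    classes = equivalence_classes(objects, attrs)
    lower = set().union(*([c for c in classes if c <= target] or [set()]))
    upper = set().union(*([c for c in classes if c & target] or [set()]))
    accuracy = Fraction(len(lower), len(upper)) if upper else Fraction(1)
    return lower, upper, upper - lower, accuracy

# Toy information table: object -> attribute values (values are illustrative only).
objects = {
    "O1": {"C": "high", "S": "low"},
    "O2": {"C": "avg",  "S": "high"},
    "O3": {"C": "avg",  "S": "high"},
    "O4": {"C": "low",  "S": "low"},
}
lower, upper, boundary, acc = approximations(objects, ["C", "S"], {"O1", "O3"})
print(lower, upper, boundary, acc)   # {'O1'}  {'O1','O2','O3'}  {'O2','O3'}  1/3
```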
In mathematics, fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets were introduced by Lotfi A. Zadeh and Dieter Klaua in 1965 as an extension of the classical notion of set. At the same time, Salii (1965) defined a more general kind of structure called L-relations, which he studied in an abstract algebraic context. Fuzzy relations, which are now used in different areas such as linguistics, decision-making and clustering, are special cases of L-relations when L is the unit interval [0, 1]. In classical set theory, the membership of elements in a set is assessed in binary terms according to a bivalent condition: an element either belongs or does not belong to the set. By contrast, fuzzy set theory permits the gradual assessment of the membership of elements in a set; this is described with the aid of a membership function valued in the real unit interval [0, 1]. Fuzzy sets generalize classical sets, since the indicator functions of classical sets are special cases of the membership functions of fuzzy sets, if the latter only take values 0 or 1. In fuzzy set theory, classical bivalent sets are usually called crisp sets. Fuzzy set theory can be used in a wide range of domains in which information is incomplete or imprecise, such as bioinformatics.

Definition

A fuzzy set is a pair (U, m), where U is a set and m : U → [0, 1] is a membership function. For each x ∈ U, the value m(x) is called the grade of membership of x in (U, m). For a finite set U = {x1, ..., xn}, the fuzzy set (U, m) is often denoted by {m(x1)/x1, ..., m(xn)/xn}. Let x ∈ U. Then x is called not included in the fuzzy set (U, m) if m(x) = 0, x is called a fuzzy member if 0 < m(x) < 1, and x is called fully included if m(x) = 1. The set {x ∈ U | m(x) > 0} is called the support of (U, m), and the set {x ∈ U | m(x) = 1} is called its kernel. The function m is called the membership function of the fuzzy set (U, m). Sometimes, more general variants of the notion of fuzzy set are used, with membership functions taking values in a (fixed or variable) algebra or structure L of a given kind; usually it is required that L be at least a poset or lattice. These are usually called L-fuzzy sets, to distinguish them from those valued over the unit interval. The usual membership functions with values in [0, 1] are then called [0, 1]-valued membership functions.

10.3 Genetic Algorithms

The genetic algorithm is one of the approaches commonly used in data mining. We put forward a genetic algorithm approach for classification problems. Binary coding is adopted, in which an individual in a population consists of a fixed number of rules that stand for a solution candidate. The evaluation function considers four important factors: error rate, entropy measure, rule consistency and whole ratio. Viewed more generally, genetic algorithms are a data mining technique used to winnow relevant data from large data sets to produce the fittest data or, in the context of a proposed problem, the fittest solution. For those who are not computer scientists or mathematicians, a genetic algorithm may best be understood as a computer-based calculation built on the idea that, as in evolutionary biology and genetics, entities in a population will over time evolve through natural selection towards their optimal condition. Genetic algorithms, or sets of rules, use genetic concepts of reproduction, selection, inheritance and so forth. If you begin with a large set of data, the application of genetic algorithms will eventually winnow it down to the elements that are the most "fit"; fitness is defined in terms of the particular problem. Genetic algorithms have been proposed, in the realm of counter-terrorism, to: extract the fittest nodes (or connection points) in terrorist networks, in order to analyse and act on that knowledge; determine the most optimal military or other strategy to use in a particular scenario, where "fitness" is the ability to resolve a violent conflict scenario; and create models of new threat scenarios by "evolving" the most dangerous scenarios from component parts (fitness in this case means the ability to survive existing strategies for their defeat). Genetic algorithms (GAs) are adaptive procedures derived from Darwin's principle of survival of the fittest in natural genetics. A GA maintains a population of potential solutions of the candidate problem, termed individuals. By manipulating these individuals through genetic operators such as selection, crossover and mutation, the GA evolves towards better solutions over a number of generations. The implementation of a genetic algorithm is shown as a flowchart in Figure 1.

[Figure 1: Flowchart of a genetic algorithm]

Genetic algorithms start with a randomly created initial population of individuals, which involves encoding every variable. A string of variables makes up a chromosome, or individual.
In the beginning phase of the implementation of genetic algorithms, in the early seventies, they were applied to solve continuous optimization problems with binary coding of variables; binary variables are mapped to real numbers in numerical problems. Later, GAs were used to solve many combinatorial optimization problems, such as the 0/1 knapsack problem, the travelling salesperson problem, scheduling problems, etc. Binary coding has not been found suitable for many of these problems, so codings other than binary have also been utilized. Continuous function optimization uses real-number coding. Problems such as the travelling salesperson problem and graph colouring use permutation coding. Genetic programming applications use tree coding. GAs use a fitness function, derived from the objective function of the optimization problem, to evaluate the individuals in a population. The fitness function is the measure of an individual's fitness, which is used to select individuals for reproduction. Many real-world problems may not have a well-defined objective function and require the user to define a fitness function.

The selection method in a GA selects parents from the population on the basis of the fitness of individuals. High-fitness individuals are selected with a higher probability of selection to reproduce offspring for the next population. Selection methods assign a probability P(x) to each individual x in the population at the current generation, which is proportional to the fitness of individual x relative to the rest of the population. Fitness-proportionate selection is the most commonly used selection method. Given f_i as the fitness of the ith individual, P(x) in fitness-proportionate selection is calculated as P(x) = f_x / Σ_i f_i. After the expected values P(x) are calculated, the individuals are selected using roulette wheel sampling: let T be the sum of the expected values of the individuals in the population; then, repeatedly, spin the wheel by choosing a random number r between 0 and T and selecting the first individual whose running total of expected values is at least r. This is repeated as many times as parents are needed for mating. Fitness-proportionate selection is strongly biased towards the fit individuals in the population and exerts high selection pressure. It can cause premature convergence of the GA, as the population is made up of highly fit individuals after a few generations and there is no fitness bias left for the selection procedure to work on. Therefore, other selection methods such as tournament selection and rank selection are used to avoid this bias. Tournament selection compares two or more randomly selected individuals and selects the better individual with a pre-specified probability. Rank selection calculates the probability of selection of individuals on the basis of their ranking according to increasing fitness values in the population.

In a standard genetic algorithm, two parents are selected at a time and are used to create two new children to take part in the next generation. The offspring are subjected to a crossover operator with a pre-specified probability of crossover. Single-point crossover is the most common form of this operator: it marks a random crossover spot within the length of the chromosome and exchanges the bits (in binary coding) to the right of that spot between the two parents. The mutation operator is applied to all the children after crossover. It flips each bit in the individual with a pre-specified probability of mutation (for example, the fifth bit of a child string might be flipped). The procedure is repeated until the number of individuals in the population is complete, which finishes one generation of the genetic algorithm.
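A minimal sketch of the three operators just described (fitness-proportionate selection via roulette wheel, single-point crossover and bit-flip mutation) on binary-string individuals follows. This is an illustrative implementation with assumed names, not the unit's own code.

```python
import random

def roulette_select(population, fitnesses):
    # Spin the wheel: pick the first individual whose cumulative fitness reaches r.
    total = sum(fitnesses)
    r = random.uniform(0, total)
    running = 0.0
    for individual, f in zip(population, fitnesses):
        running += f
        if running >= r:
            return individual
    return population[-1]

def single_point_crossover(parent1, parent2, p_crossover=0.9):
    # With probability p_crossover, exchange the tails to the right of a random spot.
    if random.random() < p_crossover:
        point = random.randint(1, len(parent1) - 1)
        return (parent1[:point] + parent2[point:],
                parent2[:point] + parent1[point:])
    return parent1, parent2

def mutate(individual, p_mutation=0.01):
    # Flip each bit independently with probability p_mutation.
    return [1 - bit if random.random() < p_mutation else bit for bit in individual]
```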
A GA is run until a stopping criterion is satisfied, which may be defined in many ways. A pre-specified number of generations is the most commonly used criterion; other criteria are the desired quality of solution, the number of generations without any improvement in the results, and so on. A standard genetic algorithm utilizes three genetic operators: reproduction (selection), crossover and mutation. Elitism in genetic algorithms is used to ensure that the best individual in a population is passed on, unperturbed by the genetic operators, to the population at the next generation. The values of genetic parameters such as population size, crossover probability, mutation probability and total number of generations affect the convergence properties of genetic algorithms. Values of these parameters are generally decided before the start of the GA run on the basis of previous experience. Experimental studies recommend values such as: population size of 20 to 30, crossover probability between 0.75 and 0.95, and mutation probability between 0.005 and 0.01. The parameters may also be fixed by tuning in trial GA runs before the start of the actual run. Deterministic control and adaptation of the parameter values to a particular application have also been used to determine values of genetic parameters. In deterministic control, the value of a genetic parameter is altered by some deterministic rule during the GA run. Adaptation of parameters allows their values to change during the GA run on the basis of performance in previous generations. In self-adaptation, the operator settings are encoded into each individual in the population, so that the parameter values themselves evolve during the GA run.

Applications in Data Mining

Data mining has been used to analyse large datasets and to establish useful classifications and patterns in them. Agricultural and biological research studies have used various techniques of data mining, including decision trees, statistical machine learning and other analysis methods. The genetic algorithm has been widely used in data mining applications such as classification, clustering and feature selection. Two applications of GAs in data mining are described below. First, the effectiveness of classification algorithms: genetic algorithms, fuzzy classification and fuzzy clustering have been compared and analysed on collected supervised and unsupervised soil data. Soil classification deals with the categorization of soils based on distinguishing characteristics as well as criteria that dictate choices in use. Second, genetic algorithms for feature selection when mining SNPs in association studies: genomic studies provide large volumes of data with thousands of single nucleotide polymorphisms (SNPs). The analysis of SNPs determines relationships between genotypic and phenotypic information and helps in the identification of SNPs related to a disease; on this basis, an approach for predicting drug effectiveness has been developed that combines data mining and genetic algorithms.

10.4 Rough Set Approach

Rough set theory is a relatively new mathematical approach to data analysis and data mining. After some fifteen years of pursuing rough set theory and its applications, the theory has reached a certain degree of maturity. In recent years we have witnessed a rapid growth of interest in rough set theory and its applications worldwide. Many international workshops, conferences and seminars have included rough sets in their programmes, and a large number of high-quality papers have been published recently on various aspects of rough sets.
Various real-life applications of rough set theory have shown its usefulness in many domains, and very promising new areas of application of the rough set concept seem to be emerging. They include rough control, rough databases, rough information retrieval, rough neural networks and others. Rough set theory can undoubtedly also contribute to materials science. The rough sets theory was created by Z. Pawlak at the beginning of the 1980s and is useful in the process of data mining. It offers mathematical tools for discovering hidden patterns in data through the identification of partial and total dependencies in data. It also enables work with null or missing values. Rough sets can be used separately, but usually they are used together with other methods such as fuzzy sets, statistical methods, genetic algorithms, etc. The rough sets theory uses a different approach to uncertainty; like fuzzy sets, it is an extension of the classical theory, not an alternative.

BASIC CONCEPTS

Rough set philosophy is founded on the assumption that with every object of the universe of discourse we associate some information (data, knowledge). Objects characterized by the same information are indiscernible (similar) in view of the available information about them. The indiscernibility relation generated in this way is the mathematical basis of rough set theory. Any set of all indiscernible (similar) objects is called an elementary set, and forms a basic granule (atom) of knowledge about the universe. Any union of some elementary sets is referred to as a crisp (precise) set; otherwise the set is rough (imprecise, vague). Obviously rough sets, in contrast to precise sets, cannot be characterized in terms of information about their elements. In the proposed approach, with any rough set a pair of precise sets, called the lower and the upper approximation of the rough set, is associated. The lower approximation consists of all objects which surely belong to the set, and the upper approximation contains all objects which possibly belong to the set. The difference between the upper and the lower approximation constitutes the boundary region of the rough set. Approximations are the two basic operations used in rough set theory.

Data are often presented as a table, the columns of which are labelled by attributes, the rows by objects of interest, and the entries of which are attribute values. Such tables are known as information systems, attribute-value tables, data tables or information tables. Usually we distinguish in information tables two kinds of attributes, called condition and decision attributes; such tables are known as decision tables. Rows of a decision table are referred to as "if ... then ..." decision rules, which give the conditions necessary to make the decisions specified by the decision attributes. An example of a decision table is shown in Table 1. The table contains data concerning six cast iron pipes exposed to a high-pressure endurance test. In the table, C, S and P are condition attributes, displaying the percentage content in the pig-iron of coal, sulfur and phosphorus respectively, whereas the attribute Cracks reveals the result of the test. The values of the condition attributes are coded as follows: (C, high) > 3.6%, (C, avg.) between 3.5% and 3.6%, (C, low) < 3.5%; (S, high) ≥ 0.1%, (S, low) < 0.1%; (P, high) ≥ 0.3%, (P, low) < 0.3%.
The main problem we are interested in is how the endurance of the pipes depends on the compounds C, S and P contained in the pig-iron, or, in other words, whether there is a functional dependency between the decision attribute Cracks and the condition attributes C, S and P. In the language of rough set theory this boils down to the question of whether the set {2, 4, 5} of all pipes having no cracks after the test (or the set {1, 3, 6} of pipes having cracks) can be uniquely defined in terms of condition attribute values. It can easily be seen that this is impossible, since pipes 2 and 3 display the same features in terms of attributes C, S and P, but have different values of the attribute Cracks. Thus the information given in Table 1 is not sufficient to solve our problem. However, we can give a partial solution. Let us observe that if the attribute C has the value high for a certain pipe, then the pipe has cracks, whereas if the value of the attribute C is low, then the pipe has no cracks. Hence, employing attributes C, S and P, we can say that pipes 1 and 6 surely belong to the set {1, 3, 6}, whereas pipes 1, 2, 3 and 6 possibly belong to the set {1, 3, 6}. Thus the sets {1, 6}, {1, 2, 3, 6} and {2, 3} are, respectively, the lower approximation, the upper approximation and the boundary region of the set {1, 3, 6}. This means that the quality of the pipes cannot be determined exactly by the content of coal, sulfur and phosphorus in the pig-iron, but can be determined only with some approximation. In fact, approximations determine the dependency (total or partial) between condition and decision attributes, i.e., they express the functional relationship between values of condition and decision attributes. The degree of dependency between condition and decision attributes can be defined as the consistency factor of the decision table, which is the ratio of the number of non-conflicting decision rules to the number of all decision rules in the table. By conflicting decision rules we mean rules having the same conditions but different decisions. For example, the consistency factor for Table 1 is 4/6 = 2/3; hence the degree of dependency between cracks and the composition of the pig-iron is 2/3. That means that four out of six (about 67%) pipes can be properly classified as good or not good on the basis of their composition. We might also be interested in reducing some of the condition attributes, i.e., in knowing whether all conditions are necessary to make the decisions specified in the table. To this end we employ the notion of a reduct (of condition attributes). By a reduct we understand a minimal subset of condition attributes which preserves the consistency factor of the table. It is easy to compute that in Table 1 we have two reducts, {C, S} and {C, P}. The intersection of all reducts is called the core. In our example the core is the attribute C. That means that, in view of the data, coal is the most important factor causing cracks and cannot be eliminated from our considerations, whereas sulfur and phosphorus play a minor role and can be mutually exchanged as factors causing cracks.
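As a small illustration of the consistency factor just described, the sketch below counts conflicting decision rules (same condition values but different decisions) in a toy decision table. The table values are made up for illustration and are not the unit's Table 1.

```python
from collections import defaultdict

def consistency_factor(rows, condition_attrs, decision_attr):
    # Group rows by their condition-attribute values and check decision agreement.
    decisions_for = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        decisions_for[key].add(row[decision_attr])
    consistent = sum(1 for row in rows
                     if len(decisions_for[tuple(row[a] for a in condition_attrs)]) == 1)
    return consistent / len(rows)

# Toy decision table (values illustrative only): rows 2 and 3 conflict.
table = [
    {"C": "high", "S": "low",  "Cracks": "yes"},
    {"C": "avg",  "S": "high", "Cracks": "no"},
    {"C": "avg",  "S": "high", "Cracks": "yes"},
    {"C": "low",  "S": "low",  "Cracks": "no"},
    {"C": "low",  "S": "high", "Cracks": "no"},
    {"C": "high", "S": "high", "Cracks": "yes"},
]
print(consistency_factor(table, ["C", "S"], "Cracks"))   # 4/6 ≈ 0.67
```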
Now we present the basic concepts more formally. Suppose we are given two finite, non-empty sets U and A, where U is the universe and A is a set of attributes. With every attribute a ∈ A we associate a set V_a of its values, called the domain of a. Any subset B of A determines a binary relation I(B) on U, called an indiscernibility relation, defined as follows: x I(B) y if and only if a(x) = a(y) for every a ∈ B, where a(x) denotes the value of attribute a for element x. Obviously I(B) is an equivalence relation. The family of all equivalence classes of I(B), i.e., the partition determined by B, will be denoted by U/I(B), or simply U/B; the equivalence class of I(B), i.e., the block of the partition U/B, containing x will be denoted by B(x). If (x, y) belongs to I(B) we say that x and y are B-indiscernible. Equivalence classes of the relation I(B) (or blocks of the partition U/B) are referred to as B-elementary sets. In the rough set approach the elementary sets are the basic building blocks of our knowledge about reality. The indiscernibility relation is used next to define the basic concepts of rough set theory. Let us now define the following two operations on sets, assigning to every subset X of the universe U two sets B_*(X) and B^*(X), called the B-lower and the B-upper approximation of X, respectively:

B_*(X) = {x ∈ U : B(x) ⊆ X},
B^*(X) = {x ∈ U : B(x) ∩ X ≠ ∅}.

The set BN_B(X) = B^*(X) − B_*(X) will be referred to as the B-boundary region of X. If the boundary region of X is the empty set, i.e., BN_B(X) = ∅, then the set X is crisp (exact) with respect to B; in the opposite case, i.e., if BN_B(X) ≠ ∅, the set X is referred to as rough (inexact) with respect to B. A rough set can also be characterized numerically by the following coefficient, called the accuracy of approximation:

α_B(X) = |B_*(X)| / |B^*(X)|,

where |X| denotes the cardinality of X. Obviously 0 ≤ α_B(X) ≤ 1. If α_B(X) = 1, X is crisp (precise) with respect to B; otherwise, if α_B(X) < 1, X is rough with respect to B. Approximations can be employed to define dependencies (total or partial) between attributes, reduction of attributes, decision rule generation and other notions, but we will not discuss these issues here; for details we refer the reader to the references.

APPLICATIONS

Rough set theory has found many interesting applications. The rough set approach seems to be of fundamental importance to AI and cognitive sciences, especially in the areas of machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, inductive reasoning and pattern recognition. It seems of particular importance to decision support systems and data mining. The main advantage of rough set theory is that it does not need any preliminary or additional information about data, such as probability in statistics, basic probability assignment in Dempster–Shafer theory, or grade of membership or the value of possibility in fuzzy set theory. Rough set theory has been successfully applied to many real-life problems in medicine, pharmacology, engineering, banking, financial and market analysis, and other fields. Some exemplary applications are listed below. There are many applications in medicine. In pharmacology, the analysis of relationships between the chemical structure and the antimicrobial activity of drugs has been successfully investigated. Banking applications include evaluation of bankruptcy risk and market research. Very interesting results have also been obtained in speaker-independent speech recognition and acoustics. The rough set approach also seems important for various engineering applications, such as diagnosis of machines using vibroacoustic symptoms (noise, vibrations) and process control. Applications in linguistics, environmental studies and databases are other important domains. Application of rough sets requires suitable software.
Many software systems based on rough set theory have been developed for workstations and personal computers. The best known include LERS, Rough DAS, Rough Class and DATALOGIC; some of them are available commercially. The main advantage of rough set theory in data analysis is that it does not need any preliminary or additional information about data, such as probability in statistics, basic probability assignment in Dempster–Shafer theory, or grade of membership or the value of possibility in fuzzy set theory. The approach provides efficient algorithms for finding hidden patterns in data, finds minimal sets of data (data reduction), evaluates the significance of data, and generates sets of decision rules from data; it is easy to understand, offers a straightforward interpretation of the obtained results, and most algorithms based on rough set theory are particularly suited to parallel processing.

10.5 Fuzzy Set Approach

Lotfi Zadeh proposed a completely new, elegant approach to vagueness called fuzzy set theory. In his approach an element can belong to a set to a degree k (0 ≤ k ≤ 1), in contrast to classical set theory, where an element must either definitely belong or not belong to a set. For example, in classical set theory one can be definitely ill or healthy, whereas in fuzzy set theory we can say that someone is ill (or healthy) to a degree of 60 percent (i.e., to the degree 0.6). Of course, the question immediately arises of where we get the value of the degree from. This issue has raised a lot of discussion, but we will refrain from considering this problem here. The fuzzy membership function can thus be presented as

μ_X(x) ∈ [0, 1]

where X is a set and x is an element. Let us observe that the definition of a fuzzy set involves more advanced mathematical concepts, namely real numbers and functions, whereas in classical set theory the notion of a set is used as a fundamental notion of the whole of mathematics and is used to derive any other mathematical concept, e.g., numbers and functions. Consequently, fuzzy set theory cannot replace classical set theory, because, in fact, classical set theory is needed to define fuzzy sets. The fuzzy membership function has the following properties:

a) μ_{U−X}(x) = 1 − μ_X(x) for any x ∈ U
b) μ_{X∪Y}(x) = max(μ_X(x), μ_Y(x)) for any x ∈ U
c) μ_{X∩Y}(x) = min(μ_X(x), μ_Y(x)) for any x ∈ U

This means that the membership of an element in the union and intersection of sets is uniquely determined by its membership in the constituent sets. This is a very nice property and allows very simple operations on fuzzy sets, which is a very important feature both theoretically and practically. Fuzzy set theory and its applications have developed very extensively over recent years and have attracted the attention of practitioners, logicians and philosophers worldwide.
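A minimal sketch of these membership-function operations (complement, max-union and min-intersection) on finite fuzzy sets represented as dictionaries follows; the names and data are illustrative assumptions, not material from this unit.

```python
def complement(mu):
    # mu maps each element of the universe to its membership grade in [0, 1].
    return {x: 1.0 - g for x, g in mu.items()}

def union(mu_x, mu_y):
    # Grade in the union is the maximum of the grades in the constituent sets.
    return {x: max(mu_x.get(x, 0.0), mu_y.get(x, 0.0)) for x in set(mu_x) | set(mu_y)}

def intersection(mu_x, mu_y):
    # Grade in the intersection is the minimum of the grades in the constituent sets.
    return {x: min(mu_x.get(x, 0.0), mu_y.get(x, 0.0)) for x in set(mu_x) | set(mu_y)}

ill     = {"ann": 0.6, "bob": 0.1, "cai": 0.9}
elderly = {"ann": 0.3, "bob": 0.8, "cai": 0.9}
print(union(ill, elderly))         # ann: 0.6, bob: 0.8, cai: 0.9 (order may vary)
print(intersection(ill, elderly))  # ann: 0.3, bob: 0.1, cai: 0.9
print(complement(ill))             # ann: 0.4, bob: 0.9, cai: 0.1
```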
FUZZY INFORMATION

Fuzzy sets are sets whose elements have degrees of membership. Fuzzy sets were introduced simultaneously by Lotfi A. Zadeh and Dieter Klaua in 1965 as an extension of the classical notion of set. In classical set theory, the membership of elements in a set is assessed in binary terms according to a bivalent condition: an element either belongs or does not belong to the set. By contrast, fuzzy set theory permits the gradual assessment of the membership of elements in a set; this is described with the aid of a membership function valued in the real unit interval [0, 1]. Fuzzy sets generalize classical sets, since the indicator functions of classical sets are special cases of the membership functions of fuzzy sets, if the latter only take values 0 or 1. In fuzzy set theory, classical bivalent sets are usually called crisp sets. Fuzzy set theory can be used in a wide range of domains in which information is incomplete or imprecise, such as bioinformatics.

FUZZY LOGIC

Fuzzy logic is a form of many-valued logic or probabilistic logic; it deals with reasoning that is approximate rather than fixed and exact. In contrast with traditional logic theory, where binary sets have two-valued logic, true or false, fuzzy logic variables may have a truth value that ranges in degree between 0 and 1. Fuzzy logic has been extended to handle the concept of partial truth, where the truth value may range between completely true and completely false. Furthermore, when linguistic variables are used, these degrees may be managed by specific functions.

Typical Applications of Fuzzy Set Theory

The tools and technologies that have been developed in fuzzy set theory (FST) have the potential to support all of the steps that comprise a process of model induction or knowledge discovery. In particular, FST can already be used in the data selection and preparation phase, e.g., for modelling vague data in terms of fuzzy sets, to "condense" several crisp observations into a single fuzzy one, or to create fuzzy summaries of the data. As the data to be analysed thus become fuzzy, one subsequently faces the problem of analysing fuzzy data, i.e., of fuzzy data analysis. The problem of analysing fuzzy data can be approached in at least two principally different ways. First, standard methods of data analysis can be extended in a rather generic way by means of an extension principle, that is, by "fuzzifying" the mapping from data to models. A second, often more sophisticated approach is based on embedding the data into more complex mathematical spaces, such as fuzzy metric spaces, and carrying out data analysis in these spaces. If fuzzy methods are not used in the data preparation phase, they can still be employed at a later stage in order to analyse the original data. Thus, it is not the data to be analysed that are fuzzy, but rather the methods used for analysing the data (in the sense of resorting to tools from FST). Subsequently, we shall focus on this type of fuzzy data analysis.

Fuzzy Cluster Analysis

Many conventional clustering algorithms, such as the prominent k-means algorithm, produce a clustering structure in which every object is assigned to one cluster in an unequivocal way. Consequently, the individual clusters are separated by sharp boundaries. In practice, such boundaries are often not very natural or even counterintuitive; rather, the boundaries of single clusters and the transitions between different clusters are usually "smooth". This is the main motivation underlying fuzzy extensions of clustering algorithms. In fuzzy clustering, an object may belong to different clusters at the same time, at least to some extent, and the degree to which it belongs to a particular cluster is expressed in terms of a fuzzy membership. The membership functions of the different clusters (defined on the set of observed data points) are usually assumed to form a partition of unity. This version, often called probabilistic clustering, can be generalized further by weakening this constraint, as, e.g., in possibilistic clustering. Fuzzy clustering has proved to be extremely useful in practice and is now routinely applied also outside the fuzzy community (e.g., in recent bioinformatics applications).
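To make the fuzzy-clustering idea concrete, here is a minimal sketch in the style of fuzzy c-means, where each point's memberships across clusters sum to one. It is a generic illustration on made-up one-dimensional data with assumed names, not an algorithm taken from this unit's sources.

```python
import random

def fuzzy_c_means(points, c=2, m=2.0, iterations=20):
    # Random initial memberships, normalized so each point's memberships sum to 1.
    u = [[random.random() for _ in range(c)] for _ in points]
    u = [[g / sum(row) for g in row] for row in u]
    for _ in range(iterations):
        # Update cluster centres as membership-weighted means of the points.
        centres = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(len(points))]
            centres.append(sum(wi * x for wi, x in zip(w, points)) / sum(w))
        # Update memberships from distances to the centres (closer => higher grade).
        for i, x in enumerate(points):
            d = [abs(x - cj) + 1e-9 for cj in centres]
            for j in range(c):
                u[i][j] = 1.0 / sum((d[j] / d[k]) ** (2 / (m - 1)) for k in range(c))
    return centres, u

points = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
centres, memberships = fuzzy_c_means(points)
print(centres)   # two centres, roughly near 1.0 and 5.0 (order may vary)
```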
Learning Fuzzy Rule-Based Systems

The most frequent application of FST in machine learning is the induction or adaptation of rule-based models. This is hardly astonishing, since rule-based models have always been a cornerstone of fuzzy systems and a central aspect of research in the field, not only in machine learning and data mining but also in many other subfields, notably approximate reasoning and fuzzy control. (Often, the term fuzzy system implicitly refers to a fuzzy rule-based system.) Fuzzy rule-based systems can represent both classification and regression functions, and different types of fuzzy models have been used for these purposes. In order to realize a regression function, a fuzzy system is usually wrapped in a "fuzzifier" and a "defuzzifier": the former maps a crisp input to a fuzzy one, which is then processed by the fuzzy system, and the latter maps the (fuzzy) output of the system back to a crisp value. For so-called Takagi–Sugeno models, which are quite popular for modelling regression functions, the defuzzification step is unnecessary, since these models output crisp values directly. In the case of classification learning, the consequent of a single rule is usually a class assignment (i.e., a singleton fuzzy set). Evaluating a rule base thus becomes trivial and simply amounts to "maximum matching", that is, searching for the maximally supporting rule for each class. Thus, much of the appealing interpolation and approximation behaviour of fuzzy inference gets lost, and fuzziness only means that rules can be activated to a certain degree. There are, however, alternative methods which combine the predictions of several rules into a classification of the query. In methods of that kind, the degree of activation of a rule provides important information. Besides, activation degrees can be very useful, e.g., for characterizing the uncertainty involved in a classification decision.

Fuzzy Decision Tree Induction

Fuzzy variants of decision tree induction have been developed for quite a while and seem to remain a topic of interest even today. In fact, these approaches provide a typical example of the "fuzzification" of standard machine learning methods. In the case of decision trees, it is primarily the "crisp" thresholds used for defining splitting predicates (constraints), such as size ≤ 181, at inner nodes that have been criticized: such thresholds lead to hard decision boundaries in the input space, which means that a slight variation of an attribute (e.g., size = 182 instead of size = 181) can entail a completely different classification of an object (e.g., of a person characterized by size, weight, gender, ...). Moreover, the learning process becomes unstable in the sense that a slight variation of the training examples can change the induced decision tree drastically. In order to make the decision boundaries "soft", an obvious idea is to apply fuzzy predicates at the inner nodes of a decision tree, such as size ∈ TALL, where TALL is a fuzzy set (rather than an interval). In other words, a fuzzy partition instead of a crisp one is used for the splitting attribute (here size) at an inner node. Since an example can satisfy a fuzzy predicate to a certain degree, the examples are partitioned in a fuzzy manner as well. That is, an object is not assigned to exactly one successor node in a unique way, but perhaps to several successors with a certain degree.
10.6 Summary Genetic algorithms (GA's) are search algorithms that work via the process of natural selection. They begin with a sample set of potential solutions which then evolves toward a set of more optimal solutions. Within the sample set, solutions that are poor tend to die out while better solutions mate and propagate their advantageous traits, thus introducing more solutions into the set that boast greater potential (the total set size remains constant; for each new solution added, an old one is removed). A little random mutation helps guarantee that a set won't stagnate and simply fill up with numerous copies of the same solution. In general, genetic algorithms tend to work better than traditional optimization algorithms because they're less likely to be led astray by local optima. This is because they don't make use of single-point transition rules to move from one single instance in the solution space to 216 another. Instead, GA's take advantage of an entire set of solutions spread throughout the solution space, all of which are experimenting upon many potential optima. However, in order for genetic algorithms to work effectively, a few criteria must be met: It must be relatively easy to evaluate how "good" a potential solution is relative to other potential solutions. It must be possible to break a potential solution into discrete parts that can vary independently. These parts become the "genes" in the genetic algorithm. Finally, genetic algorithms are best suited for situations where a "good" answer will suffice, even if it's not the absolute best answer. Rough set theory is a new mathematical approach to imperfect knowledge. The problem of imperfect knowledge has been tackled for a long time by philosophers, logicians and mathematicians. Recently it became also a crucial issue for computer scientists, particularly in the area of artificial intelligence. There are many approaches to the problem of how to understand and manipulate imperfect knowledge. The most successful one is, no doubt, the fuzzy set theory proposed by Zadeh . Rough set theory has found many interesting applications. The rough set approach seems to be of fundamental importance to AI and cognitive sciences, especially in the areas of machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, inductive reasoning and pattern recognition. The areas of fuzzy sets and rough sets have become topics of great research interest, particularly in the last 20 or so years. The integration or hybridization of such techniques has also attracted much attention, due mainly to the fact that these distinct approaches to data and knowledge modeling are complementary when attempting to deal with uncertainty and noise. A large body of the work on fuzzy-rough set hybridization, however, has tended to focus on formal aspects of the theory and thus has been framed in that context 10.7 Keywords Genetic Algorithms, Set, Rough set, Fuzzy Set. 10.8 Exercises 1. Give an example of combinatorial problem. What is the most difficult in solving these problems? Write short notes on genetic algorithms. 2. Write short notes on rough sets. 3. Give the flowchart of a genetic algorithm. 217 4. Write short notes on fuzzy theory and its applications. 5. Explain fuzzy clustering. 6. Name and describe the main features of Genetic Algorithms (GA). 7. How can Rough sets be applied in Data Mining? 8. Describe the fuzzy methods for rule learning, 10.9 References 1. 
Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing Slezak, D., Szczuka, Duentsch, I., Yao, Y. (Eds.) 2. Polkowski, L. (2002). "Rough sets: Mathematical foundations". Advances in Soft Computing. 3. Dubois, D.; Prade, H. (1990). "Rough fuzzy sets and fuzzy rough sets". International Journal of General Systems 17 (2–3): 191–209. doi:10.1080/03081079008935107. 4. Didier Dubois, Henri M. Prade, ed. (2000). Fundamentals of fuzzy sets. The Handbooks of Fuzzy Sets Series 7. Springer. ISBN 978-0-7923-7732-0. 218 UNIT 11: PREDICTION THEORY OF CLASSIFIERS Structure 11.1 Objectives 11.2 Introduction 11.3 Estimating the Predictive Accuracy of a Classifier 11.4 Evaluating the Accuracy of a Classifier 11.5 Multiclass Problem 11.6 Summary 11.7 Keywords 11.8 Exercises 11.9 References 11.1 Objectives The objectives covered under this unit include: Estimating the Predictive accuracy of a Classifier Evaluating the accuracy of a Classifier Multiclass Problem 11.2 Introduction Classifiers are functions which partition a set into two classes (for example, the set of rainy days and the set of sunny days). Classifiers appear to be the most simple nontrivial decision making element so their study often has implications for other learning systems. Classifiers are sufficiently complex that many phenomena observed in machine learning (theoretically or experimentally) can be observed in the classification setting. Yet, Classifiers are simple enough to make their analysis easy to understand. This combination of sufficient yet minimal complexity for capturing phenomena makes the study of Classifiers especially fruitful. The multi-class classification problem is an extension of the traditional binary class problem where a dataset consists of k classes instead of two. While imbalance is said to exist in the binary class imbalance problem when one class severely outnumbers the other class, extended to multiple classes the effects of imbalance are even more problematic. That is, given k classes, there are multiple ways for class imbalance to manifest itself in the dataset. 219 One typical way is there is one ―super majority‖ class which contains most of the instances in the dataset. Another typical example of class imbalance in multi-class datasets is the result of a single minority class. In such instances k−1 instances each make up roughly1/ (k − 1) of the dataset, and the ―minority‖ class makes up the rest. 11.3 Estimating the Predictive Accuracy of a Classifier Any algorithm which assigns a classification to unseen instances is called a classifier. A decision tree is one of the very popular types of classifier, but there are several others, some of which are described elsewhere in this book. This chapter is concerned with estimating the performance of a classifier of any kind but will be illustrated using decision trees generated with attribute selection using information gain. Although the data compression can sometimes be important, in practice the principal reason for generating a classifier is to enable unseen instances to be classified. However we have already seen that many different classifiers can be generated from a given dataset. Each one is likely to perform differently on a set of unseen instances. The most obvious criterion to use for estimating the performance of a classifier is predictive accuracy, i.e. the proportion of a set of unseen instances that it correctly classifies. 
This is often seen as the most important criterion, but other criteria are also important, for example algorithmic complexity, efficient use of machine resources and comprehensibility. For most domains of interest the number of possible unseen instances is potentially very large (e.g. all those who might develop an illness, the weather for every possible day in the future, or all the possible objects that might appear on a radar display), so it is not possible ever to establish the predictive accuracy beyond dispute. Instead, it is usual to estimate the predictive accuracy of a classifier by measuring its accuracy for a sample of data not used when it was generated. There are three main strategies commonly used for this: dividing the data into a training set and a test set, k-fold cross-validation, and N-fold (or leave-one-out) cross-validation.

Method 1: Separate Training and Test Sets

For the 'train and test' method the available data is split into two parts called a training set and a test set (Figure 11.1). First, the training set is used to construct a classifier (decision tree, neural net etc.). The classifier is then used to predict the classification for the instances in the test set. If the test set contains N instances of which C are correctly classified, the predictive accuracy of the classifier for the test set is p = C/N. This can be used as an estimate of its performance on any unseen dataset.

[Figure 11.1: Training and Testing]

NOTE. For some datasets in the UCI Repository (and elsewhere) the data is provided as two separate files, designated as the training set and the test set. In such cases we will consider the two files together as comprising the 'dataset' for that application. In cases where the dataset is only a single file we need to divide it into a training set and a test set before using Method 1. This may be done in many ways, but a random division into two parts in proportions such as 1:1, 2:1, 70:30 or 60:40 would be customary.

Standard Error

It is important to bear in mind that the overall aim is not (just) to classify the instances in the test set but to estimate the predictive accuracy of the classifier for all possible unseen instances, which will generally be many times the number of instances contained in the test set. If the predictive accuracy calculated for the test set is p and we go on to use the classifier to classify the instances in a different test set, it is very likely that a different value for predictive accuracy would be obtained. All that we can say is that p is an estimate of the true predictive accuracy of the classifier for all possible unseen instances. We cannot determine the true value without collecting all the instances and running the classifier on them, which is usually an impossible task. Instead, we can use statistical methods to find a range of values within which the true value of the predictive accuracy lies, with a given probability or 'confidence level'. To do this we use the standard error associated with the estimated value p. If p is calculated using a test set of N instances, the value of its standard error is √(p(1 − p)/N). The significance of the standard error is that it enables us to say that, with a specified probability (which we can choose), the true predictive accuracy of the classifier is within so many standard errors above or below the estimated value p. The more certain we wish to be, the greater the number of standard errors.
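Before turning to confidence levels, here is a minimal sketch of the estimate p and its standard error as just defined, assuming simple lists of predicted and true labels; the names are illustrative.

```python
import math

def predictive_accuracy(predicted, actual):
    # p = C / N, the proportion of test instances classified correctly.
    correct = sum(1 for p_lbl, a_lbl in zip(predicted, actual) if p_lbl == a_lbl)
    return correct / len(actual)

def standard_error(p, n):
    # Standard error of the accuracy estimate from a test set of n instances.
    return math.sqrt(p * (1 - p) / n)

p = predictive_accuracy(["yes"] * 80 + ["no"] * 20, ["yes"] * 100)
print(p, standard_error(p, 100))   # 0.8 and 0.04, matching the worked example that follows
```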
The probability is called the confidence level, denoted by CL, and the number of standard errors is usually written as Z_CL. Table 11.1 shows the relationship between commonly used values of CL and Z_CL.

Table 11.1 Values of Z_CL for certain confidence levels

Confidence Level (CL)   0.90   0.95   0.99
Z_CL                    1.64   1.96   2.58

If the predictive accuracy for a test set is p, with standard error S, then using this table we can say that with probability CL (or with a confidence level CL) the true predictive accuracy lies in the interval p ± Z_CL × S.

Example. If the classifications of 80 instances out of a test set of 100 instances were predicted accurately, the predictive accuracy on the test set would be 80/100 = 0.8. The standard error would be √(0.8 × 0.2/100) = √0.0016 = 0.04. We can say that with probability 0.95 the true predictive accuracy lies in the interval 0.8 ± 1.96 × 0.04, i.e. between 0.7216 and 0.8784 (to four decimal places). Instead of a predictive accuracy of 0.8 (or 80%) we often refer to an error rate of 0.2 (or 20%). The standard error for the error rate is the same as that for predictive accuracy. The value of CL to use when estimating predictive accuracy is a matter of choice, although it is usual to choose a value of at least 0.9. The predictive accuracy of a classifier is often quoted in technical papers as just p ± √(p(1 − p)/N), without any multiplier Z_CL.

Repeated Train and Test. Here the classifier is used to classify k test sets, not just one. If all the test sets are of the same size, N, the predictive accuracy values obtained for the k test sets are averaged to produce an overall estimate p. As the total number of instances in the test sets is kN, the standard error of the estimate p is √(p(1 − p)/kN). If the test sets are not all of the same size, the calculations are slightly more complicated. If there are N_i instances in the ith test set (1 ≤ i ≤ k) and the predictive accuracy calculated for the ith test set is p_i, the overall predictive accuracy p is the weighted average of the p_i values, p = (Σ_{i=1}^{k} p_i N_i) / T, where T = Σ_{i=1}^{k} N_i. The standard error is √(p(1 − p)/T).

Method 2: k-fold Cross-validation

An alternative approach to 'train and test' that is often adopted when the number of instances is small (and which many prefer to use regardless of size) is known as k-fold cross-validation (Figure 11.2). If the dataset comprises N instances, these are divided into k equal parts, k typically being a small number such as 5 or 10. (If N is not exactly divisible by k, the final part will have fewer instances than the other k − 1 parts.) A series of k runs is now carried out. Each of the k parts in turn is used as a test set and the other k − 1 parts are used as a training set. The total number of instances correctly classified (in all k runs combined) is divided by the total number of instances N to give an overall level of predictive accuracy p, with standard error √(p(1 − p)/N).

[Figure 11.2: k-fold Cross-validation]

Method 3: N-fold Cross-validation

N-fold cross-validation is an extreme case of k-fold cross-validation, often known as 'leave-one-out' cross-validation or jackknifing, where the dataset is divided into as many parts as there are instances, each instance effectively forming a test set of one. N classifiers are generated, each from N − 1 instances, and each is used to classify a single test instance. The predictive accuracy p is the total number correctly classified divided by the total number of instances. The standard error is √(p(1 − p)/N).
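The following is a minimal sketch of k-fold cross-validation (leave-one-out being the special case k = N), assuming a train_and_classify(train, test_instance) function exists for whatever classifier is being evaluated; that function and the other names are illustrative assumptions.

```python
import math

def k_fold_accuracy(instances, labels, k, train_and_classify):
    # Assign instances to k folds; each fold serves as the test set exactly once.
    n = len(instances)
    fold_of = [i % k for i in range(n)]          # simple round-robin assignment
    correct = 0
    for fold in range(k):
        train = [(x, y) for x, y, f in zip(instances, labels, fold_of) if f != fold]
        test = [(x, y) for x, y, f in zip(instances, labels, fold_of) if f == fold]
        for x, y in test:
            if train_and_classify(train, x) == y:
                correct += 1
    p = correct / n
    return p, math.sqrt(p * (1 - p) / n)         # overall accuracy and its standard error

# Leave-one-out is simply k_fold_accuracy(instances, labels, len(instances), ...).
```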
The large amount of computation involved makes N-fold cross-validation unsuitable for use with large datasets. For other datasets, it is not clear whether any gain in the accuracy of the estimates produced by using N-fold cross-validation justifies the additional computation involved. In practice, the method is most likely to be of benefit with very small datasets, where as much data as possible needs to be used to train the classifier.

Experimental Results I

In this section we look at experiments to estimate the predictive accuracy of classifiers generated for four datasets. All the results in this section were obtained using the TDIDT tree induction algorithm, with information gain used for attribute selection. Basic information about the datasets is given in Table 11.2 below. Further information about these and most of the other datasets mentioned in this book is given in Appendix B.

Table 11.2 Four datasets

Dataset        Description                                    Classes   Attributes (categ / cts)   Instances
vote           Voting in US Congress in 1984                  2         16 / –                     300 (training) + 135 (test)
pima-indians   Prevalence of diabetes in Pima Indian women    2         – / 8                      768
chess          Chess endgame                                  2         7 / –                      647
glass          Glass identification                           7         – / 9*                     214

categ: categorical; cts: continuous. * Plus one 'ignore' attribute.

The vote, pima-indians and glass datasets are all taken from the UCI Repository. The chess dataset was constructed for a well-known series of machine learning experiments (Quinlan, 1979). The vote dataset has separate training and test sets. The other three datasets were each first divided into two parts, with every third instance placed in the test set and the other two placed in the training set. The result for the vote dataset illustrates the point that TDIDT (along with some but not all other classification algorithms) is sometimes unable to classify an unseen instance (Table 11.3).

Table 11.3 Train and Test Results for Four Datasets

Dataset        Test set (instances)   Correctly classified    Incorrectly classified   Unclassified
vote           135                    126 (93% ± 2%)          7                        2
pima-indians   256                    191 (75% ± 3%)          65                       0
chess          215                    214 (99.5% ± 0.5%)      1                        0
glass          71                     50 (70% ± 5%)           21                       0

Unclassified instances can be dealt with by giving the classifier a 'default strategy', such as always allocating them to the largest class, and that will be the approach followed for the remainder of this chapter. It could be argued that it might be better to leave unclassified instances as they are, rather than risk introducing errors by assigning them to a specific class or classes. In practice the number of unclassified instances is generally small, and how they are handled makes little difference to the overall predictive accuracy. Table 11.4 gives the 'train and test' result for the vote dataset modified to incorporate the 'default to largest class' strategy. The difference is slight.

Table 11.4 Train and Test Results for vote Dataset (Modified)

Dataset   Test set (instances)   Correctly classified   Incorrectly classified
vote      135                    127 (94% ± 2%)         8

Table 11.5 and Table 11.6 show the results obtained using 10-fold and N-fold cross-validation for the four datasets. For the vote dataset the 300 instances in the training set are used. For the other three datasets all the available instances are used.
Table 11.5 10-fold Cross-validation Results for Four Datasets

Dataset        Test set (instances)   Correctly classified   Incorrectly classified
vote           300                    275 (92% ± 2%)         25
pima-indians   768                    536 (70% ± 3%)         232
chess          647                    645 (99.7% ± 0.2%)     2
glass          214                    149 (70% ± 3%)         65

Table 11.6 N-fold Cross-validation Results for Four Datasets

Dataset        Test set (instances)   Correctly classified   Incorrectly classified
vote           300                    278 (93% ± 2%)         22
pima-indians   768                    517 (67% ± 2%)         251
chess          647                    646 (99.8% ± 0.2%)     1
glass          214                    144 (67% ± 3%)         70

All the accuracy figures given in this section are estimates. The 10-fold cross-validation and N-fold cross-validation results for all four datasets are based on considerably more instances than those in the corresponding test sets for the 'train and test' experiments and so are more likely to be reliable.

Experimental Results II: Datasets with Missing Values

We now look at experiments to estimate the predictive accuracy of a classifier in the case of datasets with missing values. As before, we will generate all the classifiers using the TDIDT algorithm, with information gain for attribute selection. Three datasets were used in these experiments, all from the UCI Repository. Basic information about each one is given in Table 11.7 below.

Table 11.7 Three Datasets with Missing Values

Dataset    Description                Classes   Attributes: categ   cts   Training set   Test set
crx        Credit Card Applications   2         9                   6     690 (37)       200 (12)
hypo       Hypothyroid Disorders      5         22                  7     2514 (2514)    1258 (371)
labor-ne   Labor Negotiations         2         8                   8     40 (39)        17 (17)

Each dataset has both a training set and a separate test set. In each case, there are missing values in both the training set and the test set. The values in parentheses in the 'training set' and 'test set' columns show the number of instances that have at least one missing value. The 'train and test' method was used for estimating predictive accuracy.

Strategy 1: Discard Instances

This is the simplest strategy: delete all instances where there is at least one missing value and use the remainder. This strategy has the advantage of not introducing any data errors. Its main disadvantage is that discarding data may damage the reliability of the resulting classifier. A second disadvantage is that the method cannot be used when a high proportion of the instances in the training set have missing values, as is the case, for example, with both the hypo and the labor-ne datasets. A final disadvantage is that it is not possible with this strategy to classify any instances in the test set that have missing values. Together these weaknesses are quite substantial. Although the 'discard instances' strategy may be worth trying when the proportion of missing values is small, it is not recommended in general.

Of the three datasets listed in Table 11.7, the 'discard instances' strategy can only be applied to crx. Doing so gives the possibly surprising result in Table 11.8.

Table 11.8 Discard Instances Strategy with crx Dataset

Dataset   MV strategy         Rules   Test set: correct   Test set: incorrect
crx       Discard instances   118     188                 0

Clearly, discarding the 37 instances with at least one missing value from the training set (5.4%) does not prevent the algorithm from constructing a decision tree capable of classifying the 188 instances in the test set that do not have missing values correctly in every case.

Strategy 2: Replace by Most Frequent/Average Value

With this strategy, any missing values of a categorical attribute are replaced by its most commonly occurring value in the training set.
Any missing values of a continuous attribute are replaced by its average value in the training set. Table 11.9 shows the result of applying the 'Most Frequent/Average Value' strategy to the crx dataset. As for the 'Discard Instances' strategy, all instances in the test set are correctly classified, but this time all 200 instances in the test set are classified, not just the 188 instances that do not have missing values.

Table 11.9 Comparison of Strategies with crx Dataset

Dataset   MV strategy                   Rules   Test set: correct   Test set: incorrect
crx       Discard instances             118     188                 0
crx       Most Frequent/Average Value   139     200                 0

With this strategy we can also construct classifiers from the hypo and labor-ne datasets. In the case of the hypo dataset, we get a decision tree with just 15 rules. The average number of terms per rule is 4.8. When applied to the test data this tree is able to classify correctly 1251 of the 1258 instances in the test set (99%; Table 11.10). This is a remarkable result with so few rules, especially as there are missing values in every instance in the training set. It gives considerable credence to the belief that using entropy for constructing a decision tree is an effective approach.

Table 11.10 Most Frequent Value/Average Strategy with hypo Dataset

Dataset   MV strategy                   Rules   Test set: correct   Test set: incorrect
hypo      Most Frequent/Average Value   15      1251                7

In the case of the labor-ne dataset, we obtain a classifier with five rules, which correctly classifies 14 out of the 17 instances in the test set (Table 11.11).

Table 11.11 Most Frequent Value/Average Strategy with labor-ne Dataset

Dataset    MV strategy                   Rules   Test set: correct   Test set: incorrect
labor-ne   Most Frequent/Average Value   5       14                  3

Missing Classifications

It is worth noting that for each dataset given in Table 11.7 the missing values are those of attributes, not classifications. Missing classifications in the training set are a far larger problem than missing attribute values. One possible approach would be to replace them all by the most frequently occurring classification, but this is unlikely to prove successful in most cases. The best approach is probably to discard any instances with missing classifications.

Confusion Matrix

As well as the overall predictive accuracy on unseen instances, it is often helpful to see a breakdown of the classifier's performance, i.e. how frequently instances of class X were correctly classified as class X or misclassified as some other class. This information is given in a confusion matrix. The confusion matrix in Table 11.12 gives the results obtained in 'train and test' mode from the TDIDT algorithm (using information gain for attribute selection) for the vote test set, which has two possible classifications: 'republican' and 'democrat'.

Table 11.12 Example of a Confusion Matrix

Correct classification   Classified as democrat   Classified as republican
democrat                 81 (97.6%)               2 (2.4%)
republican               6 (11.5%)                46 (88.5%)

The body of the table has one row and column for each possible classification. The rows correspond to the correct classifications. The columns correspond to the predicted classifications. The value in the ith row and jth column gives the number of instances for which the correct classification is the ith class and which are classified as belonging to the jth class. If all the instances were correctly classified, the only non-zero entries would be on the 'leading diagonal' running from top left (i.e. row 1, column 1) down to bottom right.
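As a small illustration of the bookkeeping, the Python sketch below builds a two-class confusion matrix from lists of actual and predicted classifications and reads the predictive accuracy off its leading diagonal. The labels and counts are invented for the example, and the function name is illustrative only.

from collections import defaultdict

def confusion_matrix(actual, predicted, classes):
    # Count how often each actual class is predicted as each class.
    counts = {c: defaultdict(int) for c in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return counts

# Invented labels for a small two-class test set
actual    = ["democrat"] * 8 + ["republican"] * 5
predicted = ["democrat"] * 7 + ["republican"] + ["republican"] * 4 + ["democrat"]

matrix = confusion_matrix(actual, predicted, ["democrat", "republican"])
correct = sum(matrix[c][c] for c in matrix)      # entries on the leading diagonal
print(dict(matrix["democrat"]), dict(matrix["republican"]))
print("predictive accuracy:", correct / len(actual))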
To demonstrate that the use of a confusion matrix is not restricted to datasets with two classifications, Table 11.13 shows the results obtained using 10-fold cross-validation with the TDIDT algorithm (using information gain for attribute selection) for the glass dataset, which has six classifications: 1, 2, 3, 5, 6 and 7 (there is also a class 4, but it is not used for the training data).

Table 11.13 Confusion Matrix for glass Dataset

Correct classification   Classified as 1   2    3   5    6   7
1                        52                10   7   0    0   1
2                        15                50   6   2    1   2
3                        5                 6    6   0    0   0
5                        0                 2    0   10   0   1
6                        0                 1    0   0    7   1
7                        1                 3    0   1    0   24

True and False Positives

When a dataset has only two classes, one is often regarded as 'positive' (i.e. the class of principal interest) and the other as 'negative'. In this case the entries in the two rows and columns of the confusion matrix are referred to as true and false positives and true and false negatives (Table 11.14).

Table 11.14 True and False Positives and Negatives

Correct classification   Classified as +    Classified as -
+                        true positives     false negatives
-                        false positives    true negatives

When there are more than two classes, one class is sometimes important enough to be regarded as positive, with all the other classes combined treated as negative. For example, we might consider class 1 for the glass dataset as the 'positive' class and classes 2, 3, 5, 6 and 7 combined as 'negative'. The confusion matrix given as Table 11.13 can then be rewritten as shown in Table 11.15. Of the 73 instances classified as positive, 52 genuinely are positive (true positives) and the other 21 are really negative (false positives). Of the 141 instances classified as negative, 18 are really positive (false negatives) and the other 123 are genuinely negative (true negatives). With a perfect classifier there would be no false positives or false negatives.

Table 11.15 Revised Confusion Matrix for glass Dataset

Correct classification   Classified as +   Classified as -
+                        52                18
-                        21                123

False positives and false negatives may not be of equal importance, e.g. we may be willing to accept some false positives as long as there are no false negatives, or vice versa.

11.4 Evaluating the Accuracy of a Classifier or Predictor

Holdout, random sub-sampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data. The use of such techniques to estimate accuracy increases the overall computation time, yet is useful for model selection.

Figure 11.3 Estimating accuracy with the holdout method

Holdout Method and Random Subsampling

The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set (Figure 11.3). The estimate is pessimistic because only a portion of the initial data is used to derive the model.

Random sub-sampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration. (For prediction, we can take the average of the predictor error rates.)

Cross-validation

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds", D1, D2, ..., Dk, each of approximately equal size.
Training and testing is performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set in order to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on. Unlike the holdout and random subsampling methods above, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.

Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples. That is, only one sample is "left out" at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data. In general, stratified 10-fold cross-validation is recommended for estimating accuracy (even if computation power allows using more folds) due to its relatively low bias and variance.

Bootstrap

Unlike the accuracy estimation methods mentioned above, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.

There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).

"Where does the figure, 63.2%, come from?" Each tuple has a probability of 1/d of being selected, so the probability of not being chosen is (1 - 1/d). We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 - 1/d)^d. If d is large, this probability approaches e^(-1) ≈ 0.368. Thus, 36.8% of tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.

We can repeat the sampling procedure k times, where in each iteration we use the current test set to obtain an accuracy estimate of the model obtained from the current bootstrap sample. The overall accuracy of the model is then estimated as

Acc(M) = Σ_{i=1}^{k} (0.632 × Acc(M_i)_test_set + 0.368 × Acc(M_i)_train_set)    (1)

where Acc(M_i)_test_set is the accuracy of the model obtained with bootstrap sample i when it is applied to test set i, and Acc(M_i)_train_set is the accuracy of the model obtained with bootstrap sample i when it is applied to the original set of data tuples. The bootstrap method works well with small data sets.
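As a rough illustration of the sampling step, the Python sketch below draws one bootstrap sample with replacement, forms the out-of-bag test set from the tuples that were never selected, and reports the fraction of distinct tuples that ended up in the training set (which should hover around 63.2% for reasonably large d). The name bootstrap_split is illustrative only.

import random

def bootstrap_split(data, rng=random):
    # Draw d tuples with replacement; unchosen tuples form the test set.
    d = len(data)
    train_idx = [rng.randrange(d) for _ in range(d)]   # sample d times with replacement
    chosen = set(train_idx)
    train = [data[i] for i in train_idx]
    test = [data[i] for i in range(d) if i not in chosen]   # out-of-bag tuples
    return train, test

data = list(range(1000))                   # stand-in for 1000 data tuples
train, test = bootstrap_split(data)
print("fraction of distinct tuples in training set:",
      len(set(train)) / len(data))         # close to 0.632 on average
print("fraction in test set:", len(test) / len(data))   # close to 0.368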
Figure 11.4 Increasing model accuracy: bagging and boosting each generate a set of classification or prediction models M1, M2, ..., Mk. Voting strategies are used to combine the predictions for a given unknown tuple.

Ensemble Methods: Increasing the Accuracy

Bagging and boosting are two techniques for improving accuracy (Figure 11.4). They are examples of ensemble methods, or methods that use a combination of models. Each combines a series of k learned models (classifiers or predictors), M1, M2, ..., Mk, with the aim of creating an improved composite model, M*. Both bagging and boosting can be used for classification as well as prediction.

Bagging

We first take an intuitive look at how bagging works as a method of increasing accuracy. For ease of explanation, we will assume at first that our model is a classifier. Suppose that you are a patient and would like to have a diagnosis made based on your symptoms. Instead of asking one doctor, you may choose to ask several. If a certain diagnosis occurs more often than any of the others, you may choose this as the final or best diagnosis. That is, the final diagnosis is made based on a majority vote, where each doctor gets an equal vote. Now replace each doctor by a classifier, and you have the basic idea behind bagging. Intuitively, a majority vote made by a large group of doctors may be more reliable than a majority vote made by a small group.

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample, as described in the Bootstrap section above. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized as follows.

Algorithm: Bagging. The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.

Input:
D, a set of d training tuples;
k, the number of models in the ensemble;
a learning scheme (e.g., decision tree algorithm, backpropagation, etc.)

Output: a composite model, M*.

Method:
for i = 1 to k do  // create k models:
    create bootstrap sample, Di, by sampling D with replacement;
    use Di to derive a model, Mi;
end for

To use the composite model on a tuple, X:
if classification then let each of the k models classify X and return the majority vote;
if prediction then let each of the k models predict a value for X and return the average predicted value;

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces the variance of the individual classifiers. For prediction, it was theoretically proven that a bagged predictor will always have improved accuracy over a single predictor derived from D.
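The Python sketch below mirrors the bagging pseudocode above for the classification case: it trains k models on bootstrap samples of D and combines their predictions by majority vote. The base learner is a deliberately trivial 'majority-class' classifier so that the example stays self-contained; in practice it would be replaced by a decision tree or any other learning scheme, and all names here are illustrative.

import random
from collections import Counter

class MajorityClassLearner:
    # Toy base learner: always predicts the most common class it saw in training.
    def fit(self, data):
        self.prediction = Counter(label for _, label in data).most_common(1)[0][0]
        return self
    def predict(self, x):
        return self.prediction

def bagging(D, k, learner_factory, rng=random):
    # Train k models, each on a bootstrap sample of D (sampling with replacement).
    models = []
    d = len(D)
    for _ in range(k):
        Di = [D[rng.randrange(d)] for _ in range(d)]
        models.append(learner_factory().fit(Di))
    return models

def bagged_classify(models, x):
    # Each model gets one vote; return the class with the most votes.
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset of (attribute_value, class_label) tuples
D = [(0.2, "yes"), (0.5, "yes"), (0.7, "no"), (0.9, "no"), (0.4, "yes")]
models = bagging(D, k=5, learner_factory=MajorityClassLearner)
print(bagged_classify(models, 0.6))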
Boosting

We now look at the ensemble method of boosting. As in the previous section, suppose that as a patient, you have certain symptoms. Instead of consulting one doctor, you choose to consult several. Suppose you assign weights to the value or worth of each doctor's diagnosis, based on the accuracies of previous diagnoses they have made. The final diagnosis is then a combination of the weighted diagnoses. This is the essence behind boosting.

In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to "pay more attention" to the training tuples that were misclassified by Mi. The final boosted classifier, M*, combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. The boosting algorithm can be extended for the prediction of continuous values.

Adaboost is a popular boosting algorithm. Suppose we would like to boost the accuracy of some learning method. We are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. Initially, Adaboost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling with replacement is used, so the same tuple may be selected more than once. Each tuple's chance of being selected is based on its weight. A classifier model, Mi, is derived from the training tuples of Di. Its error is then calculated using Di as a test set. The weights of the training tuples are then adjusted according to how they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. A tuple's weight reflects how hard it is to classify: the higher the weight, the more often it has been misclassified. These weights will be used to generate the training samples for the classifier of the next round. The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of the previous round. Some classifiers may be better at classifying some "hard" tuples than others. In this way, we build a series of classifiers that complement each other. The algorithm is summarized as follows.

Algorithm: Adaboost. A boosting algorithm that creates an ensemble of classifiers, each of which gives a weighted vote.

Input:
D, a set of d class-labeled training tuples;
k, the number of rounds (one classifier is generated per round);
a classification learning scheme.

Output: a composite model.
Method:
initialize the weight of each tuple in D to 1/d;
for i = 1 to k do  // for each round:
    sample D with replacement according to the tuple weights to obtain Di;
    use training set Di to derive a model, Mi;
    compute error(Mi), the error rate of Mi (Equation 2);
    if error(Mi) > 0.5 then
        reinitialize the weights to 1/d;
        go back to step 3 and try again;
    end if
    for each tuple in Di that was correctly classified do
        multiply the weight of the tuple by error(Mi)/(1 - error(Mi));  // update weights
    normalize the weight of each tuple;
end for

To use the composite model to classify tuple X:
initialize the weight of each class to 0;
for i = 1 to k do  // for each classifier:
    wi = log((1 - error(Mi))/error(Mi));  // weight of the classifier's vote
    c = Mi(X);  // get class prediction for X from Mi
    add wi to the weight for class c;
end for
return the class with the largest weight;

Now, let's look at some of the math that is involved in the algorithm. To compute the error rate of model Mi, we sum the weights of each of the tuples in Di that Mi misclassified. That is,

error(Mi) = Σ_j w_j × err(X_j)    (2)

where err(X_j) is the misclassification error of tuple X_j: if the tuple was misclassified, then err(X_j) is 1; otherwise, it is 0. If the performance of classifier Mi is so poor that its error exceeds 0.5, then we abandon it. Instead, we try again by generating a new training set Di, from which we derive a new Mi.

The error rate of Mi affects how the weights of the training tuples are updated. If a tuple in round i was correctly classified, its weight is multiplied by error(Mi)/(1 - error(Mi)). Once the weights of all of the correctly classified tuples are updated, the weights for all tuples (including the misclassified ones) are normalized so that their sum remains the same as it was before. To normalize a weight, we multiply it by the sum of the old weights, divided by the sum of the new weights. As a result, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased, as described above.

"Once boosting is complete, how is the ensemble of classifiers used to predict the class label of a tuple, X?" Unlike bagging, where each classifier was assigned an equal vote, boosting assigns a weight to each classifier's vote, based on how well the classifier performed. The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be. The weight of classifier Mi's vote is

log((1 - error(Mi))/error(Mi))    (3)

For each class, c, we sum the weights of each classifier that assigned class c to X. The class with the highest sum is the "winner" and is returned as the class prediction for tuple X.

"How does boosting compare with bagging?" Because of the way boosting focuses on the misclassified tuples, it risks overfitting the resulting composite model to such data. Therefore, sometimes the resulting "boosted" model may be less accurate than a single model derived from the same data. Bagging is less susceptible to model overfitting. While both can significantly improve accuracy in comparison to a single model, boosting tends to achieve greater accuracy.
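The following Python sketch loosely follows the Adaboost pseudocode above: it re-samples the data according to the tuple weights, trains a toy decision-stump base classifier each round, updates the weights of correctly classified tuples by error/(1 - error), normalizes, and finally combines the classifiers with votes weighted by log((1 - error)/error). Two simplifications are assumed and flagged in the comments: the weighted error is measured over the full set D rather than the resampled Di, and a failed round (error above 0.5) is simply skipped rather than retried. All names are illustrative.

import math
import random
from collections import defaultdict

class Stump:
    # Toy base learner: a threshold on a single numeric attribute.
    def fit(self, data):
        best = None
        for t in sorted({x for x, _ in data}):
            for sign in (1, -1):
                err = sum(1 for x, y in data if (sign if x >= t else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, t, sign)
        _, self.t, self.sign = best
        return self
    def predict(self, x):
        return self.sign if x >= self.t else -self.sign

def adaboost(D, k, rng=random):
    d = len(D)
    w = [1.0 / d] * d                              # equal initial tuple weights
    models, alphas = [], []
    for _ in range(k):
        Di = rng.choices(D, weights=w, k=d)        # sample by weight, with replacement
        Mi = Stump().fit(Di)
        err = sum(wj for wj, (x, y) in zip(w, D) if Mi.predict(x) != y)  # Equation (2), over D
        if err > 0.5:
            w = [1.0 / d] * d                      # simplified: reset and skip this round
            continue
        err = max(err, 1e-10)                      # avoid division by zero
        for j, (x, y) in enumerate(D):
            if Mi.predict(x) == y:
                w[j] *= err / (1 - err)            # shrink weights of easy tuples
        total = sum(w)
        w = [wj / total for wj in w]               # normalize
        models.append(Mi)
        alphas.append(math.log((1 - err) / err))   # Equation (3)
    return models, alphas

def classify(models, alphas, x):
    votes = defaultdict(float)
    for Mi, a in zip(models, alphas):
        votes[Mi.predict(x)] += a                  # weighted vote per class
    return max(votes, key=votes.get)

D = [(0.1, -1), (0.3, -1), (0.45, 1), (0.6, 1), (0.8, 1), (0.9, -1)]
models, alphas = adaboost(D, k=5)
print(classify(models, alphas, 0.5))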
Model Selection

Suppose that we have generated two models, M1 and M2 (for either classification or prediction), from our data. We have performed 10-fold cross-validation to obtain a mean error rate for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases. There can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant. What if any difference between the two may just be attributed to chance? This section addresses these questions.

Estimating Confidence Intervals

To determine if there is any "real" difference in the mean error rates of two models, we need to employ a test of statistical significance. In addition, we would like to obtain some confidence limits for our mean error rates so that we can make statements like "any observed mean will not vary by plus or minus two standard errors 95% of the time for future samples" or "one model is better than the other by a margin of error of plus or minus 4%."

What do we need in order to perform the statistical test? Suppose that for each model, we did 10-fold cross-validation, say, 10 times, each time using a different 10-fold partitioning of the data. Each partitioning is independently drawn. We can average the 10 error rates obtained each for M1 and M2, respectively, to obtain the mean error rate for each model. For a given model, the individual error rates calculated in the cross-validations may be considered as different, independent samples from a probability distribution. In general, they follow a t distribution with k - 1 degrees of freedom where, here, k = 10. (This distribution looks very similar to a normal, or Gaussian, distribution even though the functions defining the two are quite different. Both are unimodal, symmetric, and bell-shaped.) This allows us to do hypothesis testing where the significance test used is the t-test, or Student's t-test. Our hypothesis is that the two models are the same, or in other words, that the difference in mean error rate between the two is zero. If we can reject this hypothesis (referred to as the null hypothesis), then we can conclude that the difference between the two models is statistically significant, in which case we can select the model with the lower error rate.

In data mining practice, we may often employ a single test set, that is, the same test set can be used for both M1 and M2. In such cases, we do a pairwise comparison of the two models for each 10-fold cross-validation round. That is, for the ith round of 10-fold cross-validation, the same cross-validation partitioning is used to obtain an error rate for M1 and an error rate for M2. Let err(M1)_i (or err(M2)_i) be the error rate of model M1 (or M2) on round i. The error rates for M1 are averaged to obtain a mean error rate for M1, denoted err̄(M1). Similarly, we can obtain err̄(M2). The variance of the difference between the two models is denoted var(M1 - M2). The t-test computes the t-statistic with k - 1 degrees of freedom for k samples. In our example we have k = 10 since, here, the k samples are our error rates obtained from ten 10-fold cross-validations for each model. The t-statistic for pairwise comparison is computed as follows:

t = (err̄(M1) - err̄(M2)) / sqrt(var(M1 - M2)/k)    (4)

where

var(M1 - M2) = (1/k) Σ_{i=1}^{k} [err(M1)_i - err(M2)_i - (err̄(M1) - err̄(M2))]^2    (5)

To determine whether M1 and M2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used. We then consult a table for the t distribution, available in standard textbooks on statistics. This table is usually arranged with degrees of freedom as rows and significance levels as columns. Suppose we want to ascertain whether the difference between M1 and M2 is significantly different for 95% of the population, that is, sig = 5% or 0.05. We need to find the t distribution value corresponding to k - 1 degrees of freedom (or 9 degrees of freedom for our example) from the table. However, because the t distribution is symmetric, typically only the upper percentage points of the distribution are shown. Therefore, we look up the table value for z = sig/2, which in this case is 0.025, where z is also referred to as a confidence limit. If t > z or t < -z, then our value of t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M1 and M2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we then conclude that any difference between M1 and M2 can be attributed to chance.
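A minimal Python sketch of the paired comparison in Equations (4) and (5) follows, assuming we already have ten per-round error rates for each model. The error-rate lists are made up for illustration, and the critical value 2.262 is the two-tailed t value for 9 degrees of freedom at sig = 0.05, quoted here only to show how the decision is made; in practice it would be looked up in a t-distribution table.

import math

def paired_t_statistic(err1, err2):
    # t-statistic for a pairwise comparison of two models (Equations 4 and 5).
    k = len(err1)
    diffs = [a - b for a, b in zip(err1, err2)]
    mean_diff = sum(diffs) / k
    var = sum((d - mean_diff) ** 2 for d in diffs) / k        # Equation (5)
    return mean_diff / math.sqrt(var / k)                     # Equation (4)

# Illustrative error rates from ten 10-fold cross-validation rounds
err_M1 = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.18, 0.22, 0.21, 0.20]
err_M2 = [0.24, 0.25, 0.22, 0.26, 0.24, 0.23, 0.22, 0.25, 0.24, 0.23]

t = paired_t_statistic(err_M1, err_M2)
critical = 2.262   # t distribution, 9 degrees of freedom, two-tailed, sig = 0.05
if abs(t) > critical:
    print("t = %.2f: reject the null hypothesis; the difference is significant" % t)
else:
    print("t = %.2f: the difference may be attributed to chance" % t)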
If two test sets are available instead of a single test set, then a nonpaired version of the t-test is used, where the variance between the means of the two models is estimated as

var(M1 - M2) = var(M1)/k1 + var(M2)/k2    (6)

and k1 and k2 are the number of cross-validation samples (in our case, 10-fold cross-validation rounds) used for M1 and M2, respectively. When consulting the table of the t distribution, the number of degrees of freedom used is taken as the minimum number of degrees of freedom of the two models.

ROC Curves

ROC curves are a useful visual tool for comparing two classification models. The name ROC stands for Receiver Operating Characteristic. ROC curves come from signal detection theory, which was developed during World War II for the analysis of radar images. An ROC curve shows the trade-off between the true positive rate or sensitivity (the proportion of positive tuples that are correctly identified) and the false positive rate (the proportion of negative tuples that are incorrectly identified as positive) for a given model. That is, given a two-class problem, it allows us to visualize the trade-off between the rate at which the model can accurately recognize 'yes' cases versus the rate at which it mistakenly identifies 'no' cases as 'yes' for different "portions" of the test set. Any increase in the true positive rate occurs at the cost of an increase in the false positive rate. The area under the ROC curve is a measure of the accuracy of the model.

In order to plot an ROC curve for a given classification model, M, the model must be able to return a probability or ranking for the predicted class of each test tuple. That is, we need to rank the test tuples in decreasing order, where the one the classifier thinks is most likely to belong to the positive or 'yes' class appears at the top of the list. Naive Bayesian and backpropagation classifiers are appropriate for this, whereas others, such as decision tree classifiers, can easily be modified so as to return a class probability distribution for each prediction. The vertical axis of an ROC curve represents the true positive rate. The horizontal axis represents the false positive rate. An ROC curve for M is plotted as follows. Starting at the bottom left-hand corner (where the true positive rate and false positive rate are both 0), we check the actual class label of the tuple at the top of the list.
If we have a true positive (that is, a positive tuple that was correctly classified), then on the ROC curve we move up and plot a point. If, instead, the tuple really belongs to the 'no' class, we have a false positive. On the ROC curve, we move right and plot a point. This process is repeated for each of the test tuples, each time moving up on the curve for a true positive or toward the right for a false positive.

Figure 11.5 The ROC curves of two classification models

Figure 11.5 shows the ROC curves of two classification models. The plot also shows a diagonal line, where for every true positive of such a model we are just as likely to encounter a false positive. Thus, the closer the ROC curve of a model is to the diagonal line, the less accurate the model. If the model is really good, initially we are more likely to encounter true positives as we move down the ranked list. Thus, the curve would move steeply up from zero. Later, as we start to encounter fewer and fewer true positives, and more and more false positives, the curve eases off and becomes more horizontal.

To assess the accuracy of a model, we can measure the area under the curve. Several software packages are able to perform such a calculation. The closer the area is to 0.5, the less accurate the corresponding model is. A model with perfect accuracy will have an area of 1.0.

11.5 Multiclass Problem

In machine learning, multiclass or multinomial classification is the problem of classifying instances into more than two classes. While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies. Multiclass classification should not be confused with multi-label classification, where multiple classes are to be predicted for each problem instance.

General strategies

One-vs.-all

Among these strategies is the one-vs.-all (or one-vs.-rest, OvA or OvR) strategy, where a single classifier is trained per class to distinguish that class from all other classes. Prediction is then performed by applying each binary classifier and choosing the prediction with the highest confidence score (e.g., the highest probability from a classifier such as naive Bayes). In pseudocode, the training algorithm for an OvA learner constructed from a binary classification learner L is as follows:

Inputs:
L, a learner (training algorithm for binary classifiers)
samples X
labels y, where y_i ∈ {1, ..., K} is the label for the sample X_i

Output: a list of classifiers f_k for k ∈ {1, ..., K}

Procedure:
for each k in {1, ..., K}:
    construct a new label vector y' with y_i' = 1 where y_i = k, and 0 (or -1) elsewhere;
    apply L to X, y' to obtain f_k;
end for

Making decisions proceeds by applying all classifiers to an unseen sample x and predicting the label k for which the corresponding classifier reports the highest confidence score:

ŷ = argmax_{k ∈ {1, ..., K}} f_k(x)
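As a concrete reading of the pseudocode, the Python sketch below wraps a binary learner that exposes a fit method and a confidence score, trains one binary classifier per class by relabelling the data as "k vs. rest", and predicts by taking the arg max of the K confidence scores. The tiny nearest-centroid base learner and all names are invented for illustration only.

class CentroidScorer:
    # Toy binary learner: score = how much closer x is to the positive centroid.
    def fit(self, X, y01):
        pos = [x for x, t in zip(X, y01) if t == 1]
        neg = [x for x, t in zip(X, y01) if t == 0]
        self.pos_c = sum(pos) / len(pos)
        self.neg_c = sum(neg) / len(neg)
        return self
    def score(self, x):
        return abs(x - self.neg_c) - abs(x - self.pos_c)   # higher means more positive

def train_one_vs_all(X, y, classes, learner_factory):
    # Train one binary classifier f_k per class k (one-vs.-rest relabelling).
    classifiers = {}
    for k in classes:
        y_prime = [1 if label == k else 0 for label in y]
        classifiers[k] = learner_factory().fit(X, y_prime)
    return classifiers

def predict_one_vs_all(classifiers, x):
    # Return the class whose classifier reports the highest confidence score.
    return max(classifiers, key=lambda k: classifiers[k].score(x))

# One-dimensional toy data with three classes
X = [0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 2.0, 2.1, 2.2]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
clfs = train_one_vs_all(X, y, classes=["a", "b", "c"], learner_factory=CentroidScorer)
print(predict_one_vs_all(clfs, 1.05))   # expected: "b"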
11.6 Summary

We discussed basic prediction theory and its impact on the evaluation of classification success, its implications for learning algorithm design, and its uses in learning algorithm execution. There are several important aspects of learning on which the theory here casts light. Perhaps the most important of these is the problem of performance reporting for classifiers. Many people use some form of empirical variance to estimate upper and lower bounds. Databases are rich with hidden information that can be used for making intelligent business decisions.

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Whereas classification predicts categorical labels, prediction models continuous-valued functions. For example, a classification model may be built to predict the expenditures of potential customers on computer equipment given their income and occupation. Many classification and prediction methods have been proposed by researchers in machine learning, expert systems, statistics, and neurobiology. Most algorithms are memory resident, typically assuming a small data size. Recent database mining research has built on such work, developing scalable classification and prediction techniques capable of handling large disk-resident data. These techniques often consider parallel and distributed processing.

One of the fundamental problems in data mining classification is that of class imbalance. In the typical binary class imbalance problem, one class (the negative class) vastly outnumbers the other (the positive class). The difficulty of learning under such conditions lies in the induction bias of most learning algorithms. That is, most learning algorithms, when presented with a dataset in which there is a severely underrepresented class, ignore the minority class. This is due to the fact that one can achieve very high accuracy by always predicting the majority class, especially if the majority class represents 95% or more of the dataset.

11.7 Key words

Classification, Multiclass problem, Predictive accuracy, Adaboost

11.8 Exercises

1. Explain the multiclass classification problem.
2. Explain how the predictive accuracy of classification methods is estimated.
3. How are ROC curves useful in comparing classification models?
4. Explain the Adaboost algorithm.
5. Explain the holdout method and the random sub-sampling concept.
6. Explain k-fold cross-validation with an example.
7. Explain N-fold cross-validation with an example.
8. Write short notes on the confusion matrix.

11.9 References

1. Quinlan, J.R. "Discovering Rules by Induction from Large Collections of Examples". In Michie, D. (ed.), Expert Systems in the Microelectronic Age. Edinburgh University Press, pp. 168-201. 1979.
2. Han, Jiawei, Kamber, Micheline and Pei, Jian. "Data Mining: Concepts and Techniques". Morgan Kaufmann. 2006.

UNIT 12: ALGORITHMS FOR DATA CLUSTERING

Structure
12.1 Objectives
12.2 Introduction
12.3 Overview of cluster analysis
12.4 Distance measures
12.5 Different algorithms for data clustering
12.6 Partitional methods
12.7 Hierarchical methods
12.8 Summary
12.9 Keywords
12.10 Exercises
12.11 References

12.1 Objectives

The objectives covered under this unit include:
Overview of cluster analysis
Types of data and computing distance
Different algorithms for data clustering
Partitional methods and hierarchical methods

12.2 Introduction

Cluster analysis is a statistical technique used to identify how various units (people, groups, or societies) can be grouped together because of characteristics they have in common. It is an exploratory data analysis tool that aims to sort different objects into groups in such a way that when they belong to the same group they have a maximal degree of association and when they do not belong to the same group their degree of association is minimal. Cluster analysis is typically used in the exploratory phase of research when the researcher does not have any pre-conceived hypotheses.
It is commonly not the only statistical method used, but rather is done toward the beginning phases of a project to help guide the rest of the analysis. For this reason, significance testing is usually neither relevant nor appropriate.

12.3 Overview of Cluster Analysis

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation or interpretation. In other words, cluster analysis simply discovers structures in data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores items of similar nature, such as different types of meat or vegetables, are displayed in the same or nearby locations.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify data pre-processing and model parameters until the result achieves the desired properties.

12.4 Distance Measures

The joining or tree clustering method uses the dissimilarities or distances between objects when forming the clusters. These distances can be based on a single dimension or multiple dimensions. For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances.
If we had a two- or three-dimensional space this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler). However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances or some other derived measure of distance that is more meaningful to the researcher; it is up to the researcher to select the right method for his or her specific application.

Euclidean distance. This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:

distance(x, y) = { Σ_i (x_i - y_i)^2 }^(1/2)

Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected, and consequently, the results of cluster analyses may be very different.

Squared Euclidean distance. One may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as (see also the note in the previous paragraph):

distance(x, y) = Σ_i (x_i - y_i)^2

City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

distance(x, y) = Σ_i |x_i - y_i|

Chebychev distance. This distance measure may be appropriate in cases when one wants to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:

distance(x, y) = max_i |x_i - y_i|

Power distance. Sometimes one may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance. The power distance is computed as:

distance(x, y) = ( Σ_i |x_i - y_i|^p )^(1/r)

where r and p are user-defined parameters. A few example calculations may demonstrate how this measure "behaves". Parameter p controls the progressive weight that is placed on differences on individual dimensions; parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are equal to 2, then this distance is equal to the Euclidean distance.

Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:

distance(x, y) = (number of x_i ≠ y_i) / (number of dimensions i)

Amalgamation or Linkage Rules

At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters?
In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other; this method is called complete linkage. There are numerous other linkage rules such as these that have been proposed.

Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains".

Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps". If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.

Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.

Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.

Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one.
Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.

Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details concerning this method. In general, this method is regarded as very efficient; however, it tends to create clusters of small size.
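Before moving on to specific algorithms, here is a short Python sketch of a few of the distance measures and linkage rules described above: Euclidean, Manhattan and Chebychev distances between individual objects, and single, complete and average (UPGMA) linkage between clusters. The function names and the two small clusters are illustrative only.

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def single_linkage(c1, c2, dist=euclidean):
    # Distance between the two closest objects (nearest neighbors).
    return min(dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, dist=euclidean):
    # Distance between the two furthest objects (furthest neighbors).
    return max(dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, dist=euclidean):
    # Average distance over all pairs of objects (UPGMA).
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

cluster_a = [(1.0, 1.0), (1.5, 1.2)]
cluster_b = [(4.0, 4.2), (5.0, 5.5)]
print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))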
12.5 Different Algorithms for Data Clustering

Affinity propagation

In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm. Like k-medoids, AP finds "exemplars", members of the input set that are representative of clusters.

BIRCH (data clustering)

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multidimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database. In addition, BIRCH is recognized as the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively".

Canopy clustering algorithm

The canopy clustering algorithm is an unsupervised pre-clustering algorithm, often used as a pre-processing step for the K-means algorithm or a hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set. The algorithm proceeds as follows:
1. Cheaply partition the data into overlapping subsets (called "canopies").
2. Perform more expensive clustering, but only within these canopies.
Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability for high-dimensional data is limited by the curse of dimensionality. The produced canopies will preserve the clusters produced by K-means only when a cheap and approximate (low-dimensional) distance function is available.
Benefits:
The number of instances of training data that must be compared at each step is reduced.
There is some evidence that the resulting clusters are improved.

Cobweb (clustering)

COBWEB is an incremental system for hierarchical conceptual clustering. COBWEB was invented by Professor Douglas H. Fisher, currently at Vanderbilt University. COBWEB incrementally organizes observations into a classification tree. Each node in a classification tree represents a class (concept) and is labeled by a probabilistic concept that summarizes the attribute-value distributions of objects classified under the node. This classification tree can be used to predict missing attributes or the class of a new object. There are four basic operations COBWEB employs in building the classification tree. Which operation is selected depends on the category utility of the classification achieved by applying it. The operations are:

Merging two nodes: merging two nodes means replacing them by a node whose children are the union of the original nodes' sets of children and which summarizes the attribute-value distributions of all objects classified under them.
Splitting a node: a node is split by replacing it with its children.
Inserting a new node: a node is created corresponding to the object being inserted into the tree.
Passing an object down the hierarchy: effectively calling the COBWEB algorithm on the object and the sub-tree rooted in the node.

The COBWEB Algorithm

Algorithm COBWEB(root, record):
Input: a COBWEB node root, an instance to insert record
if root has no children then
    children := {copy(root)}
    newcategory(record)    \\ adds a child with record's feature values
    insert(record, root)   \\ update root's statistics
else
    insert(record, root)
    for child in root's children do
        calculate Category Utility for insert(record, child),
        set best1, best2 to the children with the best CU
    end for
    if newcategory(record) yields the best CU then
        newcategory(record)
    else if merge(best1, best2) yields the best CU then
        merge(best1, best2)
        COBWEB(root, record)
    else if split(best1) yields the best CU then
        split(best1)
        COBWEB(root, record)
    else
        COBWEB(best1, record)
    end if
end

12.6 Partitioning Methods

Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods typically require that the number of clusters be pre-set by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration process of all possible partitions is required. Because this is not feasible, certain greedy heuristics are used in the form of iterative optimization. Namely, a relocation method iteratively relocates points between the k clusters. The following subsections present various types of partitioning methods.

Error Minimization Algorithms. These algorithms, which tend to work well with isolated and compact clusters, are the most intuitive and frequently used methods. The basic idea is to find a clustering structure that minimizes a certain error criterion which measures the "distance" of each instance to its representative value. The most well-known criterion is the Sum of Squared Error (SSE), which measures the total squared Euclidean distance of instances to their representative values. SSE may be globally optimized by exhaustively enumerating all partitions, which is very time-consuming, or by giving an approximate solution (not necessarily leading to a global minimum) using heuristics. The latter option is the most common alternative.

The simplest and most commonly used algorithm employing a squared error criterion is the K-means algorithm. This algorithm partitions the data into K clusters (C1, C2, ..., CK), represented by their centers or means. The center of each cluster is calculated as the mean of all the instances belonging to that cluster. The K-means algorithm starts with an initial set of cluster centers, chosen at random or according to some heuristic procedure. In each iteration, each instance is assigned to its nearest cluster center according to the Euclidean distance between the two. Then the cluster centers are re-calculated.
The center of each cluster is calculated as the mean of all the instances belonging to that cluster:

μ_k = (1/N_k) Σ_{x ∈ C_k} x

where N_k is the number of instances belonging to cluster k and μ_k is the mean of cluster k.

A number of convergence conditions are possible. For example, the search may stop when the partitioning error is not reduced by the relocation of the centers. This indicates that the present partition is locally optimal. Other stopping criteria can also be used, such as exceeding a pre-defined number of iterations.

Input: S (instance set), K (number of clusters)
Output: clusters
1: Initialize K cluster centers.
2: while termination condition is not satisfied do
3:     Assign instances to the closest cluster center.
4:     Update cluster centers based on the assignment.
5: end while

The K-means algorithm may be viewed as a gradient-descent procedure, which begins with an initial set of K cluster centers and iteratively updates it so as to decrease the error function. A rigorous proof of the finite convergence of the K-means type algorithms is given in (Selim and Ismail, 1984). The complexity of T iterations of the K-means algorithm performed on a sample size of m instances, each characterized by N attributes, is O(T × K × m × N). This linear complexity is one of the reasons for the popularity of the K-means algorithm. Even if the number of instances is substantially large (which often is the case nowadays), this algorithm is computationally attractive. Thus, the K-means algorithm has an advantage in comparison to other clustering methods (e.g. hierarchical clustering methods), which have non-linear complexity. Other reasons for the algorithm's popularity are its ease of interpretation, simplicity of implementation, speed of convergence and adaptability to sparse data.

The Achilles heel of the K-means algorithm involves the selection of the initial partition. The algorithm is very sensitive to this selection, which may make the difference between global and local minimum. Being a typical partitioning algorithm, the K-means algorithm works well only on data sets having isotropic clusters, and is not as versatile as single link algorithms, for instance. In addition, this algorithm is sensitive to noisy data and outliers (a single outlier can increase the squared error dramatically); it is applicable only when the mean is defined (namely, for numeric attributes); and it requires the number of clusters in advance, which is not trivial when no prior knowledge is available.
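The Python sketch below is a minimal implementation of the K-means procedure just described: initialize K centers, assign each instance to its nearest center by Euclidean distance, recompute each center as the mean of its assigned instances, and stop when the assignment no longer changes or a maximum number of iterations is reached. All names are illustrative.

import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(S, K, max_iter=100, rng=random):
    centers = rng.sample(S, K)                     # step 1: initialize K cluster centers
    assignment = None
    for _ in range(max_iter):                      # step 2: repeat until termination
        new_assignment = [min(range(K), key=lambda k: euclidean(x, centers[k]))
                          for x in S]              # step 3: assign to closest center
        if new_assignment == assignment:           # stop when no instance moves
            break
        assignment = new_assignment
        for k in range(K):                         # step 4: update cluster centers
            members = [x for x, a in zip(S, assignment) if a == k]
            if members:
                centers[k] = mean(members)
    clusters = [[x for x, a in zip(S, assignment) if a == k] for k in range(K)]
    return clusters, centers

S = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
clusters, centers = k_means(S, K=2)
print(centers)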
Another partitioning algorithm, which attempts to minimize the SSE, is the K-medoids or PAM algorithm (Partitioning Around Medoids; Kaufmann and Rousseeuw, 1987). This algorithm is very similar to the K-means algorithm. It differs from the latter mainly in its representation of the different clusters: each cluster is represented by the most centric object in the cluster, rather than by the implicit mean, which may not belong to the cluster. The K-medoids method is more robust than the K-means algorithm in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the K-means method. Both methods require the user to specify K, the number of clusters.

Other error criteria can be used instead of the SSE. Estivill-Castro (2000) analyzed the total absolute error criterion; namely, instead of summing up the squared error, he suggests summing up the absolute error. While this criterion is superior in regard to robustness, it requires more computational effort.

Graph-Theoretic Clustering. Graph-theoretic methods produce clusters via graphs. The edges of the graph connect the instances, which are represented as nodes. A well-known graph-theoretic algorithm is based on the Minimal Spanning Tree (MST) (Zahn, 1971). Inconsistent edges are edges whose weight (in the case of clustering, length) is significantly larger than the average of nearby edge lengths. Another graph-theoretic approach constructs graphs based on limited neighborhood sets.

There is also a relation between hierarchical methods and graph-theoretic clustering: Single-link clusters are subgraphs of the MST of the data instances. Each subgraph is a connected component, namely a set of instances in which each instance is connected to at least one other member of the set, so that the set is maximal with respect to this property. These subgraphs are formed according to some similarity threshold. Complete-link clusters are maximal complete subgraphs, formed using a similarity threshold. A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property.

12.7 Hierarchical Clustering Methods

In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n^3), which makes it too slow for large data sets. Divisive clustering with an exhaustive search is O(2^n), which is even worse. However, for some special cases, optimal efficient agglomerative methods of complexity O(n^2) are known: SLINK for single-linkage and CLINK for complete-linkage clustering.

Cluster dissimilarity

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by the use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Metric

The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.
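The metrics listed and exemplified in the passage that follows can each be computed in a single line. A hedged NumPy sketch; the vectors a and b and the covariance matrix S are purely illustrative.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])
S = np.cov(np.random.randn(100, 3), rowvar=False)   # example covariance matrix for Mahalanobis

euclidean         = np.sqrt(np.sum((a - b) ** 2))                       # ||a - b||_2
squared_euclidean = np.sum((a - b) ** 2)                                # ||a - b||_2^2
manhattan         = np.sum(np.abs(a - b))                               # ||a - b||_1
maximum           = np.max(np.abs(a - b))                               # ||a - b||_inf
mahalanobis       = np.sqrt((a - b) @ np.linalg.inv(S) @ (a - b))       # uses covariance matrix S
cosine_similarity = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))   # similarity, not a distance
```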
For example, in a 2-dimensional space, the distance between the point (1,0) and the origin (0,0) is always 1 according to the usual norms, but the distance between the point (1,1) and the origin (0,0) is 2, sqrt(2) or 1 under the Manhattan distance, the Euclidean distance or the maximum distance respectively. Some commonly used metrics for hierarchical clustering are:

Euclidean distance: ||a - b||_2 = sqrt(Σ_i (a_i - b_i)^2)
Squared Euclidean distance: ||a - b||_2^2 = Σ_i (a_i - b_i)^2
Manhattan distance: ||a - b||_1 = Σ_i |a_i - b_i|
Maximum distance: ||a - b||_inf = max_i |a_i - b_i|
Mahalanobis distance: sqrt((a - b)^T S^{-1} (a - b)), where S is the covariance matrix
Cosine similarity: (a · b) / (||a|| ||b||)

For text or other non-numeric data, metrics such as the Hamming distance or the Levenshtein distance are often used. A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

Linkage criteria

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations A and B are:

Maximum or complete-linkage clustering: max { d(a, b) : a ∈ A, b ∈ B }
Minimum or single-linkage clustering: min { d(a, b) : a ∈ A, b ∈ B }
Mean or average linkage clustering (UPGMA): (1 / (|A| |B|)) Σ over a ∈ A, b ∈ B of d(a, b)
Minimum energy clustering: based on the statistical energy distance between the two sets

where d is the chosen metric. Other linkage criteria include:

The sum of all intra-cluster variance.
The decrease in variance for the cluster being merged (Ward's criterion).
The probability that candidate clusters spawn from the same distribution function (V-linkage).
The product of in-degree and out-degree on a k-nearest-neighbor graph (graph degree linkage).
The increment of some cluster descriptor (i.e., a quantity defined for measuring the quality of a cluster) after merging two clusters.

Example of Agglomerative Clustering

For example, suppose six elements {a}, {b}, {c}, {d}, {e} and {f} are to be clustered, and the Euclidean distance is the distance metric. The raw data points and the resulting dendrogram are shown in the accompanying figures. Cutting the dendrogram at a given height gives a partitioning at a selected precision. In this example, cutting after the second row of the dendrogram yields the clusters {a} {b c} {d e} {f}; cutting after the third row yields the clusters {a} {b c} {d e f}, which is a coarser clustering with a smaller number of larger clusters.

This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.

Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row and j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and it has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the literature on single-linkage clustering; it can easily be adapted to different types of linkage (see below).
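Such a simple agglomerative procedure, parameterized by the linkage used to compare clusters, can be sketched as follows. This is a hedged illustration rather than an efficient implementation: it recomputes pairwise distances at every merge instead of caching a distance matrix.

```python
import numpy as np
from itertools import combinations

LINKAGES = {
    "single":   lambda d: d.min(),    # minimum pairwise distance between the two clusters
    "complete": lambda d: d.max(),    # maximum pairwise distance
    "average":  lambda d: d.mean(),   # mean pairwise distance (UPGMA)
}

def agglomerative(X, n_clusters, linkage="single"):
    """Naive agglomerative clustering: start with singleton clusters and
    repeatedly merge the pair with the smallest linkage distance."""
    link = LINKAGES[linkage]
    clusters = [[i] for i in range(len(X))]          # each point starts in its own cluster
    while len(clusters) > n_clusters:
        best = None
        for (i, a), (j, b) in combinations(enumerate(clusters), 2):
            # all pairwise Euclidean distances between members of cluster a and cluster b
            d = np.linalg.norm(X[a][:, None, :] - X[b][None, :, :], axis=2)
            score = link(d)
            if best is None or score < best[0]:
                best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]      # merge the closest pair of clusters
        del clusters[j]
    return clusters

X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 5.0])
print(agglomerative(X, n_clusters=2, linkage="average"))
```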
Suppose we have merged the two closest elements b and c; we now have the clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need the distance between {a} and {b c}, and therefore need to define the distance between two clusters. Usually the distance between two clusters A and B is one of the following:

The maximum distance between elements of each cluster (also called complete-linkage clustering): max { d(a, b) : a ∈ A, b ∈ B }
The minimum distance between elements of each cluster (also called single-linkage clustering): min { d(a, b) : a ∈ A, b ∈ B }
The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): (1 / (|A| |B|)) Σ over a ∈ A, b ∈ B of d(a, b)
The sum of all intra-cluster variance.
The increase in variance for the cluster being merged (Ward's method).
The probability that candidate clusters spawn from the same distribution function (V-linkage).

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

k-means clustering

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, in that both use an iterative refinement approach. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

Description

Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k <= n), S = {S1, S2, ..., Sk}, so as to minimize the within-cluster sum of squares (WCSS):

argmin over S of Σ_{i=1..k} Σ over x in Si of ||x - μi||^2

where μi is the mean of the points in Si.

Algorithms

Standard algorithm

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community. Given an initial set of k means m1(1), ..., mk(1) (see below), the algorithm proceeds by alternating between two steps:

Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. (Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means.) Each observation is assigned to exactly one cluster, even if it could be assigned to two or more of them.

Update step: Calculate the new means to be the centroids of the observations in the new clusters. Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.

The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective, and there exists only a finite number of such partitionings, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm.
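The two common ways of choosing the initial means, described under "Initialization methods" below, can be sketched as follows. This is a hedged illustration; the function names are the editor's, and the Random Partition variant assumes no randomly formed group is empty.

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: pick k observations at random and use them as the initial means."""
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    """Random Partition: randomly assign every observation to one of k clusters,
    then use each group's centroid as the initial mean."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
X = np.random.randn(200, 2)
print(forgy_init(X, 3, rng))             # initial means tend to be spread out over the data
print(random_partition_init(X, 3, rng))  # initial means tend to lie near the overall centroid
```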
The algorithm is often presented as assigning objects to the nearest cluster by distance. This is slightly inaccurate: the algorithm aims at minimizing the WCSS objective, and thus assigns by "least sum of squares". Using a distance function other than (squared) Euclidean distance may stop the algorithm from converging. It is correct that the smallest Euclidean distance yields the smallest squared Euclidean distance and thus also the smallest sum of squares. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow the use of other distance measures.

Initialization methods

Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean of each cluster to be the centroid of its randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al., the Random Partition method is generally preferable for algorithms such as k-harmonic means and fuzzy k-means. For expectation maximization and the standard k-means algorithm, the Forgy method of initialization is preferable.

Demonstration of the standard algorithm

1) k initial "means" (in this case k = 3) are randomly generated within the data domain.
2) k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.

As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge: in particular, it has been shown that there exist certain point sets, even in 2 dimensions, on which k-means takes exponential time, that is 2^Omega(n), to converge. These point sets do not seem to arise in practice, which is corroborated by the fact that the smoothed running time of k-means is polynomial.

The "assignment" step is also referred to as the expectation step and the "update" step as the maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.

Variations

k-medians clustering uses the median in each dimension instead of the mean, and this way minimizes the L1 norm (Taxicab geometry).
k-medoids (also: Partitioning Around Medoids, PAM) uses the medoid instead of the mean, and this way minimizes the sum of distances for arbitrary distance functions.
Fuzzy C-Means Clustering is a soft version of k-means, where each data point has a fuzzy degree of belonging to each cluster.
Gaussian mixture models trained with the expectation-maximization algorithm (EM algorithm) maintain probabilistic assignments to clusters, instead of deterministic assignments, and multivariate Gaussian distributions instead of means.
Several methods have been proposed to choose better starting clusters. One recent proposal is k-means++.
The filtering algorithm uses kd-trees to speed up each k-means step.
Some methods attempt to speed up each k-means step using coresets or the triangle inequality, or to escape local optima by swapping points between clusters. The spherical k-means clustering algorithm is suitable for directional data. The Minkowski metric weighted k-means deals with irrelevant features by assigning cluster-specific weights to each feature.

The two key features of k-means which make it efficient are often regarded as its biggest drawbacks: Euclidean distance is used as the metric and variance is used as the measure of cluster scatter. The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set. Convergence to a local minimum may also produce counterintuitive ("wrong") results.

A key limitation of k-means is its cluster model. The concept is based on spherical clusters that are separable in such a way that the mean value converges towards the cluster center. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment. When, for example, k-means with a value of k = 3 is applied to the well-known Iris flower data set, the result often fails to separate the three Iris species contained in the data set. With k = 2, the two visible clusters (one containing two species) will be discovered, whereas with k = 3 one of the two visible clusters will be split into two even parts. In fact, k = 2 is more appropriate for this data set, despite the data set containing 3 classes. As with any other clustering algorithm, the k-means result relies on the data set satisfying the assumptions made by the clustering algorithm. It works well on some data sets, while failing on others.

Cluster analysis

In cluster analysis, the k-means algorithm can be used to partition the input data set into k partitions (clusters). However, the pure k-means algorithm is not very flexible, and as such is of limited use (except when vector quantization, as above, is actually the desired use case). In particular, the parameter k is known to be hard to choose (as discussed above) when not given by external constraints. In contrast to other algorithms, k-means also cannot be used with arbitrary distance functions or on non-numerical data. For these use cases, many other algorithms have been developed.

Feature learning

k-means clustering has been used as a feature learning (or dictionary learning) step, which can be used in (semi-)supervised learning or unsupervised learning. The basic approach is first to train a k-means clustering representation using the input training data (which need not be labelled). Then, to project any input datum into the new feature space, we have a choice of "encoding" functions: for example the thresholded matrix product of the datum with the centroid locations, the distance from the datum to each centroid, simply an indicator function for the nearest centroid, or some smooth transformation of the distance. Alternatively, by transforming the sample-cluster distance through a Gaussian RBF, one effectively obtains the hidden layer of a radial basis function network.
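The "encoding" functions mentioned above can each be written in a line or two. A hedged sketch follows; the centroid array is assumed to come from a previously trained k-means run, and the function names are illustrative rather than part of any standard API.

```python
import numpy as np

def encode_distances(x, centroids):
    """Represent a datum by its distance to each learned centroid."""
    return np.linalg.norm(centroids - x, axis=1)

def encode_one_hot_nearest(x, centroids):
    """Indicator (one-hot) encoding of the nearest centroid."""
    code = np.zeros(len(centroids))
    code[np.argmin(np.linalg.norm(centroids - x, axis=1))] = 1.0
    return code

def encode_rbf(x, centroids, gamma=1.0):
    """Gaussian RBF transform of the sample-centroid distances
    (effectively the hidden layer of a radial basis function network)."""
    d = np.linalg.norm(centroids - x, axis=1)
    return np.exp(-gamma * d ** 2)

centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # assumed output of a k-means run
x = np.array([1.0, 1.0])
print(encode_distances(x, centroids), encode_one_hot_nearest(x, centroids), encode_rbf(x, centroids))
```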
Relation to other statistical machine learning algorithms

k-means clustering, and its associated expectation-maximization algorithm, is a special case of a Gaussian mixture model; specifically, it is the limit of taking all covariances as diagonal, equal, and small. It is often easy to generalize a k-means problem into a Gaussian mixture model. Another generalization of the k-means algorithm is the K-SVD algorithm, which estimates data points as a sparse linear combination of "codebook vectors". k-means corresponds to the special case of using a single codebook vector, with a weight of 1.

Mean shift clustering

Basic mean shift clustering algorithms maintain a set of data points the same size as the input data set. Initially, this set is copied from the input set. Then this set is iteratively replaced by the mean of those points in the set that are within a given distance of each point. By contrast, k-means restricts this updated set to k points, usually many fewer than the number of points in the input data set, and replaces each point in this set by the mean of all points in the input set that are closer to that point than to any other (e.g. within the Voronoi partition of each updating point). A mean shift algorithm that is similar to k-means, called likelihood mean shift, replaces the set of points undergoing replacement by the mean of all points in the input set that are within a given distance of the changing set. One of the advantages of mean shift over k-means is that there is no need to choose the number of clusters, because mean shift is likely to find only a few clusters if indeed only a small number exist. However, mean shift can be much slower than k-means, and it still requires the selection of a bandwidth parameter. Mean shift has soft variants, much as k-means does.

Principal component analysis (PCA)

It has been asserted that the relaxed solution of k-means clustering, specified by the cluster indicators, is given by the PCA (principal component analysis) principal components, and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace. However, the observation that PCA is a useful relaxation of k-means clustering was not a new result, and it is straightforward to uncover counterexamples to the statement that the cluster centroid subspace is spanned by the principal directions.

Bilateral filtering

k-means implicitly assumes that the ordering of the input data set does not matter. The bilateral filter is similar to k-means and mean shift in that it maintains a set of data points that are iteratively replaced by means. However, the bilateral filter restricts the calculation of the (kernel weighted) mean to include only points that are close in the ordering of the input data. This makes it applicable to problems such as image denoising, where the spatial arrangement of pixels in an image is of critical importance.

12.8 Summary

Identifying groups of individuals or objects that are similar to each other but different from individuals in other groups can be intellectually satisfying, profitable, or sometimes both. Using your customer base, you may be able to form clusters of customers who have similar buying habits or demographics. You can take advantage of these similarities to target offers to subgroups that are most likely to be receptive to them. Based on scores on psychological inventories, you can cluster patients into subgroups that have similar response patterns. This may help you in targeting appropriate treatment and studying typologies of diseases. By analyzing the mineral contents of excavated materials, you can study their origins and spread. Distance metrics play an important role in data mining.
Distance metric gives a numerical value that measures the similarity between two data objects. In classification, the class of a new data object having unknown class label is predicted as the class of its similar objects. In clustering, the similar objects are grouped together. The most common distance metrics are Euclidian distance, Manhattan distance, Max distance. There are also some other distances such as Canberra distance, Cord distance and Chi-squared distance that are also used for some specific purposes. A distance metric measures the dissimilarity between two data points in terms of some numerical value. It also measures similarity; we can say that more distance less similar and less distance more similar. Another strategy for dealing with large state spaces is to treat them as a hierarchy of learning problems. In many cases, hierarchical solutions introduce slight sub-optimality in performance, but potentially gain a good deal of efficiency in execution time, learning time, and space. 12.9 Key words Cluster analysis, computing distance, Partitional methods, Hierarchical method. 12.10 Exercises 1. Describe Cluster analysis. 2. Discuss the different distance measures. 3. Discuss the different types of linkages. 4. Explain COBWEB algorithm. 5. Compare agglomerative clustering to divisive clustering. 6. What are the metrics used in hierarchical clustering. 7. Compare hierarchical clustering to partitional clustering. 8. Explain different linkage criteria. 265 9. Explain agglomerative clustering with a suitable example. 10. Explain k-means clustering with a suitable example. 11. Write short notes on cluster analysis. 12. Write short notes on other machine learning algorithms. 12.11 References 1. Data Mining and Analysis: Fundamental Concepts and Algorithms, by Mohammed J. Zaki, and Wagner Meira Jr., 2. An Introduction to Data Mining by Dr. Saed Sayad, Publisher: University of Toronto 266 UNIT-13: CLUSTER ANALYSIS Structure 13.1 Objectives 13.2 Introduction 13.3 Types of Clustering Methods 13.4 Clustering High-Dimensional Data 13.5 Constraint-Based Cluster Analysis 13.6 Outlier Analysis 13.7 Cluster Validation Techniques 13.8 Summary 13.9 Keywords 13.10 Exercises 13.11 References 13.1 Objectives The objectives covered under this unit include: The introduction to the Cluster Analysis Different Categories of Clustering Methods Clustering High-Dimensional Data Constraint-Based Cluster Analysis Outlier Analysis Cluster Validation Techniques 267 13.2 Introduction What Is Cluster Analysis? The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: First partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups. 
Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.

13.3 Types of Clustering Methods

In addition to the partitioning and hierarchical methods discussed in the previous unit, Han and Kamber (2001) suggest three further main categories of methods:

Density-based methods,
Model-based clustering methods,
Grid-based methods.

13.3.1 Density-based Methods

Density-based methods assume that the points that belong to each cluster are drawn from a specific probability distribution. The overall distribution of the data is assumed to be a mixture of several distributions. The aim of these methods is to identify the clusters and their distribution parameters. These methods are designed for discovering clusters of arbitrary shape which are not necessarily convex; that is, xi, xj belonging to cluster Ck does not necessarily imply that every point on the line segment joining xi and xj (that is, a*xi + (1 - a)*xj for a in [0, 1]) also belongs to Ck. The idea is to continue growing the given cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold; namely, the neighborhood of a given radius has to contain at least a minimum number of objects. When each cluster is characterized by a local mode or maximum of the density function, these methods are called mode-seeking.

Much work in this field has been based on the underlying assumption that the component densities are multivariate Gaussian (in the case of numeric data) or multinomial (in the case of nominal data). An acceptable solution in this case is to use the maximum likelihood principle. According to this principle, one should choose the clustering structure and parameters such that the probability of the data being generated by that clustering structure and those parameters is maximized. The expectation maximization algorithm (EM), which is a general-purpose maximum likelihood algorithm for missing-data problems, has been applied to the problem of parameter estimation. This algorithm begins with an initial estimate of the parameter vector and then alternates between two steps (Fraley and Raftery, 1998): an "E-step", in which the conditional expectation of the complete-data likelihood given the observed data and the current parameter estimates is computed, and an "M-step", in which parameters that maximize the expected likelihood from the E-step are determined. This algorithm has been shown to converge to a local maximum of the observed-data likelihood. The K-means algorithm may be viewed as a degenerate EM algorithm: assigning instances to clusters in K-means may be considered the E-step, and computing new cluster centers may be regarded as the M-step.

The DBSCAN algorithm (density-based spatial clustering of applications with noise) discovers clusters of arbitrary shapes and is efficient for large spatial databases. The algorithm searches for clusters by examining the neighborhood of each object in the database and checking whether it contains more than the minimum number of objects. Density-based clustering may also employ nonparametric methods, such as searching for bins with large counts in a multidimensional histogram of the input instance space (Jain et al., 1999).

DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density: DBSCAN is a density-based clustering algorithm.
The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points. OPTICS: Ordering Points to Identify the Clustering Structure: Although DBSCAN can cluster objects given input parameters such as andMinPts, it still leaves the user with the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters. Actually, this is a problem associated with many other clustering algorithms. Such parameter settings are usually empirically set and difficult to determine, especially for realworld, high-dimensional data sets. Most algorithms are very sensitive to such parameter values: slightly different settings may lead to very different clusterings of the data. Moreover, high-dimensional real data sets often have very skewed distributions, such that their intrinsic clustering structure may not be characterized by global density parameters. 270 To help overcome this difficulty, a cluster analysis method called OPTICS was proposed. Rather than produce a data set clustering explicitly, OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. This ordering represents the densitybased clustering structure of the data. It contains information that is equivalent to densitybased clustering obtained from a wide range of parameter settings. The cluster ordering can be used to extract basic clustering information (such as cluster centers or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure. DENCLUE: Clustering Based on Density Distribution Functions: It is a clustering method based on a set of density distribution functions. The method is built on the following ideas: The influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood. the overall density of the data space can be modeled analytically as the sum of the influence function applied to all data points Clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function. 13.3.2 Grid-based Methods These methods partition the space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time (Han and Kamber, 2001). The grid-based clustering approach uses a multi resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects, yet dependent on only the number of cells in each dimension in the quantized space. Some typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; Wave Cluster, which clusters objects using a wavelet 271 transform method; and CLIQUE, which represents a grid-and density-based approach for clustering in high-dimensional data space. STING: Statistical Information Grid: STING is a grid-based multi resolution clustering technique in which the spatial area is divided into rectangular cells. 
There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is pre computed and stored. These statistical parameters are useful for query processing, as described below. Wave Cluster: Clustering Using Wavelet Transformation: Wave Cluster is a multi resolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space. In this approach, each grid cell summarizes the information of a group of points that map into the cell. This summary information typically fits into main memory for use by the multi resolution wavelet transform and the subsequent cluster analysis. A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub bands. The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transforms d times. In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable. Clusters can then be identified by searching for dense regions in the new domain. ―Why is wavelet transformation useful for clustering?‖ It offers the following advantages: It provides unsupervised clustering. The multi-resolution property of wavelet transformations can help detect clusters at varying levels of accuracy. Wavelet-based clustering is very fast, 13.3.3 Model-Based Clustering Methods Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions. These methods attempt to 272 optimize the fit between the given data and some mathematical models. Unlike conventional clustering, which identifies groups of objects; model-based clustering methods also find characteristic descriptions for each group, where each group represents a concept or class. The most frequently used induction methods are: Decision trees Neural networks. Decision Trees: In decision trees, the data is represented by a hierarchical tree, where each leaf refers to a concept and contains a probabilistic description of that concept. Several algorithms produce classification trees for representing the unlabelled data. The most well-known algorithms are: COBWEB: this algorithm assumes that all attributes are independent (an often too naive assumption). Its aim is to achieve high predictability of nominal variable values, given a cluster. This algorithm is not suitable for clustering large database data (Fisher, 1987). CLASSIT: an extension of COBWEB for continuous-valued data, unfortunately has similar problems as the COBWEB algorithm. Neural Networks: This type of algorithm represents each cluster by a neuron or ―prototype‖. The input data is also represented by neurons, which are connected to the prototype neurons. Each such connection has a weight, which is learned adaptively during learning. 
A very popular neural algorithm for clustering is the self-organizing map (SOM). This algorithm constructs a single-layered network. The learning process takes place in a ―winner-takes-all‖ fashion: The prototype neurons compete for the current instance. The winner is the neuron whose weight vector is closest to the instance currently presented. The winner and its neighbors learn by having their weights adjusted. The SOM algorithm is successfully used for vector quantization and speech recognition. It is useful for visualizing high-dimensional data in 2D or 3D space. However, it is sensitive to the initial selection of weight vector, as well as to its different parameters, such as the learning rate and neighborhood radius. 273 13.4 Clustering High-Dimensional Data Most clustering methods are designed for clustering low-dimensional data and encounter Challenges when the dimensionality of the data grows really high (say, over 10 dimensions, or even over thousands of dimensions for some tasks). This is because when the dimensionality increases, usually only a small number of dimensions are relevant to certain clusters, but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered. Moreover, when dimensionality increases, data usually become increasingly sparse because the data points are likely located in different dimensional subspaces. When the data become really sparse, data points located at different dimensions can be considered as all equally distanced, and the distance measure, which is essential for cluster analysis, becomes meaningless. To overcome this difficulty, we may consider using feature (or attribute) transformation and feature (or attribute) selection techniques. Feature transformation methods, such as principal component analysis and singular value decomposition, transform the data onto a smaller space while generally preserving the original relative distance between objects. They summarize data by creating linear combinations of the attributes, and may discover hidden structures in the data. However, such techniques do not actually remove any of the original attributes from analysis. This is problematic when there are a large number of irrelevant attributes. The irrelevant information may mask the real clusters, even after transformation. Moreover, the transformed features (attributes) are often difficult to interpret, making the clustering results less useful. Thus, feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task. Unfortunately, real-world data sets tend to have many highly correlated, or redundant, dimensions. Another way of tackling the curse of dimensionality is to try to remove some of the dimensions. Attribute subset selection (or feature subset selection) is commonly used for data reduction by removing irrelevant or redundant dimensions (or attributes). Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task. Attribute subset selection involves searching through various attribute subsets and evaluating these subsets using certain criteria. It is most commonly performed by supervised learning—the most relevant set of attributes are found with respect to the given class labels. It can also be performed by an unsupervised process, such as 274 entropy analysis, which is based on the property that entropy tends to be low for data that contain tight clusters. 
Other evaluation functions, such as category utility, may also be used. Subspace clustering is an extension to attribute subset selection that has shown its strength at high-dimensional clustering. It is based on the observation that different subspaces may contain different, meaningful clusters. Subspace clustering searches for groups of clusters within different subspaces of the same data set. The problem becomes how to find such subspace clusters effectively and efficiently. In this section, we introduce three approaches for effective clustering of high-dimensional data: dimension-growth subspace clustering, represented by CLIQUE, Dimension-reduction projected clustering, represented by PROCLUS, Frequent pattern based clustering, represented by pCluster. CLIQUE: A Dimension-Growth Subspace Clustering Method: CLIQUE (Clustering InQUEst) was he first algorithm proposed for dimension-growth subspace clustering in highdimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. Because CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods. However, its overall approach is typical of subspace clustering for high-dimensional space, and so it is introduced in this section. The ideas of the CLIQUE clustering algorithm are outlined as follows. Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE‘s clustering identifies the sparse and the ―crowded‖ areas in space (or units), thereby discovering the overall distribution patterns of the data set. A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units. PROCLUS: A Dimension-Reduction Subspace Clustering Method: PROCLUS (PROjected CLUStering) is a typical dimension-reduction subspace clustering method. That is, instead of starting from single-dimensional spaces, it starts by finding an initial approximation of the 275 clusters in the high-dimensional attribute space. Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of some desired dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality. PROCLUS finds the best set of medoids by a hill-climbing process similar to that used in CLARANS, but generalized to deal with projected clustering. It adopts a distance measure called Manhattan segmental distance, which is the Manhattan distance on a set of relevant dimensions. The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster refinement. In the initialization phase, it uses a greedy algorithm to select a set of initial medoids that are far apart from each other so as to ensure that each cluster is represented by at least one object in the selected set. More concretely, it first chooses a random sample of data points proportional to the number of clusters we wish to generate, and then applies the greedy algorithm to obtain an even smaller final subset for the next phase. 
The iteration phase selects a random set of k medoids from this reduced set (of medoids), and replaces ―bad‖ medoids with randomly chosen new medoids if the clustering is improved. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation. The total number of dimensions associated to medoids must be k_l, where l is an input parameter that selects the average dimensionality of cluster subspaces. The refinement phase computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers. Experiments on PROCLUS show that the method is efficient and scalable at finding high-dimensional clusters. Unlike CLIQUE, which outputs many overlapped clusters, PROCLUS finds non overlapped partitions of points. The discovered clusters may help better understand the high-dimensional data and facilitate other subsequence analyses. Frequent Pattern–Based Clustering Methods: Frequent pattern mining, as the name implies, searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects. Methods for frequent pattern mining were introduced in Chapter 5. The idea behind frequent pattern–based cluster analysis is that the frequent patterns discovered may also indicate clusters. Frequent pattern–based cluster analysis is well suited to high-dimensional data. It can be viewed as an extension of the dimension-growth 276 subspace clustering approach. However, the boundaries of different dimensions are not obvious, since here they are represented by sets of frequent itemsets. That is, rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets, which eventually lead to cluster descriptions. Typical examples of frequent pattern– based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or ―features.‖ In this section, we examine two forms of frequent pattern–based cluster analysis: Frequent term–based text clustering and Clustering by pattern similarity in microarray data analysis. In frequent term–based text clustering, text documents are clustered based on the frequent terms they contain. Using the vocabulary of text document analysis, a term is any sequence of characters separated from other terms by a delimiter. A term can be made up of a single word or several words. In general, we first remove non text information (such as HTML tags and punctuation) and stop words. Terms are then extracted. A stemming algorithm is then applied to reduce each term to its basic stem. In this way, each document can be represented as a set of terms. Each set is typically large. Collectively, a large set of documents will contain a very large set of distinct terms. If we treat each term as a dimension, the dimension space will be of very high dimensionality! This poses great challenges for document cluster analysis. The dimension space can be referred to as term vector space, where each document is represented by a term vector. This difficulty can be overcome by frequent term–based analysis. That is, by using an efficient frequent itemset mining algorithm, we can mine a set of frequent terms from the set of text documents. 
Then, instead of clustering on high-dimensional term vector space, we need only consider the low-dimensional frequent term sets as ―cluster candidates.‖ Notice that a frequent term set is not a cluster but rather the description of a cluster. The corresponding cluster consists of the set of documents containing all of the terms of the frequent term set. A well-selected subset of the set of all frequent term sets can be considered as a clustering. 277 13.5 Constraint-Based Cluster Analysis In the above discussion, we assume that cluster analysis is an automated, algorithmic computational process, based on the evaluation of similarity or distance functions among a set of objects to be clustered, with little user guidance or interaction. However, users often have a clear view of the application requirements, which they would ideally like to use to guide the clustering process and influence the clustering results. Thus, in many applications, it is desirable to have the clustering process take user preferences and constraints into consideration. Examples of such information include the expected number of clusters, the minimal or maximal cluster size, weights for different objects or dimensions, and other desirable characteristics of the resulting clusters. Moreover, when a clustering task involves a rather high-dimensional space, it is very difficult to generate meaningful clusters by relying solely on the clustering parameters. User input regarding important dimensions or the desired results will serve as crucial hints or meaningful constraints for effective clustering. In general, we contend that knowledge discovery would be most effective if one could develop an environment for human-centered, exploratory mining of data, that is, where the human user is allowed to play a key role in the process. Foremost, a user should be allowed to specify a focus—directing the mining algorithm toward the kind of ―knowledge‖ that the user is interested in finding. Clearly, user-guided mining will lead to more desirable results and capture the application semantics. Constraintbased clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints. Constraints on individual objects: We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering. Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite 278 specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm; or (the radius) andMinPts (the minimum number of points) in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering. 
Constraints on distance or similarity functions: We can specify different distance or similarity functions for specific attributes of the objects to be clustered or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process. This can be seen in the following example. 13.6 Outlier Analysis An outlier is a data point which is significantly different from the remaining data. Hawkins formally defined [205] the concept of an outlier as follows: ―An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.‖ Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data is created by one or more generating processes, which could either reflect activity in the system or observations collected about entities. When the generating process behaves in an unusual way, it results in the creation of outliers. Therefore, an outlier often contains useful information about abnormal characteristics of the systems and entities, which impact the data generation process. The recognition of such unusual characteristics provides useful application-specific insights. Some examples are as follows: Intrusion Detection Systems: In many host-based or networked computer systems, different kinds of data are collected about the operating system calls, network traffic, 279 or other activity in the system. This data may show unusual behavior because of malicious activity. The detection of such activity is referred to as intrusion detection. Credit Card Fraud: Credit card fraud is quite prevalent, because of the ease with which sensitive information such as a credit card number may be compromised. This typically leads to unauthorized use of the credit card. In many cases, unauthorized use may show different patterns, such as a buying spree from geographically obscure locations. Such patterns can be used to detect outliers in credit card transaction data. Interesting Sensor Events: Sensors are often used to track various environmental and location parameters in many real applications. The sudden changes in the underlying patterns may represent events of interest. Event detection is one of the primary motivating applications in the field of sensor networks. Medical Diagnosis: In many medical applications the data is collected from a variety of devices such as MRI scans, PET scans or ECG time-series. Unusual patterns in such data typically reflect disease conditions. Law Enforcement: Outlier detection finds numerous applications to law enforcement, especially in cases, where unusual patterns can only be discovered over time through multiple actions of an entity. Determining fraud in financial transactions, trading activity, or insurance claims typically requires the determination of unusual patterns in the data generated by the actions of the criminal entity. 
Earth Science: A significant amount of spatiotemporal data about weather patterns, climate changes, or land cover patterns is collected through a variety of mechanisms such as satellites or remote sensing. Anomalies in such data provide significant insights about hidden human or environmental trends, which may have caused such anomalies. In all these applications, the data has a ―normal‖ model, and anomalies are recognized as deviations from this normal model. In many cases such as intrusion or fraud detection, the outliers can only be discovered as a sequence of multiple data points, rather than as an individual data point. For example, a fraud event may often reflect the actions of an individual in a particular sequence. The specificity of the sequence is relevant to identifying the anomalous event. Such anomalies are also referred to as collective anomalies, because they can only be inferred collectively from a set or sequence of data points. Such collective anomalies typically represent unusual events, which need to be discovered from the data. This book will address these different kinds of anomalies. 280 The output of an outlier detection algorithm can be one of two types: Most outlier detection algorithm output a score about the level of ―outlierness‖ of a data point. This can be used in order to determine a ranking of the data points in terms of their outlier tendency. This is a very general form of output, which retains all the information provided by a particular algorithm, but does not provide a concise summary of the small number of data points which should be considered outliers. A second kind of output is a binary label indicating whether a data point is an outlier or not. While some algorithms may directly return binary labels, the outlier scores can also be converted into binary labels. This is typically done by imposing thresholds on outlier scores, based on their statistical distribution. A binary labeling contains less information than a scoring mechanism, but it is the final result which is often needed for decision making in practical applications. The problem of outlier detection finds applications in numerous domains, where it is desirable to determine interesting and unusual events in the activity which generates such data. The core of all outlier detection methods is the creation of a probabilistic, statistical or algorithmic model which characterizes the normal behavior of the data. The deviations from this model are used to determine the outliers. A good domain-specific knowledge of the underlying data is often crucial in order to design simple and accurate models which do not over fit the underlying data. The problem of outlier detection becomes especially challenging, when significant relationships exist among the different data points. This is the case for time-series and network data in which the patterns in the relationships among the data points (whether temporal or structural) play the key role in defining the outliers. Outlier analysis has tremendous scope for research, especially in the area of structural and temporal analysis. 13.7 Clustering Validation Techniques The correctness of clustering algorithm results is verified using appropriate criteria and techniques. Since clustering algorithms define clusters that are not known a priori, irrespective of the clustering methods, the final partition of data requires some kind of evaluation in most applications. 
One of the most important issues in cluster analysis is the 281 evaluation of clustering results to find the partitioning that best fits the underlying data. This is the main subject of cluster validity. In the sequel we discuss the fundamental concepts of this area while we present the various cluster validity approaches proposed in literature. Fundamental concepts of cluster validity The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. In general terms, there are three approaches to investigate cluster validity. The first is based on external criteria. This implies that we evaluate the results of a clustering algorithm based on a pre-specified structure, which is imposed on a data set and reflects our intuition about the clustering structure of the data set. The second approach is based on internal criteria. We may evaluate the results of a clustering algorithm in terms of quantities that involve the vectors of the data set themselves (e.g. proximity matrix). The third approach of clustering validity is based on relative criteria. Here the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes, resulting by the same algorithm but with different parameter values. There are two criteria proposed for clustering evaluation and selection of an optimal clustering scheme. External criteria: In this approach the basic idea is to test whether the points of the data set are randomly structured or not. This analysis is based on the Null Hypothesis, H0, expressed as a statement of random structure of a dataset, let X. To test this hypothesis we are based on statistical tests, which lead to a computationally complex procedure. In the sequel Monde Carlo techniques are used as a solution to high computational problems. Internal criteria: Using this approach of cluster validity our goal is to evaluate the clustering result of an algorithm using only quantities and features inherent to the dataset. There are two cases in which we apply internal criteria of cluster validity depending on the clustering structure: a) hierarchy of clustering schemes, and b) single clustering scheme. Relative criteria: The basis of the above described validation methods is statistical testing. Thus, the major drawback of techniques based on internal or external criteria is their high computational demands. A different validation approach is discussed in this section. It is based on relative criteria and does not involve statistical tests. The fundamental idea of this approach is to choose the best clustering scheme of a set of defined schemes according to a pre-specified criterion. 282 13.8 Summary A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. Cluster analysis has wide applications, including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, and many others. This can be categorized into partitioning methods, hierarchical methods, density-based methods, gridbased methods, model-based methods, methods for high-dimensional data (including frequent pattern–based methods), and constraint based methods. 
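As one concrete illustration of the relative-criteria approach described in Section 13.7 (compare clustering schemes produced by the same algorithm under different parameter values and keep the best one according to a pre-specified criterion), the following hedged sketch runs k-means for several values of k and reports the within-cluster sum of squared errors of each scheme. It assumes scikit-learn is available; inspecting the reported values for an "elbow" is then a manual step.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data with three compact groups
X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 0], [3, 5])])

# Same algorithm, different parameter values (k); compare by a pre-specified criterion (SSE)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ is the within-cluster sum of squared errors
```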
13.8 Summary

A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. Cluster analysis has wide applications, including market or customer segmentation, pattern recognition, biological studies, spatial data analysis, Web document classification, and many others. Clustering methods can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern-based methods), and constraint-based methods.

13.9 Keywords

Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Clustering High-Dimensional Data, Constraint-Based Cluster Analysis, Outlier Analysis, Cluster Validation Techniques

13.10 Exercises

a) What is clustering? Explain.
b) Explain the different types of clustering.
c) Describe the different types of model-based clustering methods.
d) Give four examples of applications of outlier analysis.
e) Briefly describe the following approaches to clustering: a. partitioning methods, b. hierarchical methods, c. density-based methods, d. grid-based methods, e. model-based methods, f. methods for high-dimensional data, g. constraint-based methods. Give examples in each case.
f) Why is outlier mining important?
g) Describe the different approaches to clustering validation.

13.11 References

1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.

UNIT-14: SPATIAL DATA MINING

Structure
14.1 Objectives
14.2 Introduction
14.3 Spatial Data Cube Construction and Spatial OLAP
14.4 Mining Spatial Association and Co-location Patterns
14.5 Spatial Clustering Methods
14.6 Spatial Classification and Spatial Trend Analysis
14.7 Mining Raster Databases
14.8 Summary
14.9 Keywords
14.10 Exercises
14.11 References

14.1 Objectives

The objectives covered under this unit include:
An introduction to spatial data mining
Spatial data cube construction
Mining spatial association and co-location patterns
Spatial clustering methods
Spatial classification and spatial trend analysis

14.2 Introduction

What is Spatial Data Mining

A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial databases have many features distinguishing them from relational databases. They carry topological and/or distance information, usually organized by sophisticated, multidimensional spatial indexing structures that are accessed by spatial data access methods, and they often require spatial reasoning, geometric computation, and spatial knowledge representation techniques.

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and non-spatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used. A crucial challenge to spatial data mining is the development of efficient mining techniques, owing to the huge amount of spatial data and the complexity of spatial data types and spatial access methods.

"What about using statistical techniques for spatial data mining?" Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic information.
The term geostatistics is often associated with continuous geographic space, whereas the term spatial statistics is often associated with discrete space. In a statistical model that handles non-spatial data, one usually assumes statistical independence among different portions of the data. However, unlike traditional data sets, there is no such independence among spatially distributed data, because in reality spatial objects are often interrelated, or more exactly spatially co-located, in the sense that the closer two objects are located, the more likely they are to share similar properties. For example, natural resources, climate, temperature, and economic conditions are likely to be similar in geographically close regions. This is often regarded as the first law of geography: "Everything is related to everything else, but nearby things are more related than distant things." Such close interdependency across nearby space leads to the notion of spatial autocorrelation. Based on this notion, spatial statistical modeling methods have been developed with good success. Spatial data mining will further develop spatial statistical analysis methods and extend them to huge amounts of spatial data, with more emphasis on efficiency, scalability, cooperation with database and data warehouse systems, improved user interaction, and the discovery of new types of knowledge.

14.3 Spatial Data Cube Construction and Spatial OLAP

"Can we construct a spatial data warehouse?" Yes, as with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes. Let's look at the following example.

Example 1: Spatial data cube and spatial OLAP. There are about 3,000 weather probes distributed in British Columbia (BC), Canada, each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. With a spatial data warehouse that supports spatial OLAP, a user can view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, and can dynamically drill down or roll up along any dimension to explore desired patterns, such as "wet and hot regions in the Fraser Valley in summer 1999."

There are several challenging issues regarding the construction and utilization of spatial data warehouses. The first challenge is the integration of spatial data from heterogeneous sources and systems. Spatial data are usually stored by different industry firms and government agencies in various data formats. Data formats are not only structure-specific (e.g., raster- vs. vector-based spatial data, object-oriented vs. relational models, different spatial storage and indexing structures) but also vendor-specific (e.g., ESRI, MapInfo, Intergraph).

Figure 4: A star schema of the BC weather spatial data warehouse and the corresponding BC weather probes map.

There has been a great deal of work on the integration and exchange of heterogeneous spatial data, which has paved the way for spatial data integration and spatial data warehouse construction. The second challenge is the realization of fast and flexible on-line analytical processing in spatial data warehouses.
The star schema model is a good choice for modeling spatial data warehouses because it provides a concise and organized warehouse structure and facilitates OLAP operations. However, in a spatial warehouse, both dimensions and measures may contain spatial components.

There are three types of dimensions in a spatial data cube:

A non-spatial dimension contains only non-spatial data. Non-spatial dimensions temperature and precipitation can be constructed for the warehouse in Example 1, since each contains non-spatial data whose generalizations are non-spatial (such as "hot" for temperature and "wet" for precipitation).

A spatial-to-non-spatial dimension is a dimension whose primitive-level data are spatial but whose generalization, starting at a certain high level, becomes non-spatial. For example, the spatial dimension city relays geographic data for the U.S. map. Suppose that the dimension's spatial representation of, say, Seattle is generalized to the string "pacific northwest." Although "pacific northwest" is a spatial concept, its representation is not spatial (since, in our example, it is a string). It therefore plays the role of a non-spatial dimension.

A spatial-to-spatial dimension is a dimension whose primitive level and all of its high-level generalized data are spatial. For example, the dimension equi_temperature_region contains spatial data, as do all of its generalizations, such as regions covering 0-5 degrees (Celsius), 5-10 degrees, and so on.

We distinguish two types of measures in a spatial data cube:

A numerical measure contains only numerical data. For example, one measure in a spatial data warehouse could be the monthly revenue of a region, so that a roll-up may compute the total revenue by year, by county, and so on. Numerical measures can be further classified into distributive, algebraic, and holistic, as discussed in earlier units.

A spatial measure contains a collection of pointers to spatial objects. For example, in a generalization (or roll-up) in the spatial data cube of Example 1, the regions with the same range of temperature and precipitation are grouped into the same cell, and the measure so formed contains a collection of pointers to those regions.

A non-spatial data cube contains only non-spatial dimensions and numerical measures. If a spatial data cube contains spatial dimensions but no spatial measures, its OLAP operations, such as drilling or pivoting, can be implemented in a manner similar to that for non-spatial data cubes.

"But what if I need to use spatial measures in a spatial data cube?" This raises some challenging issues for efficient implementation, as shown in the following example.

Example 2: Numerical versus spatial measures. A star schema for the BC weather warehouse of Example 1 is shown in Figure 4. It consists of four dimensions: region, temperature, time, and precipitation, and three measures: region_map, area, and count. A concept hierarchy for each dimension can be created by users or experts, or generated automatically by data clustering analysis. Figure 5 presents hierarchies for each of the dimensions in the BC weather warehouse. Of the three measures, area and count are numerical measures that can be computed similarly as for non-spatial data cubes; region_map is a spatial measure that represents a collection of spatial pointers to the corresponding regions.
Since different spatial OLAP operations result in different collections of spatial objects in region_map, it is a major challenge to compute the merges of a large number of regions flexibly and dynamically.

Figure 5: Hierarchies for each dimension of the BC weather data warehouse.

For example, two different roll-ups on the BC weather map data (Figure 4) may produce two different generalized region maps, as shown in Figure 6, each being the result of merging a large number of small (probe) regions from Figure 4.

Figure 6: Generalized regions after different roll-up operations.

"Can we precompute all of the possible spatial merges and store them in the corresponding cuboid cells of a spatial data cube?" The answer is: probably not. Unlike a numerical measure, where each aggregated value requires only a few bytes of space, a merged region map of BC may require multiple megabytes of storage. Thus, we face a dilemma in balancing the cost of on-line computation against the space overhead of storing computed measures: the substantial cost of computing spatial aggregations on the fly calls for precomputation, yet the substantial overhead of storing aggregated spatial values discourages it.

There are at least three possible choices regarding the computation of spatial measures in spatial data cube construction:

Collect and store the corresponding spatial object pointers, but do not precompute any spatial measures in the spatial data cube. This can be implemented by storing, in the corresponding cube cell, a pointer to a collection of spatial object pointers, and invoking and performing the spatial merge (or other computation) of the corresponding spatial objects on the fly, when necessary. This method is a good choice if only spatial display is required (i.e., no real spatial merge has to be performed), if there are not many regions to be merged in any pointer collection (so that the on-line merge is not very costly), or if on-line spatial merge computation is fast (efficient spatial merge methods have been developed for fast spatial OLAP). Since OLAP results are often used for on-line spatial analysis and mining, it is still recommended to precompute some of the spatially connected regions to speed up such analysis.

Precompute and store a rough approximation of the spatial measures in the spatial data cube. This choice is good for a rough view or coarse estimation of spatial merge results, under the assumption that such an approximation requires little storage space. For example, a minimum bounding rectangle (MBR), represented by two corner points, can be taken as a rough estimate of a merged region. Such a precomputed result is small and can be presented to users quickly. If higher precision is needed for specific cells, the application can either fetch precomputed high-quality results, if available, or compute them on the fly.

Selectively precompute some spatial measures in the spatial data cube. This can be a smart choice, but the question becomes, "Which portion of the cube should be selected for materialization?" The selection can be performed at the cuboid level: either precompute and store each set of mergeable spatial regions for each cell of a selected cuboid, or precompute none if the cuboid is not selected. Since a cuboid usually consists of a large number of spatial objects, this may involve precomputation and storage of a large number of mergeable spatial objects, some of which may be rarely used.
Therefore, it is recommended to perform the selection at a finer granularity: examine each group of mergeable spatial objects in a cuboid and determine whether such a merge should be precomputed. The decision should be based on the utility (such as access frequency or access priority), the shareability of merged regions, and the balanced overall cost of space and on-line computation.

With efficient implementation of spatial data cubes and spatial OLAP, generalization-based descriptive spatial mining, such as spatial characterization and discrimination, can be performed efficiently.

14.4 Mining Spatial Association and Co-location Patterns

Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form

A ⇒ B [s%, c%]

where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following is a spatial association rule:

is_a(X, "school") ∧ close_to(X, "sports_center") ⇒ close_to(X, "park") [0.5%, 80%]

This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belong to such a case. Various kinds of spatial predicates can constitute a spatial association rule. Examples include distance information (such as close_to and far_away), topological relations (such as intersect, overlap, and disjoint), and spatial orientations (such as left_of and west_of).

Since spatial association mining needs to evaluate multiple spatial relationships among a large number of spatial objects, the process can be quite costly. An interesting mining optimization method called progressive refinement can be adopted in spatial association analysis. The method first mines large data sets roughly using a fast algorithm and then improves the quality of mining on a pruned data set using a more expensive algorithm. To ensure that the pruned data set covers the complete set of answers when the high-quality mining algorithms are applied at a later stage, an important requirement for the rough mining algorithm applied in the early stage is the superset coverage property: it must preserve all of the potential answers. In other words, it may allow false positives (including some data sets that do not belong to the answer sets), but it must not allow false negatives (excluding potential answers).

For mining spatial associations related to the spatial predicate close_to, we can first collect the candidates that pass the minimum support threshold by (1) applying certain rough spatial evaluation algorithms, for example, using an MBR structure (which registers only two spatial points rather than a set of complex polygons), and (2) evaluating the relaxed spatial predicate g_close_to, which is a generalized close_to covering a broader context that includes close_to, touch, and intersect. If two spatial objects are closely located, their enclosing MBRs must be closely located, matching g_close_to. However, the reverse is not always true: if the enclosing MBRs are closely located, the two spatial objects may or may not be located so closely. Thus, MBR pruning is a false-positive test for closeness: only the candidates that pass the rough test need to be further examined using more expensive spatial computation algorithms. With this preprocessing, only the patterns that are frequent at the approximation level need to be examined by more detailed and finer, yet more expensive, spatial computation.
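A minimal Python sketch of this MBR-based filter-and-refine step is shown below. The helper names (mbr, mbr_distance, exact_distance) and the representation of spatial objects as point sets are illustrative assumptions; a real spatial database would use its own geometry types and spatial indexes.

```python
def mbr(points):
    """Minimum bounding rectangle of a point set: (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_distance(a, b):
    """Minimum distance between two MBRs (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def exact_distance(p_points, q_points):
    """Exact minimum point-to-point distance (the expensive step; brute force here)."""
    return min(((px - qx) ** 2 + (py - qy) ** 2) ** 0.5
               for px, py in p_points for qx, qy in q_points)

def close_to(obj_a, obj_b, eps):
    """Filter-and-refine: the cheap MBR test can only produce false positives,
    so objects rejected by it can never satisfy the exact close_to predicate."""
    if mbr_distance(mbr(obj_a), mbr(obj_b)) > eps:   # rough g_close_to filter
        return False
    return exact_distance(obj_a, obj_b) <= eps       # expensive refinement

school = [(0, 0), (1, 0), (1, 1)]
park = [(1.5, 1.2), (2.5, 2.0)]
print(close_to(school, park, eps=1.0))
```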
Besides mining spatial association rules, one may wish to identify groups of particular features that frequently appear close to each other on a geospatial map. This is essentially the problem of mining spatial co-locations. Finding spatial co-locations can be considered a special case of mining spatial associations. However, based on the property of spatial autocorrelation, interesting features are likely to coexist in closely located regions, so spatial co-location may be exactly what one wants to explore. Efficient methods can be developed for mining spatial co-locations by exploring methodologies such as Apriori and progressive refinement, similar to what has been done for mining spatial association rules.

14.5 Spatial Clustering Methods

Spatial data clustering identifies clusters, or densely populated regions, according to some distance measure in a large, multidimensional data set. Spatial clustering is the process of grouping a set of spatial objects into groups called clusters, such that objects within a cluster show a high degree of similarity, whereas the clusters themselves are as dissimilar as possible. Clustering is a well-known technique in statistics, and clustering algorithms are used to deal with large geographical data sets. Clustering algorithms can be separated into four general categories, based on how they define clusters: partitioning methods, hierarchical methods, density-based methods, and grid-based methods.

Partitioning Method
A partitioning algorithm organizes the objects into clusters such that the total deviation of each object from its cluster center is minimized. At the beginning, each object is assigned to a single cluster, forming an initial partition; in subsequent steps, data points are iteratively reallocated among the clusters until a stopping criterion is met. K-means is the most commonly used fundamental partitioning algorithm.

Hierarchical Method
This method hierarchically decomposes the data set by splitting or merging clusters until a stopping criterion is met. Well-known hierarchical clustering algorithms include BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and CURE (Clustering Using REpresentatives).

Density-Based Method
This method regards clusters as dense regions of objects that are separated by regions of low density (representing noise). In contrast to partitioning methods, clusters of arbitrary shape can be discovered. Density-based methods can also be used to filter out noise and outliers.

Grid-Based Method
Grid-based clustering algorithms first quantize the clustering space into a finite number of cells and then perform the required operations on the grid structure. Cells that contain more than a certain number of points are treated as dense. The main advantage of the approach is its fast processing time, since the running time is independent of the number of data objects and depends only on the number of cells.
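The grid-based idea can be sketched in a few lines of Python: quantize 2-D points into cells and keep the cells whose point count exceeds a density threshold. The cell size, the threshold, and the sample coordinates are illustrative assumptions; merging adjacent dense cells into clusters is omitted for brevity.

```python
from collections import defaultdict

def dense_cells(points, cell_size=1.0, min_points=3):
    """Quantize points into grid cells and return the cells that are 'dense'.
    Cost depends on the number of points and cells, not on pairwise distances."""
    counts = defaultdict(int)
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return {cell for cell, c in counts.items() if c >= min_points}

pts = [(0.1, 0.2), (0.4, 0.7), (0.9, 0.3),
       (5.1, 5.2), (5.3, 5.4), (5.2, 5.8),
       (9.0, 1.0)]
print(dense_cells(pts, cell_size=1.0, min_points=3))   # two dense cells
```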
14.6 Spatial Classification and Spatial Trend Analysis

Spatial classification analyzes spatial objects to derive classification schemes relevant to certain spatial properties, such as the neighborhood of a district, highway, or river.

Example 3: Spatial classification. Suppose that you would like to classify regions in a province into rich versus poor according to the average family income. In doing so, you would like to identify the important spatially related factors that determine a region's classification. Many properties are associated with spatial objects, such as hosting a university, containing interstate highways, being near a lake or ocean, and so on. These properties can be used for relevance analysis and to find interesting classification schemes. Such classification schemes may be represented in the form of decision trees or rules.

Spatial trend analysis deals with another issue: the detection of changes and trends along a spatial dimension. Typically, trend analysis detects changes with time, such as the changes of temporal patterns in time-series data. Spatial trend analysis replaces time with space and studies the trend of non-spatial or spatial data changing with space. For example, we may observe the trend of changes in economic conditions when moving away from the center of a city, or the trend of changes in climate or vegetation with increasing distance from an ocean. For such analyses, regression and correlation analysis methods are often applied, making use of spatial data structures and spatial access methods.

14.7 Mining Raster Databases

Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their compositions, such as networks or partitions. Typical examples of such data include maps, design graphs, and 3-D representations of the arrangement of the chains of protein molecules. However, a huge amount of space-related data is in digital raster (image) form, such as satellite images, remote sensing data, and computed tomography. It is important to explore data mining in raster or image databases. Methods for mining raster and image data are examined below in the discussion of mining multimedia data.

There are also many applications where patterns change with both space and time. For example, traffic flows on highways and in cities are related to both time and space, and weather patterns are likewise closely related to both. Although there have been a few interesting studies on spatial classification and spatial trend analysis, the investigation of spatiotemporal data mining is still in its early stages. More methods and applications of spatial classification and trend analysis, especially those associated with time, need to be explored.

Multimedia Data Mining

"What is a multimedia database?" A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages. Multimedia database systems are increasingly common owing to the popular use of audio-video equipment, digital cameras, CD-ROMs, and the Internet. Typical multimedia database systems include NASA's EOS (Earth Observation System) and various kinds of image, audio-video, and Internet databases.

Similarity Search in Multimedia Data

"When searching for similarities in multimedia data, can we search on either the data description or the data content?" Yes. For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems: (1) description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation; and
(2) content-based retrieval systems, which support retrieval based on the image content, such as color histograms, texture, pattern, image topology, and the shapes of objects and their layouts and locations within the image. Description-based retrieval is labor-intensive if performed manually; if automated, the results are typically of poor quality. For example, the assignment of keywords to images can be a tricky and arbitrary task.

Multidimensional Analysis of Multimedia Data

"Can we construct a data cube for multimedia data analysis?" To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes built from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.

Mining Associations in Multimedia Data

"What kinds of associations can be mined in multimedia data?" Association rules involving multimedia objects can be mined in image and video databases. At least three categories can be observed:

Associations between image content and non-image content features: A rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this category, since it links the image content to the keyword sky.

Associations among image contents that are not related to spatial relationships: A rule like "If a picture contains two blue squares, then it is likely to contain one red circle as well" belongs to this category, since the associations concern only image contents.

Associations among image contents related to spatial relationships: A rule like "If a red triangle is between two yellow squares, then it is likely that a big oval-shaped object is underneath" belongs to this category, since it associates objects in the image through spatial relationships.

14.8 Summary

Vast amounts of data are stored in various complex forms, such as structured or unstructured, hypertext, and multimedia data. Thus, mining complex types of data, including object data, spatial data, multimedia data, text data, and Web data, has become an increasingly important task in data mining. Spatial data mining is the discovery of interesting patterns from large geospatial databases. Spatial data cubes that contain spatial dimensions and measures can be constructed, and spatial OLAP can be implemented to facilitate multidimensional spatial data analysis. Spatial data mining includes mining spatial association and co-location patterns, clustering, classification, and spatial trend and outlier analysis. Multimedia data mining is the discovery of interesting patterns from multimedia databases that store and manage large collections of multimedia objects, including audio data, image data, video data, sequence data, and hypertext data containing text, text markups, and linkages. Issues in multimedia data mining include content-based retrieval and similarity search, and generalization and multidimensional analysis.

14.9 Keywords

Multimedia Data Mining, Raster Databases, Spatial Classification, Spatial Clustering Methods, Mining Spatial Association, Co-location Patterns, Spatial Data Cube Construction, Spatial OLAP, Spatial Data Mining

14.10 Exercises

1. Define spatial data mining.
2. Define spatial data cube construction and spatial OLAP, and give an example.
3. Explain the mining of spatial associations.
4. What are the different types of spatial clustering methods? Explain.
5. What are spatial classification and spatial trend analysis?
6. Explain multimedia data mining.

14.11 References

1. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, Second Edition, 2006.
2. Introduction to Data Mining with Case Studies by G. K. Gupta, Eastern Economy Edition (PHI, New Delhi), Third Edition, 2009.
3. Data Mining Techniques by Arun K Pujari, University Press, Second Edition, 2009.

UNIT-15: TEXT MINING

Structure
15.1 Objectives
15.2 Introduction
15.3 Mining Text Data
15.4 Text Data Analysis and Information Retrieval
15.5 Dimensionality Reduction for Text
15.6 Text Mining Approaches
15.7 Summary
15.8 Keywords
15.9 Exercises
15.10 References

15.1 Objectives

The objectives covered under this unit include:
An introduction to text mining
Techniques for text mining
Text data analysis
Information retrieval
Dimensionality reduction for text
Text mining approaches

15.2 Introduction

What is text mining? Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features, the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.

15.3 Mining Text Data

'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via the application of natural language processing (NLP) and analytical methods.

A key element of text mining is its focus on the document collection. At its simplest, a document collection can be any grouping of text-based documents. Practically speaking, however, most text mining solutions are aimed at discovering patterns across very large document collections, which can range from many thousands to tens of millions of documents. Document collections can be either static, in which case the initial complement of documents remains unchanged, or dynamic, a term applied to collections characterized by the inclusion of new or updated documents over time. Extremely large document collections, as well as collections with very high rates of document change, can pose performance optimization challenges for various components of a text mining system.

Data stored in most text databases are semistructured, in that they are neither completely unstructured nor completely structured.
For example, a document may contain a few structured fields, such as title, authors, publication date, and category, but also largely unstructured text components, such as the abstract and contents. There has been a great deal of study on the modeling and implementation of semistructured data in recent database research. Moreover, information retrieval techniques, such as text indexing methods, have been developed to handle unstructured documents. Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of documents, or find patterns and trends across multiple documents. Thus, text mining has become an increasingly popular and essential theme in data mining.

15.4 Text Data Analysis and Information Retrieval

The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. Information retrieval (IR) is a field that has been developing in parallel with database systems for many years. Unlike the field of database systems, which has focused on query and transaction processing of structured data, information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. Since information retrieval and database systems each handle different kinds of data, some database system problems are usually not present in information retrieval systems, such as concurrency control, recovery, transaction management, and update. Likewise, some common information retrieval problems are usually not encountered in traditional database systems, such as unstructured documents, approximate search based on keywords, and the notion of relevance.

Due to the abundance of text information, information retrieval has found many applications. There exist many information retrieval systems, such as on-line library catalog systems, on-line document management systems, and the more recently developed Web search engines. A typical information retrieval problem is to locate relevant documents in a document collection based on a user's query, which is often some keywords describing an information need, although it could also be an example of a relevant document. In such a search problem, a user takes the initiative to "pull" the relevant information out of the collection; this is most appropriate when a user has an ad hoc (i.e., short-term) information need, such as finding information to buy a used car. When a user has a long-term information need (e.g., a researcher's interests), a retrieval system may also take the initiative to "push" any newly arrived information item to the user if the item is judged relevant to the user's information need. Such an information access process is called information filtering, and the corresponding systems are often called filtering systems or recommender systems. From a technical viewpoint, however, search and filtering share many common techniques. Below we briefly discuss the major techniques in information retrieval, with a focus on search techniques.
Basic Measures for Text Retrieval: Precision and Recall

A text retrieval system retrieves a set of documents in response to an input query. How do we assess how accurately the system retrieves documents? Let the set of documents relevant to a query be denoted {Relevant}, and the set of documents retrieved be denoted {Retrieved}. The set of documents that are both relevant and retrieved is {Relevant} ∩ {Retrieved}, as illustrated below.

Figure: Relationship between the set of relevant documents and the set of retrieved documents.

There are two basic measures for assessing the quality of text retrieval:

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses). It is formally defined as

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved. It is formally defined as

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

An information retrieval system often needs to trade off recall for precision, or vice versa. One commonly used trade-off is the F-score, defined as the harmonic mean of recall and precision:

F_score = (2 × precision × recall) / (precision + recall)

Precision, recall, and the F-score are the basic measures of a retrieved set of documents. These measures are not directly useful for comparing two ranked lists of documents, because they are not sensitive to the internal ranking of the documents in a retrieved set. In order to measure the quality of a ranked list of documents, it is common to compute an average of the precisions at all of the ranks where a new relevant document is returned. It is also common to plot a graph of precision at many different levels of recall; a higher curve represents a better-quality information retrieval system.
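These measures are straightforward to compute from the two sets. The short Python sketch below does so for an illustrative, made-up query result; the document identifiers are arbitrary.

```python
def precision_recall_f(relevant, retrieved):
    """Compute precision, recall, and F-score (harmonic mean) from
    the set of relevant documents and the set of retrieved documents."""
    hits = len(relevant & retrieved)            # |Relevant ∩ Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

relevant = {"d1", "d3", "d5", "d7"}     # documents judged relevant to the query
retrieved = {"d1", "d2", "d3", "d4"}    # documents returned by the system
print(precision_recall_f(relevant, retrieved))   # (0.5, 0.5, 0.5)
```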
Text Retrieval Methods

Text retrieval methods fall into two categories: they generally view the retrieval problem either as a document selection problem or as a document ranking problem.

In document selection methods, the query is regarded as specifying constraints for selecting relevant documents. A typical method in this category is the Boolean retrieval model, in which a document is represented by a set of keywords and a user provides a Boolean expression of keywords, such as "car and repair shops," "tea or coffee," or "database systems but not Oracle." The retrieval system takes such a Boolean query and returns the documents that satisfy the Boolean expression. Because of the difficulty of prescribing a user's information need exactly with a Boolean query, the Boolean retrieval method generally works well only when the user knows a lot about the document collection and can formulate a good query in this way.

Document ranking methods use the query to rank all documents in order of relevance. For ordinary users and exploratory queries, these methods are more appropriate than document selection methods. Most modern information retrieval systems present a ranked list of documents in response to a user's keyword query. There are many different ranking methods based on a large spectrum of mathematical foundations, including algebra, logic, probability, and statistics. The common intuition behind all of these methods is that we match the keywords in a query with those in the documents and score each document based on how well it matches the query. The goal is to approximate the degree of relevance of a document with a score computed from information such as the frequency of words in the document and in the whole collection. Notice that it is inherently difficult to provide a precise measure of the degree of relevance between two sets of keywords; for example, it is difficult to quantify the distance between "data mining" and "data analysis." Comprehensive empirical evaluation is thus essential for validating any retrieval method.

The basic idea of the vector space model is to represent a document and a query both as vectors in a high-dimensional space corresponding to all the keywords, and to use an appropriate similarity measure to compute the similarity between the query vector and the document vector. The similarity values can then be used for ranking documents.

Tokenize text

The first step in most retrieval systems is to identify keywords for representing documents, a preprocessing step often called tokenization. To avoid indexing useless words, a text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed "irrelevant." For example, a, the, of, for, and with are stop words, even though they may appear frequently. Stop lists may vary per document set. For example, database systems could be an important keyword in a newspaper, but it may be considered a stop word in a set of research papers presented at a database systems conference. A group of different words may share the same word stem. A text retrieval system needs to identify groups of words in which the words are small syntactic variants of one another, and collect only the common word stem per group. For example, the words drug, drugged, and drugs share the common word stem drug and can be viewed as different occurrences of the same word.

Model a document to facilitate information retrieval

Starting with a set of d documents and a set of t terms, we can model each document as a vector v in the t-dimensional space R^t, which is why this method is called the vector space model. Let the term frequency freq(d, t) be the number of occurrences of term t in document d. The (weighted) term-frequency matrix TF(d, t) measures the association of a term t with a given document d: it is generally defined as 0 if the document does not contain the term, and nonzero otherwise. There are many ways to define the term weighting for the nonzero entries in such a vector. For example, we can simply set TF(d, t) = 1 if the term t occurs in document d, use the term frequency freq(d, t) itself, or use the relative term frequency, that is, the term frequency divided by the total number of occurrences of all terms in the document. There are also other ways to normalize the term frequency. For example, the Cornell SMART system uses the following formula to compute the (normalized) term frequency:

TF(d, t) = 0 if freq(d, t) = 0, and TF(d, t) = 1 + log(1 + log(freq(d, t))) otherwise.
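This piecewise formula translates directly into code. A minimal sketch is shown below; the natural logarithm is used here, since the logarithm base is not specified above, so treat the base as an assumption.

```python
import math

def smart_tf(freq):
    """Cornell SMART-style normalized term frequency: dampens raw counts so
    that a term occurring 100 times is not treated as 100 times as important."""
    if freq == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(freq))

for f in (0, 1, 2, 10, 100):
    print(f, round(smart_tf(f), 3))   # grows very slowly with the raw count
```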
Besides the term frequency measure, there is another important measure, called the inverse document frequency (IDF), which represents the scaling factor, or importance, of a term t. If a term t occurs in many documents, its importance is scaled down because of its reduced discriminative power. For example, the term database systems is likely to be less important if it occurs in many research papers at a database systems conference. According to the same Cornell SMART system, IDF(t) is defined by the following formula:

IDF(t) = log((1 + |d|) / |d_t|)

where d is the document collection and d_t is the set of documents containing term t. If |d_t| ≪ |d|, the term t will have a large IDF scaling factor, and vice versa. In the complete vector space model, TF and IDF are combined to form the TF-IDF measure:

TF-IDF(d, t) = TF(d, t) × IDF(t)

Using the notions of term frequency and inverse document frequency, we can compute the similarity among a set of documents.

Text Indexing Techniques

Text indexing is the act of processing a text in order to extract statistics considered important for representing the information it contains and to allow fast search of its content. Text indexing operations can be performed not only on natural language texts but on virtually any type of textual information, such as the source code of computer programs, DNA or protein databases, and textual data stored in traditional database systems. Popular text retrieval indexing techniques include inverted indices and signature files. Text index compression is the problem of designing a reduced-space data structure that provides fast search over a text collection, seen as a set of documents. In information retrieval, the searches to support are usually for whole words or phrases, either to retrieve the list of all documents in which they appear (full-text searching) or to retrieve a ranked list of the documents in which those words or phrases are most relevant according to some criterion (relevance ranking). Since inverted indexes (sometimes also called inverted lists or inverted files) are by far the most popular type of text index in information retrieval, compression techniques for them differ mainly in whether they are oriented toward full-text searching or toward relevance ranking.

Query Processing Techniques

Once an inverted index has been created for a document collection, a retrieval system can answer a keyword query quickly by looking up which documents contain the query keywords. Specifically, we maintain a score accumulator for each document and update these accumulators as we go through each query term: for each query term, we fetch all of the documents that match the term and increase their scores. When examples of relevant documents are available, the system can learn from such examples to improve retrieval performance. This is called relevance feedback and has proven effective in improving retrieval performance. When no such relevant examples are available, a system can assume that the top few documents in some initial retrieval results are relevant and extract related keywords from them to expand the query. Such feedback is called pseudo-feedback or blind feedback and is essentially a process of mining useful keywords from the top retrieved documents. Pseudo-feedback also often leads to improved retrieval performance.
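Putting the preceding pieces together — tokenization, SMART-style TF and IDF weights, and cosine similarity between query and document vectors — the sketch below ranks a tiny, made-up document collection for a keyword query. It is a minimal illustration of the vector space model, not a production retrieval system (no stemming, stop lists, or inverted index).

```python
import math
from collections import Counter

docs = {
    "d1": "data mining finds patterns in large data sets",
    "d2": "database systems support query and transaction processing",
    "d3": "text mining derives high quality information from text",
}

def tf(freq):
    return 0.0 if freq == 0 else 1.0 + math.log(1.0 + math.log(freq))

def tfidf_vector(tokens, df, n_docs):
    counts = Counter(tokens)
    return {t: tf(c) * math.log((1 + n_docs) / df[t])
            for t, c in counts.items() if t in df}

def cosine(u, v):
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

tokenized = {d: text.split() for d, text in docs.items()}
df = Counter(t for toks in tokenized.values() for t in set(toks))   # document frequency
vectors = {d: tfidf_vector(toks, df, len(docs)) for d, toks in tokenized.items()}

query_vec = tfidf_vector("mining text data".split(), df, len(docs))
for d, vec in sorted(vectors.items(), key=lambda kv: -cosine(query_vec, kv[1])):
    print(d, round(cosine(query_vec, vec), 3))
```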
One major limitation of many existing retrieval methods is that they are based on exact keyword matching. However, due to the complexity of natural languages, keyword-based retrieval can encounter two major difficulties. The first is the synonymy problem: two words with identical or similar meanings may have very different surface forms. For example, a user's query may use the word "automobile," but a relevant document may use "vehicle" instead. The second is the polysemy problem: the same keyword, such as mining or Java, may mean different things in different contexts. We now discuss some advanced techniques that can help solve these problems as well as reduce the index size.

15.5 Dimensionality Reduction for Text

Text-based queries can be represented as vectors, which can be used to search for their nearest neighbors in a document collection. However, for any nontrivial document database, the number of terms T and the number of documents D are usually quite large. Such high dimensionality leads to inefficient computation, since the resulting frequency table has size T × D. Furthermore, the high dimensionality also leads to very sparse vectors and increases the difficulty of detecting and exploiting the relationships among terms (e.g., synonymy). To overcome these problems, dimensionality reduction techniques such as latent semantic indexing, probabilistic latent semantic analysis, and locality preserving indexing can be used. We now briefly introduce these methods.

To explain the basic ideas behind latent semantic indexing and locality preserving indexing, we need some matrix and vector notation. In the following, we use x1, . . . , xn ∈ R^m to represent n documents with m features (words). They can be collected into a term-document matrix X = [x1, x2, . . . , xn].

Latent Semantic Indexing

Latent semantic indexing (LSI) is one of the most popular algorithms for document dimensionality reduction. It is fundamentally based on the singular value decomposition (SVD). Suppose the rank of the term-document matrix X is r; then LSI decomposes X using the SVD as follows:

X = U Σ V^T

where Σ = diag(σ1, . . . , σr) with σ1 ≥ σ2 ≥ · · · ≥ σr the singular values of X, U = [a1, . . . , ar] with ai the left singular vectors, and V = [v1, . . . , vr] with vi the right singular vectors. LSI uses the first k vectors in U as the transformation matrix to embed the original documents into a k-dimensional subspace. It can easily be checked that the column vectors of U are the eigenvectors of X X^T. The basic idea of LSI is to extract the most representative features while minimizing the reconstruction error. Let a be the transformation vector. The objective function of LSI can be stated as

a_opt = argmin_a ||X − a a^T X||^2 = argmax_a a^T X X^T a, subject to the constraint a^T a = 1.

Since X X^T is symmetric, the basis functions of LSI are orthogonal.
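The following sketch illustrates LSI-style dimensionality reduction with a plain truncated SVD in NumPy. The toy term-document matrix and the choice k = 2 are illustrative assumptions; real systems work with large sparse matrices and sparse SVD routines.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents)
X = np.array([
    [2, 0, 1, 0],   # "data"
    [1, 0, 1, 0],   # "mining"
    [0, 3, 0, 2],   # "database"
    [0, 2, 0, 3],   # "systems"
], dtype=float)

k = 2                                       # target number of latent dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :k]                               # first k left singular vectors

# Embed the documents into the k-dimensional latent space
docs_k = Uk.T @ X
print(np.round(docs_k, 2))

# The rank-k reconstruction shows how much structure the k dimensions retain
Xk = Uk @ np.diag(s[:k]) @ Vt[:k, :]
print("reconstruction error:", round(np.linalg.norm(X - Xk), 3))
```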
Locality Preserving Indexing

Different from LSI, which aims to extract the most representative features, locality preserving indexing (LPI) aims to extract the most discriminative features. The basic idea of LPI is to preserve locality information: if two documents are near each other in the original document space, LPI tries to keep them close together in the reduced-dimensionality space. Since neighboring documents (data points in the high-dimensional space) probably relate to the same topic, LPI is able to map documents with the same semantics as close to each other as possible.

Given the document set x1, . . . , xn ∈ R^m, LPI constructs a similarity matrix S ∈ R^{n×n}. The transformation vectors of LPI can be obtained by solving the following minimization problem:

a_opt = argmin_a Σ_{ij} (a^T x_i − a^T x_j)^2 S_ij = argmin_a a^T X L X^T a

where L = D − S is the graph Laplacian and D_ii = Σ_j S_ij. D_ii measures the local density around x_i. LPI constructs the similarity matrix S as

S_ij = x_i^T x_j, if x_i is among the p nearest neighbors of x_j or x_j is among the p nearest neighbors of x_i; S_ij = 0 otherwise.

Thus, the objective function of LPI incurs a heavy penalty if neighboring points x_i and x_j are mapped far apart. Minimizing it is therefore an attempt to ensure that if x_i and x_j are "close," then y_i (= a^T x_i) and y_j (= a^T x_j) are close as well. Finally, the basis functions of LPI are the eigenvectors associated with the smallest eigenvalues of the generalized eigenproblem

X L X^T a = λ X D X^T a.

LSI aims to find the best subspace approximation to the original document space in the sense of minimizing the global reconstruction error; in other words, LSI seeks to uncover the most representative features. LPI aims to discover the local geometrical structure of the document space. Since neighboring documents probably relate to the same topic, LPI can have more discriminating power than LSI. Theoretical analysis shows that LPI is an unsupervised approximation of supervised linear discriminant analysis (LDA). Therefore, for document clustering and document classification, we might expect LPI to have better performance than LSI; this has been confirmed empirically.

Probabilistic Latent Semantic Indexing

The probabilistic latent semantic indexing (PLSI) method is similar to LSI but achieves dimensionality reduction through a probabilistic mixture model. Specifically, we assume there are k latent common themes in the document collection, each characterized by a multinomial word distribution. A document is regarded as a sample of a mixture model with these theme models as components. We fit such a mixture model to all the documents, and the obtained k component multinomial models can be regarded as defining k new semantic dimensions. The mixing weights of a document can then be used as a new representation of the document in this low-dimensional latent semantic space.

Formally, let C = {d1, d2, . . . , dn} be a collection of n documents, and let θ1, . . . , θk be k theme multinomial distributions. A word w in document d_i is regarded as a sample of the following mixture model:

p_{d_i}(w) = Σ_{j=1}^{k} π_{d_i, j} p(w | θ_j)

where π_{d_i, j} is a document-specific mixing weight for the j-th aspect theme, and Σ_{j=1}^{k} π_{d_i, j} = 1. The log-likelihood of the collection C given the parameter set Λ is

log p(C | Λ) = Σ_{i=1}^{n} Σ_{w∈V} c(w, d_i) log ( Σ_{j=1}^{k} π_{d_i, j} p(w | θ_j) )

where V is the vocabulary (the set of all words), c(w, d_i) is the count of word w in document d_i, and Λ = ({θ_j}, {π_{d_i, j}}) is the set of all theme model parameters. The model can be estimated using the Expectation-Maximization (EM) algorithm, which computes the maximum likelihood estimate Λ* = argmax_Λ log p(C | Λ). Once the model is estimated, θ1, . . . , θk define k new semantic dimensions, and the weights π_{d_i, j} give a representation of d_i in this low-dimensional space.

15.6 Text Mining Approaches

There are many approaches to text mining, which can be classified from different perspectives based on the inputs taken by the text mining system and the data mining tasks to be performed.
In general, the major approaches, based on the kinds of data they take as input, are: (1) the keyword-based approach, where the input is a set of keywords or terms in the documents; (2) the tagging approach, where the input is a set of tags; and (3) the information-extraction approach, whose input is semantic information, such as events, facts, or entities uncovered by information extraction. A simple keyword-based approach may only discover relationships at a relatively shallow level, such as the rediscovery of compound nouns (e.g., "database" and "systems") or co-occurring patterns of little significance (e.g., "terrorist" and "explosion"), and may not bring much deep understanding of the text. The tagging approach may rely on tags obtained by manual tagging (which is costly and infeasible for large collections of documents) or by some automated categorization algorithm (which may handle only a relatively small set of tags and requires the categories to be defined beforehand). The information-extraction approach is more advanced and may lead to the discovery of deeper knowledge, but it requires semantic analysis of text by natural language understanding and machine learning methods, which is a challenging knowledge discovery task.

Various text mining tasks can be performed on the extracted keywords, tags, or semantic information. These include document clustering, classification, information extraction, association analysis, and trend analysis. We examine a few such tasks in the following discussion.

Keyword-Based Association Analysis

Like most analyses in text databases, association analysis first preprocesses the text data by parsing, stemming, removing stop words, and so on, and then invokes association mining algorithms. In a document database, each document can be viewed as a transaction, while the set of keywords in the document can be considered the set of items in the transaction. That is, the database is in the format {document_id, set of keywords}. The problem of keyword association mining in document databases is thereby mapped to item association mining in transaction databases, for which many interesting methods have been developed.

A set of frequently occurring consecutive or closely located keywords may form a term or a phrase. The association mining process can help detect compound associations, that is, domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President, George W. Bush], as well as noncompound associations, such as [dollars, shares, exchange, total, commission, stake, securities]. Mining based on these associations is referred to as "term-level association mining" (as opposed to mining on individual words). Term recognition and term-level association mining enjoy two advantages in text analysis: (1) terms and phrases are automatically tagged, so there is no need for human effort in tagging documents; and (2) the number of meaningless results is greatly reduced, as is the execution time of the mining algorithms.

With such term and phrase recognition, term-level mining can be invoked to find associations among a set of detected terms and keywords. Some users may wish to find associations between pairs of keywords or terms from a given set of keywords or phrases, whereas others may wish to find the maximal set of terms occurring together. Therefore, based on user mining requirements, standard association mining or max-pattern mining algorithms may be invoked.
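The mapping from documents to transactions can be illustrated with a tiny frequent-itemset sketch in Python. The documents, keywords, and minimum support value are illustrative assumptions, and only pairs of keywords are counted to keep the example short; a full Apriori or max-pattern miner would extend this to larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Each document is treated as a transaction: {document_id: set of keywords}
transactions = {
    "doc1": {"stanford", "university", "research"},
    "doc2": {"stanford", "university", "admissions"},
    "doc3": {"dollars", "shares", "exchange"},
    "doc4": {"dollars", "shares", "securities"},
    "doc5": {"stanford", "university", "securities"},
}

min_support = 2   # minimum number of documents a keyword pair must appear in

pair_counts = Counter()
for keywords in transactions.values():
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # e.g. ('stanford', 'university') appears in 3 documents
```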
Document Classification Analysis

Automated document classification is an important text mining task. With the existence of a tremendous number of on-line documents, it is tedious yet essential to be able to organize them automatically into classes so as to facilitate document retrieval and subsequent analysis. Document classification has been used in automated topic tagging (i.e., assigning labels to documents), topic directory construction, identification of document writing styles (which may help narrow down the possible authors of anonymous documents), and classifying the purposes of hyperlinks associated with a set of documents.

A general procedure is as follows. First, a set of pre-classified documents is taken as the training set. The training set is then analyzed in order to derive a classification scheme, which often needs to be refined through a testing process. The derived classification scheme can then be used to classify other on-line documents. This process appears similar to the classification of relational data, but there is a fundamental difference. Relational data are well structured: each tuple is defined by a set of attribute-value pairs. For example, in the tuple {sunny, warm, dry, not windy, play tennis}, the value "sunny" corresponds to the attribute weather outlook, "warm" corresponds to the attribute temperature, and so on. The classification analysis decides which set of attribute-value pairs has the greatest discriminating power in determining whether a person is going to play tennis. In contrast, document databases are not structured according to attribute-value pairs: the set of keywords associated with a set of documents is not organized into a fixed set of attributes or dimensions. If we view each distinct keyword, term, or feature in a document as a dimension, there may be thousands of dimensions in a set of documents. Therefore, commonly used relational-data-oriented classification methods, such as decision tree analysis, may not be effective for the classification of document databases.

According to the vector space model, two documents are similar if they share similar document vectors. This model motivates the construction of the k-nearest-neighbor classifier, based on the intuition that similar documents are expected to be assigned the same class label. We can simply index all of the training documents, each associated with its corresponding class label. When a test document is submitted, we treat it as a query to the IR system and retrieve from the training set the k documents that are most similar to the query, where k is a tunable constant. The class label of the test document can then be determined from the class label distribution of its k nearest neighbors. This distribution can also be refined, for example by using weighted counts instead of raw counts, or by setting aside a portion of labeled documents for validation. By tuning k and incorporating such refinements, this kind of classifier can achieve accuracy comparable with the best classifiers. However, since the method needs nontrivial space to store (possibly redundant) training information and additional time for inverted index lookups, it has additional space and time overhead in comparison with other kinds of classifiers.
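A minimal sketch of such a k-nearest-neighbor document classifier is given below, using raw term-count vectors and cosine similarity for brevity. The tiny training set, the value k = 3, and the unweighted majority vote are illustrative assumptions; a real system would use TF-IDF weights and an inverted index.

```python
import math
from collections import Counter

training = [
    ("k-means clustering groups similar data points", "clustering"),
    ("hierarchical clustering merges clusters bottom up", "clustering"),
    ("density based clustering finds arbitrary shaped clusters", "clustering"),
    ("decision trees classify records using attribute tests", "classification"),
    ("naive bayes is a simple probabilistic classifier", "classification"),
    ("support vector machines separate classes with a margin", "classification"),
]

def vectorize(text):
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(text, k=3):
    query = vectorize(text)
    neighbors = sorted(training, key=lambda pair: -cosine(query, vectorize(pair[0])))[:k]
    votes = Counter(label for _, label in neighbors)   # majority vote of the k labels
    return votes.most_common(1)[0][0]

print(knn_classify("grouping documents into clusters of similar topics"))
```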
The vector-space model may assign a large weight to rare items regardless of their class distribution characteristics, and such rare items may lead to ineffective classification. Let's examine an example of the TF-IDF measure computation. Suppose there are two terms t1 and t2 in two classes C1 and C2, each class having 100 training documents. Term t1 occurs in five documents in each class (i.e., in 5% of the overall corpus), but t2 occurs in 20 documents in class C1 only (i.e., in 10% of the overall corpus). Term t1 will have a higher TF-IDF value because it is rarer, but it is obvious that t2 has stronger discriminative power in this case. A feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels. This reduces the set of terms to be used in classification, improving both efficiency and accuracy. After feature selection, which removes nonfeature terms, the resulting "cleansed" training documents can be used for effective classification.

Bayesian classification is one of several popular techniques that can be used for effective document classification. Since document classification can be viewed as calculating the statistical distribution of documents over specific classes, a Bayesian classifier first trains the model by estimating a generative document distribution P(d|c) for each class c and document d, and then tests which class is most likely to have generated the test document. Since such methods handle high-dimensional data sets, they can be used for effective document classification. Other classification methods have also been used for document classification. For example, if we represent classes by numbers and construct a direct mapping function from the term space to the class variable, support vector machines can be used to perform effective classification, since they work well in high-dimensional spaces. The least-squares linear regression method is also used for discriminative classification.

Association-based classification classifies documents based on a set of associated, frequently occurring text patterns. Notice that very frequent terms are likely to be poor discriminators; thus only those terms that are not very frequent and that have good discriminative power are used in document classification. Such an association-based classification method proceeds as follows. First, keywords and terms are extracted by information retrieval and simple association analysis techniques. Second, concept hierarchies of keywords and terms are obtained using available term classes, such as WordNet, or by relying on expert knowledge or keyword classification systems. Documents in the training set can also be classified into class hierarchies. A term association mining method can then be applied to discover sets of associated terms that maximally distinguish one class of documents from the others. This derives a set of association rules associated with each document class. Such classification rules can be ordered based on their discriminative power and occurrence frequency, and used to classify new documents. This kind of association-based document classifier has been shown to be effective.
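The combination of feature selection and Bayesian classification described above can be sketched as follows, assuming scikit-learn is available; the documents, the labels, and the choice of k = 5 selected features are made-up illustrations.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled documents; in practice the training set is much larger.
docs = [
    "interest rates bank loan credit",
    "bank credit market loan",
    "goal match striker referee",
    "referee match penalty goal",
]
labels = ["finance", "finance", "sports", "sports"]

# Keep only the terms most correlated with the class labels (feature selection),
# then train a generative Bayesian classifier on the "cleansed" term counts.
model = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),
    MultinomialNB(),
)
model.fit(docs, labels)
print(model.predict(["the bank raised interest rates"]))  # e.g. ['finance']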
Document Clustering Analysis

Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner. When documents are represented as term vectors, clustering methods can be applied. However, the document space is always of very high dimensionality, ranging from several hundred to thousands of dimensions. Due to the curse of dimensionality, it makes sense to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. Traditional clustering algorithms can then be applied in the low-dimensional semantic space. To this end, spectral clustering, mixture model clustering, clustering using Latent Semantic Indexing (LSI), and clustering using Locality Preserving Indexing (LPI) are the most well-known techniques. We discuss each of these methods here.

The spectral clustering method first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g., k-means) in the reduced document space. Recent work on spectral clustering shows its capability to handle highly nonlinear data (data spaces with high curvature in every local area). Its strong connections to differential geometry make it capable of discovering the manifold structure of the document space. One major drawback of these spectral clustering algorithms is that they use a nonlinear embedding (dimensionality reduction) that is only defined on the "training" data: they have to use all of the data points to learn the embedding. When the data set is very large, learning such an embedding is computationally expensive, which restricts the application of spectral clustering to large data sets.

The mixture model clustering method models the text data with a mixture model, often involving multinomial component models. Clustering involves two steps: (1) estimating the model parameters based on the text data and any additional prior knowledge, and (2) inferring the clusters based on the estimated model parameters. Depending on how the mixture model is defined, these methods can cluster words and documents at the same time. Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) are two examples of such techniques. One potential advantage of such clustering methods is that the clusters can be designed to facilitate comparative analysis of documents.

For LSI and LPI we can obtain the transformation vectors (the embedding function) explicitly. Such embedding functions are defined everywhere; thus, we can use part of the data to learn the embedding function and then embed all of the data into the low-dimensional space. With this trick, clustering using LSI and LPI can handle large document corpora. As discussed in the previous section, LSI aims to find the best subspace approximation to the original document space in the sense of minimizing the global reconstruction error. In other words, LSI seeks to uncover the most representative features rather than the most discriminative features for document representation. Therefore, LSI might not be optimal for discriminating documents with different semantics, which is the ultimate goal of clustering. LPI aims to discover the local geometrical structure and can have more discriminating power. Experiments show that, for clustering, LPI is more suitable than LSI as a dimensionality reduction method. Compared with LSI and LPI, the PLSA method reveals the latent semantic dimensions in a more interpretable way and can easily be extended to incorporate any prior knowledge or preferences about clustering.
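A minimal sketch of clustering via LSI, assuming scikit-learn is available: documents are projected onto a low-dimensional latent semantic subspace with truncated SVD and then clustered with k-means. The corpus, the two latent dimensions, and the two clusters are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Hypothetical document collection; the cluster structure is deliberately obvious.
docs = [
    "database query index transaction",
    "query optimization index database",
    "neural network training gradient",
    "gradient descent neural training",
]

# TF-IDF term vectors live in a very high-dimensional space.
X = TfidfVectorizer().fit_transform(docs)

# LSI: project documents onto a low-dimensional latent semantic subspace.
lsi = TruncatedSVD(n_components=2, random_state=0)
X_low = lsi.fit_transform(X)

# Apply a traditional clustering algorithm in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
print(labels)  # e.g. [0 0 1 1]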
15.7 Summary

A substantial portion of the available information is stored in text or document databases, which consist of large collections of documents such as news articles, technical papers, books, digital libraries, e-mail messages, and Web pages. Text information retrieval and data mining have thus become increasingly important. Precision, recall, and the F-score are three basic measures from Information Retrieval (IR). Various text retrieval methods have been developed. These typically focus either on document selection (where the query is regarded as providing constraints) or on document ranking (where the query is used to rank documents in order of relevance). The vector-space model is a popular example of the latter kind. Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI), and Probabilistic LSI can be used for text dimensionality reduction. Text mining goes one step beyond keyword-based and similarity-based information retrieval and discovers knowledge from semistructured text data using methods such as keyword-based association analysis, document classification, and document clustering.

15.8 Keywords

Text mining, F-score, Recall, Precision, Information retrieval, Text indexing, Dimensionality reduction, Document clustering analysis, Probabilistic Latent Semantic Indexing, Locality Preserving Indexing, Latent Semantic Indexing.

15.9 Exercises

1. What is text mining?
2. Explain mining of text data.
3. Briefly explain information retrieval.
4. What are the basic measures for text retrieval? Explain.
5. Explain text retrieval methods.
6. Write a note on text indexing techniques.
7. Write a note on query processing techniques.
8. Write a note on dimensionality reduction for text.
9. Why is dimensionality reduction required for text?
10. Explain text mining approaches.
11. Explain document classification analysis and document clustering analysis.

15.10 References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, 2006.
2. G. K. Gupta, Introduction to Data Mining with Case Studies, Third Edition, Eastern Economy Edition (PHI, New Delhi), 2009.

UNIT-16: MULTIMEDIA DATA MINING

Structure
16.1 Objectives
16.2 Introduction
16.3 Mining Multimedia Data
16.4 Similarity Search in Multimedia Data
16.5 Multidimensional Analysis of Multimedia Data
16.6 Mining Associations in Multimedia Data
16.7 Summary
16.8 Keywords
16.9 Exercises
16.10 References

16.1 Objectives

The objectives covered under this unit include:
Introduction to multimedia data
Techniques for mining multimedia data
Similarity search in multimedia data
Multidimensional analysis of multimedia data
Mining associations in multimedia data

16.2 Introduction

What is multimedia data? "What is a multimedia database?" A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text mark-ups, and linkages. Multimedia database systems are increasingly common owing to the popular use of audio/video equipment, digital cameras, CD-ROMs, and the Internet. Typical multimedia database systems include NASA's EOS (Earth Observation System), various kinds of image and audio-video databases, and Internet databases. Rapid progress in digital data acquisition and storage technology has led to a tremendous and fast-growing amount of data stored in databases. Although valuable information may be hiding behind these data, the overwhelming data volume makes it difficult (if not impossible) for human beings to extract it without powerful tools.
Multimedia mining systems that can automatically extract semantically meaningful information (knowledge) from multimedia files are increasingly in demand. For this reason, a large number of techniques have been proposed, ranging from simple measures (e.g., color histograms for images, energy estimates for audio signals) to more sophisticated systems such as speaker emotion recognition in audio and automatic summarization of TV programs. Generally, multimedia database systems store and manage large collections of multimedia objects, such as image, video, audio, and hypertext data.

16.3 Multimedia Data Mining

In multimedia documents, knowledge discovery deals with non-structured information. For this reason, we need tools for discovering relationships between objects or segments within multimedia document components, such as classifying images based on their content, extracting patterns in sound, categorizing speech and music, and recognizing and tracking objects in video streams. In general, the multimedia files from a database must first be pre-processed to improve their quality. Subsequently, these multimedia files undergo various transformations and feature extraction to generate the important features from the multimedia files. With the generated features, mining can be carried out using data mining techniques to discover significant patterns. The resulting patterns are then evaluated and interpreted in order to obtain the final application knowledge. Figure 1 presents a model of applying multimedia mining to different multimedia types.

Data collection is the starting point of a learning system, as the quality of the raw data determines the overall achievable performance. The goal of data pre-processing is then to discover important features in the raw data. Data pre-processing includes data cleaning, normalization, transformation, feature selection, and so on. Learning can be straightforward if informative features can be identified at the pre-processing stage. The detailed procedure depends highly on the nature of the raw data and the problem domain. In some cases, prior knowledge can be extremely valuable. For many systems, this stage is still conducted primarily by domain experts. The product of data pre-processing is the training set. Given a training set, a learning model has to be chosen to learn from it. It must be mentioned that the steps of multimedia mining are often iterative: the analyst can jump back and forth between major tasks in order to improve the results.

Multimedia mining involves much higher complexity resulting from:
The huge volume of data;
The variability and heterogeneity of the multimedia data (e.g., diversity of sensors, time or conditions of acquisition); and
The subjectivity of the multimedia content's meaning.

The high dimensionality of the feature spaces and the size of multimedia datasets make feature extraction a challenging problem. In the following, we analyze the feature extraction process for multimedia data.

Feature extraction: There are two kinds of features: description-based and content-based. The former uses metadata, such as keywords, caption, size, and time of creation. The latter is based on the content of the object itself.

Feature extraction from text: Text categorization is a conventional classification problem applied to the textual domain. It solves the problem of assigning text content to predefined categories. In the learning stage, the labelled training data are first pre-processed to remove unwanted details and to "normalize" the data.
For example, in text documents punctuation symbols and non-alphanumeric characters are usually discarded, because they do not help in classification. Moreover, all characters are usually converted to lower case to simplify matters. The next step is to compute the features that are useful for distinguishing one class from another. For a text document, this usually means identifying the keywords that summarize the contents of the document. How are these keywords learned? One way is to look for words that occur frequently in the document; these words tend to reflect what the document is about. Of course, words that occur too frequently, such as "the", "is", "in", and "of", are no help at all, since they are prevalent in every document. These common English words may be removed using a "stop-list" of words during the pre-processing stage. From the remaining words, a good heuristic is to look for words that occur frequently in documents of the same class, but rarely in documents of other classes. In order to cope with documents of different lengths, relative frequency is preferred over absolute frequency. Some authors have used phrases, rather than individual words, as indexing terms, but the experimental results found to date have not been uniformly encouraging. Another problem with text is word variants. A variant refers to a different form of the same word, e.g., "go", "goes", "went", "gone", "going". This may be solved by stemming, which means replacing all variants of a word by a standard one.
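The preprocessing steps just described (lowercasing, punctuation and stop-word removal, crude stemming, and relative keyword frequencies) can be sketched in a few lines of Python; the tiny stop-list and the suffix-stripping rule below are deliberately simplistic stand-ins for real components such as a full stop-word list and a Porter stemmer.

import re
from collections import Counter

STOP_WORDS = {"the", "is", "in", "of", "a", "and", "to"}  # tiny illustrative stop-list

def keyword_frequencies(text):
    """Lowercase, strip punctuation, drop stop words, crudely stem,
    and return relative keyword frequencies for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # A deliberately naive stemmer: real systems use Porter stemming or similar.
    stems = [re.sub(r"(ing|ed|es|s)$", "", t) for t in tokens]
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: count / total for stem, count in counts.items()}

print(keyword_frequencies("The player is playing and played in the match."))
# {'player': 0.25, 'play': 0.5, 'match': 0.25}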
Feature extraction from images: Image categorization classifies images into semantic databases that are manually pre-categorized. In the same semantic database, images may have large variations with dissimilar visual descriptions (e.g., images of persons, images of industries). In addition, images from different semantic databases might share a common background (some flowers and sunsets have similar colours). Authors distinguish three types of feature vectors for image description: (1) pixel-level features, (2) region-level features, and (3) tile-level features. Pixel-level features store spectral and textural information about each pixel of the image. For example, the fraction of the end members, such as concrete or water, can describe the content of the pixels. Region-level features describe groups of pixels. Following the segmentation process, each region is described by its boundary and a number of attributes that present information about the content of the region in terms of the end members, texture, shape, size, fractal scale, and so on. Tile-level features present information about whole images using texture, percentages of end members, fractal scale, and others. Moreover, other researchers have proposed an information-driven framework that aims to highlight the role of information at various levels of representation. This framework adds one more level of information: the pattern and knowledge level, which integrates domain-related alphanumeric data and the semantic relationships discovered from the image data.

Feature extraction from audio: Audio data play an important role in multimedia applications. Music information has two main branches: symbolic information and audio information. For symbolic music, the attack, duration, volume, velocity, and instrument type of every single note are available. Therefore, it is possible to easily access statistical measures such as tempo and mean key for each music item. Moreover, it is possible to attach high-level descriptors to each item, such as instrument kind and number. On the other hand, audio information deals with real-world signals, and any features need to be extracted through signal analysis. Some of the most frequently used features for audio classification are:
Total energy: the temporal energy of an audio frame, defined by the RMS of the audio signal magnitude within each frame.
Zero Crossing Rate (ZCR): a commonly used temporal feature; ZCR counts the number of times that the audio signal crosses its zero axis.
Frequency Centroid (FC): the weighted average of all frequency components of a frame.
Bandwidth (BW): the weighted average of the squared differences between each frequency component and the frequency centroid.
Pitch period: a feature related to the fundamental frequency of an audio signal.
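As an illustration, the two temporal features above (total RMS energy and zero crossing rate) can be computed per frame with NumPy; the frame size, the sampling rate, and the synthetic test tone are arbitrary assumptions, and the ZCR is reported here as a fraction of sample pairs rather than a raw count.

import numpy as np

def frame_features(signal, frame_size=1024):
    """Compute RMS energy and zero crossing rate for each audio frame.
    `signal` is a 1-D NumPy array of samples; frame_size is illustrative."""
    features = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        energy = np.sqrt(np.mean(frame ** 2))               # total (RMS) energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)  # zero crossing rate
        features.append((energy, zcr))
    return features

# A synthetic 440 Hz tone sampled at 8 kHz stands in for real audio data.
t = np.arange(8000) / 8000.0
tone = np.sin(2 * np.pi * 440 * t)
print(frame_features(tone)[:2])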
Feature extraction from video: In video mining, there are three types of videos: (a) produced video (e.g., movies, news videos, and dramas), (b) raw video (e.g., traffic videos, surveillance videos), and (c) medical video (e.g., ultrasound videos, including echocardiograms). Higher-level information extracted from video includes:
• detecting trigger events (e.g., any vehicle entering a particular area, people exiting or entering a particular building);
• determining typical and anomalous patterns of activity, and generating person-centric or object-centric views of an activity;
• classifying activities into named categories (e.g., walking, riding a bicycle).
The first stage in mining raw video data is grouping input frames into a set of basic units that are relevant to the structure of the video. In produced videos, the most widely used basic unit is a shot, which is defined as a collection of frames recorded from a single camera operation. Shot detection methods can be classified into many categories: pixel based, statistics based, transform based, feature based, and histogram based. Color or grayscale histograms (as in image mining) can also be used. To segment video, color histograms as well as motion and texture features can be used. Generally, if the difference between two consecutive frames is larger than a certain threshold value, then a shot boundary is considered to lie between the two corresponding frames. The difference can be determined by comparing the corresponding pixels of the two images.

Data pre-processing: In a multimedia database, there are numerous objects with many different dimensions of interest. For example, the color attribute alone can have 256 dimensions, each counting the frequency of a given color in an image, and the image may still have other dimensions. Selecting a subset of features is a method for reducing the problem size. This reduces the dimensionality of the data and enables learning algorithms to operate faster and more effectively. The problem of feature interaction can also be addressed by constructing new features from the basic feature set; this technique is called feature construction/transformation. Sampling is also well accepted by the statistics community, which argues that "a powerful computationally intense procedure operating on a sub-sample of the data may in fact provide superior accuracy than a less sophisticated one using the entire data base". Moreover, discretization can significantly reduce the number of possible values of a continuous feature, as a large number of possible feature values contributes to a slow and ineffective machine learning process. Furthermore, normalization (a "scaling down" transformation of the features) is also beneficial, since there is often a large difference between the maximum and minimum values of the features.

16.4 Similarity Search in Multimedia Data

"When searching for similarities in multimedia data, can we search on either the data description or the data content?" That is correct. For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems:
Description-based retrieval systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation;
Content-based retrieval systems, which support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.
Description-based retrieval is labour-intensive if performed manually. If automated, the results are typically of poor quality. For example, the assignment of keywords to images can be a tricky and arbitrary task. Recent development of Web-based image clustering and classification methods has improved the quality of description-based Web image retrieval, because text surrounding images as well as Web linkage information can be used to extract proper descriptions and to group images describing a similar theme together. Content-based retrieval uses visual features to index images and supports object retrieval based on feature similarity, which is highly desirable in many applications.

In a content-based image retrieval system, there are often two kinds of queries: image-sample-based queries and image feature specification queries. Image-sample-based queries find all of the images that are similar to the given image sample. This search compares the feature vector (or signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database. Based on this comparison, images that are close to the sample image are returned. Image feature specification queries specify or sketch image features such as color, texture, or shape, which are translated into a feature vector to be matched with the feature vectors of the images in the database. Content-based retrieval has wide applications, including medical diagnosis, weather prediction, TV production, Web search engines for images, and e-commerce. Some systems, such as QBIC (Query By Image Content), support both sample-based and image feature specification queries. There are also systems that support both content-based and description-based retrieval. Several approaches have been proposed and studied for similarity-based retrieval in image databases, based on image signatures:

Color histogram-based signature: In this approach, the signature of an image includes color histograms based on the color composition of the image, regardless of its scale or orientation. This method does not contain any information about shape, image topology, or texture. Thus, two images with similar color composition but very different shapes or textures may be identified as similar, although they could be completely unrelated semantically.
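A minimal sketch of a color histogram-based signature and its comparison, using NumPy; the 4-level quantization per channel, the Euclidean distance function, and the synthetic 8x8 test images are illustrative choices, not part of any particular retrieval system described above.

import numpy as np

def color_signature(image, bins=4):
    """Quantize each RGB channel into `bins` levels and return a normalized
    joint color histogram as the image signature (bins=4 gives 64 entries)."""
    quantized = (image // (256 // bins)).reshape(-1, 3)
    index = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
    hist = np.bincount(index, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def histogram_distance(sig1, sig2):
    """Smaller distance means more similar color composition."""
    return np.linalg.norm(sig1 - sig2)

# Two tiny synthetic "images" (8x8 RGB arrays) stand in for database images.
rng = np.random.default_rng(0)
reddish = rng.integers(0, 64, (8, 8, 3))
reddish[:, :, 0] += 180
bluish = rng.integers(0, 64, (8, 8, 3))
bluish[:, :, 2] += 180
print(histogram_distance(color_signature(reddish), color_signature(bluish)))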
Multifeature composed signature: In this approach, the signature of an image includes a composition of multiple features: color histogram, shape, image topology, and texture. The extracted image features are stored as metadata, and images are indexed based on such metadata. Often, separate distance functions can be defined for each feature and subsequently combined to derive the overall result. Multidimensional content-based search often uses one or a few probe features to search for images containing such (similar) features, so it can be used to search for similar images. This is the most popular approach in practice.

Wavelet-based signature: This approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture, and image topology information in a single unified framework. This improves efficiency and reduces the need for providing multiple search primitives (unlike the second method above). However, since this method computes a single signature for an entire image, it may fail to identify images containing similar objects when the objects differ in location or size.

Wavelet-based signature with region-based granularity: In this approach, the computation and comparison of signatures are performed at the granularity of regions, not the entire image. This is based on the observation that similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other. Therefore, a similarity measure between a query image Q and a target image T can be defined in terms of the fraction of the area of the two images covered by matching pairs of regions from Q and T. Such a region-based similarity search can find images containing similar objects, where these objects may be translated or scaled.

The representation of multidimensional points and objects, and the development of appropriate indexing methods that enable them to be retrieved efficiently, is a well-studied subject. Most of these methods were designed for use in application domains where the data usually has a spatial component of relatively low dimension. Examples of such application domains include geographic information systems (GIS), spatial databases, solid modelling, computer vision, computational geometry, and robotics. However, there are many application domains where the data is of considerably higher dimensionality and is not necessarily spatial. This is especially true in multimedia databases, where the data is a set of objects and the high dimensionality is a direct result of trying to describe the objects via a collection of features (also known as a feature vector). In the case of images, examples of features include color, color moments, textures, shape descriptions, and so on, expressed using scalar values. The goal in these applications is often expressed more generally as one of the following:
Find objects whose feature values fall within a given range, or whose distance from some query object falls into a certain range (range queries).
Find objects whose features have values similar to those of a given query object or set of query objects (nearest neighbour queries).
These queries are collectively referred to as similarity searching.

Curse of dimensionality: An apparently straightforward solution to finding the nearest neighbour is to compute a Voronoi diagram for the data points (i.e., a partition of the space into regions where all points in a region are closer to that region's associated data point than to any other data point), and then locate the Voronoi region corresponding to the query point.
The problem with this solution is that the combinatorial complexity of the Voronoi diagram in high dimensions is prohibitive: it grows exponentially with the dimension k, so that for N points the time to build it and the space requirements can grow as rapidly as Θ(N^⌈k/2⌉). This renders its applicability moot. The above is typical of the problems we face when dealing with high-dimensional data. Generally speaking, multidimensional queries become increasingly difficult as the dimensionality increases. The problem is characterized as the curse of dimensionality. This term indicates that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with the number of variables (i.e., dimensions) that comprise it. For similarity searching (i.e., finding nearest neighbours), this means that the number of objects (i.e., points) in the data set that need to be examined in deriving the estimate grows exponentially with the underlying dimension.

The curse of dimensionality has a direct bearing on similarity searching in high dimensions, as it raises the issue of whether or not nearest neighbour searching is even meaningful in such a domain. In particular, letting d denote a distance function (which need not necessarily be a metric), it has been pointed out that nearest neighbour searching is not meaningful when the ratio of the variance of the distance between two random points p and q, drawn from the data and query distributions, to the expected distance between them converges to zero as the dimension k goes to infinity, that is,

lim_{k→∞} Var[d(p, q)] / E[d(p, q)] = 0.

In other words, the distance to the nearest neighbour and the distance to the farthest neighbour tend to converge as the dimension increases.
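This distance-concentration effect can be observed empirically with a short NumPy experiment; the uniform random data and the particular dimensions tried below are arbitrary assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# For each dimensionality k, draw random point pairs and estimate
# Var[d(p, q)] / E[d(p, q)]; the ratio shrinks as k grows.
for k in (2, 10, 100, 1000):
    p = rng.random((2000, k))
    q = rng.random((2000, k))
    d = np.linalg.norm(p - q, axis=1)
    print(k, d.var() / d.mean())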
Multidimensional indexing: Assuming that the curse of dimensionality does not come into play, query responses are facilitated by sorting the objects on the basis of some of their feature values and building appropriate indexes. The high-dimensional feature space is indexed using some multidimensional data structure (termed multidimensional indexing) with appropriate modifications to fit the high-dimensional problem environment. A similarity search, which finds objects similar to a target object, can be performed with a range search or a nearest neighbour search in the multidimensional data structure. However, unlike applications in spatial databases, where the distance function between two objects is usually Euclidean, this is not necessarily the case in a high-dimensional feature space, where the distance function may even vary from query to query on the same feature.

Searching in high-dimensional spaces is time-consuming. Performing range queries in high dimensions is considerably easier, from the standpoint of computational complexity, than performing similarity queries, as range queries do not involve the computation of distance. In particular, searches through an indexed space usually involve relatively simple comparison tests. However, if we have to examine all of the index nodes, the process is again time-consuming. In contrast, computing similarity in terms of nearest neighbour search makes use of distance, and the process of computing the distance can be computationally complex. For example, computing the Euclidean distance between two points in a d-dimensional space requires d multiplication operations and d-1 addition operations, as well as a square root operation (which can be omitted). Note also that computing similarity requires a definition of what it means for two objects to be similar, which is not always obvious.

Distance-based indexing: Often, the only information available is a distance function that indicates the degree of similarity (or dissimilarity) between all pairs of the N objects. Usually the distance function d is required to obey the triangle inequality, be nonnegative, and be symmetric, in which case it is known as a metric (also referred to as a distance metric). However, at times the distance function is not a metric. Often, the degree of similarity is expressed using a similarity matrix that contains interobject distance values for all possible pairs of the N objects. Given a distance function, we usually index the objects with respect to their distance from a few selected objects. We use the term distance-based indexing to describe such methods. There are two basic partitioning schemes: ball partitioning and generalized hyperplane partitioning. In ball partitioning, the data set is partitioned based on distances from one distinguished object, sometimes called a vantage point, into the subset that is inside and the subset that is outside a ball around that object. In generalized hyperplane partitioning, two distinguished objects p1 and p2 are chosen, and the data set is partitioned based on which of the two distinguished objects is closest: all the objects in subset A are closer to p1 than to p2, while the objects in subset B are closer to p2. The asymmetry of ball partitioning is a potential drawback of this method, as the outer shell tends to be very narrow for the metric spaces typically used in similarity search. In contrast, generalized hyperplane partitioning is more symmetric, in that both partitions form a "ball" around an object.

The advantage of distance-based indexing methods is that distance computations are used to build the index, but once the index has been built, similarity queries can often be performed with a significantly lower number of distance computations than a sequential scan of the entire dataset. Of course, in situations where we may want to apply several different distance metrics, the drawback of distance-based indexing techniques is that they require the index to be rebuilt for each different distance metric, which may be nontrivial. This is not the case for multidimensional indexing methods, which have the advantage of supporting arbitrary distance metrics (however, this comparison is not entirely fair, since the assumption when using distance-based indexing is that we often do not have any feature values, as, for example, with DNA sequences).
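A minimal sketch of ball partitioning around a randomly chosen vantage point, in plain Python; the median-radius rule, the 1-D example points, and the absolute-difference metric are illustrative assumptions rather than part of any particular index structure.

import random

def ball_partition(objects, distance, radius=None):
    """Split objects around a randomly chosen vantage point: 'inside' holds
    the objects within radius of the vantage point, 'outside' holds the rest.
    If no radius is given, the median distance is used for a balanced split."""
    idx = random.randrange(len(objects))
    vantage = objects[idx]
    rest = objects[:idx] + objects[idx + 1:]
    dists = [(distance(vantage, o), o) for o in rest]
    if radius is None:
        radius = sorted(d for d, _ in dists)[len(dists) // 2]  # median distance
    inside = [o for d, o in dists if d <= radius]
    outside = [o for d, o in dists if d > radius]
    return vantage, radius, inside, outside

# Example with 1-D points and absolute difference as the distance metric.
random.seed(1)
points = [1, 2, 3, 10, 11, 12, 20, 21]
print(ball_partition(points, lambda a, b: abs(a - b)))

Applying the same partitioning recursively to the inside and outside subsets yields a vantage-point-style tree that answers similarity queries with far fewer distance computations than a sequential scan.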
16.5 Multidimensional Analysis of Multimedia Data

Multidimensional analysis and descriptive mining of complex data objects: Many advanced, data-intensive applications, such as scientific research and engineering design, need to store, access, and analyze complex but relatively structured data objects. These objects cannot be represented as simple and uniformly structured records (i.e., tuples) in data relations. Such application requirements have motivated the design and development of object-relational and object-oriented database systems. Both kinds of systems deal with the efficient storage and access of vast amounts of disk-based, complex, structured data objects. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with an object identifier, a set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition hierarchies, and multimedia data, and a set of methods that specify the computational routines or rules associated with the object class. There has been extensive research in the field of database systems on how to efficiently index, store, access, and manipulate complex objects in object-relational and object-oriented database systems. Technologies handling these issues are discussed in many books on database systems, especially on object-oriented and object-relational database systems.

One step beyond the storage and access of massive-scale, complex object data is the systematic analysis and mining of such data. This includes two major tasks: (1) constructing multidimensional data warehouses for complex object data and performing online analytical processing (OLAP) in such data warehouses, and (2) developing effective and scalable methods for mining knowledge from object databases and/or data warehouses. The second task is largely covered by the mining of specific kinds of data (such as spatial, temporal, sequence, graph- or tree-structured, text, and multimedia data), since these data form the major new kinds of complex data objects. Thus, our focus in this section will mainly be on how to construct object data warehouses and perform OLAP analysis on data warehouses for such data.

A major limitation of many commercial data warehouse and OLAP tools for multidimensional database analysis is their restriction on the allowable data types for dimensions and measures. Most data cube implementations confine dimensions to nonnumeric data and measures to simple, aggregated values. To introduce data mining and multidimensional data analysis for complex objects, this section examines how to perform generalization on complex structured objects and construct object cubes for OLAP and mining in object databases. To facilitate generalization and induction in object-relational and object-oriented databases, it is important to study how each component of such databases can be generalized, and how the generalized data can be used for multidimensional data analysis and data mining.

Generalization of Structured Data

An important feature of object-relational and object-oriented databases is their capability of storing, accessing, and modelling complex structure-valued data, such as set- and list-valued data and data with nested structures. "How can generalization be performed on such data?" Let's start by looking at the generalization of set-valued, list-valued, and sequence-valued attributes. A set-valued attribute may be of homogeneous or heterogeneous type. Typically, set-valued data can be generalized by (1) generalization of each value in the set to its corresponding higher-level concept, or (2) derivation of the general behaviour of the set, such as the number of elements in the set, the types or value ranges in the set, the weighted average for numerical data, or the major clusters formed by the set. Moreover, generalization can be performed by applying different generalization operators to explore alternative generalization paths.
In this case, the result of generalization is a heterogeneous set.

Example 1: Generalization of a set-valued attribute. Suppose that the hobby of a person is a set-valued attribute containing the set of values {tennis, hockey, soccer, violin, SimCity}. This set can be generalized to a set of higher-level concepts, such as {sports, music, computer games}, or to the number 5 (i.e., the number of hobbies in the set). Moreover, a count can be associated with each generalized value to indicate how many elements are generalized to that value, as in {sports(3), music(1), computer games(1)}, where sports(3) indicates three kinds of sports, and so on. A set-valued attribute may be generalized to a set-valued or a single-valued attribute; a single-valued attribute may be generalized to a set-valued attribute if the values form a lattice or "hierarchy" or if the generalization follows different paths. Further generalizations on such a generalized set-valued attribute should follow the generalization path of each value in the set.
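The generalization in Example 1 can be sketched as a small Python function; the CONCEPT_OF mapping is a hypothetical, hand-built concept hierarchy used only for illustration.

from collections import Counter

# A hypothetical concept hierarchy mapping low-level values to higher-level concepts.
CONCEPT_OF = {
    "tennis": "sports", "hockey": "sports", "soccer": "sports",
    "violin": "music", "SimCity": "computer games",
}

def generalize_set(values):
    """Generalize a set-valued attribute: map each value to its higher-level
    concept and attach a count of how many elements generalized to it."""
    counts = Counter(CONCEPT_OF.get(v, "other") for v in values)
    return {f"{concept}({count})" for concept, count in counts.items()}

hobbies = {"tennis", "hockey", "soccer", "violin", "SimCity"}
print(generalize_set(hobbies))  # e.g. {'sports(3)', 'music(1)', 'computer games(1)'}
print(len(hobbies))             # alternative generalization: the element count, 5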
List-valued attributes and sequence-valued attributes can be generalized in a manner similar to set-valued attributes, except that the order of the elements in the list or sequence should be preserved in the generalization. Each value in the list can be generalized into its corresponding higher-level concept. Alternatively, a list can be generalized according to its general behaviour, such as the length of the list, the type of its elements, the value range, the weighted average value for numerical data, or by dropping unimportant elements in the list. A list may be generalized into a list, a set, or a single value.

Example 2: Generalization of list-valued attributes. Consider the following list or sequence of data for a person's education record: "((B.Sc. in Electrical Engineering, U.B.C., Dec., 1998), (M.Sc. in Computer Engineering, U. Maryland, May, 2001), (Ph.D. in Computer Science, UCLA, Aug., 2005))". This can be generalized by dropping less important descriptions (attributes) of each tuple in the list, such as dropping the month attribute to obtain "((B.Sc., U.B.C., 1998), ...)", and/or by retaining only the most important tuple(s) in the list, e.g., "(Ph.D. in Computer Science, UCLA, 2005)".

A complex structure-valued attribute may contain sets, tuples, lists, trees, records, and their combinations, where one structure may be nested in another at any level. In general, a structure-valued attribute can be generalized in several ways, such as by (1) generalizing each attribute in the structure while maintaining the shape of the structure, (2) flattening the structure and generalizing the flattened structure, (3) summarizing the low-level structures by high-level concepts or aggregation, and (4) returning the type or an overview of the structure. In general, statistical analysis and cluster analysis may help in deciding on the directions and degrees of generalization to perform, since most generalization processes aim to retain main features and remove noise, outliers, or fluctuations.

Aggregation and approximation in spatial and multimedia data generalization: Aggregation and approximation are another important means of generalization. They are especially useful for generalizing attributes with large sets of values, complex structures, and spatial or multimedia data. Let's take spatial data as an example. We would like to generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage. Such generalization often requires the merging of a set of geographic areas by spatial operations, such as spatial union or spatial clustering methods. Aggregation and approximation are important techniques for this form of generalization. In a spatial merge, it is necessary not only to merge the regions of similar types within the same general class but also to compute the total areas, average density, or other aggregate functions, while ignoring some scattered regions with different types if they are unimportant to the study. Other spatial operators, such as spatial-union, spatial-overlapping, and spatial-intersection (which may require the merging of scattered small regions into large, clustered regions), can also use spatial aggregation and approximation as data generalization operators.

Example 3: Spatial aggregation and approximation. Suppose that we have different pieces of land used for various agricultural purposes, such as the planting of vegetables, grains, and fruits. These pieces can be merged or aggregated into one large piece of agricultural land by a spatial merge. However, such a piece of agricultural land may contain highways, houses, and small stores. If the majority of the land is used for agriculture, the scattered regions for other purposes can be ignored, and the whole region can be claimed as an agricultural area by approximation.

A multimedia database may contain complex texts, graphics, images, video fragments, maps, voice, music, and other forms of audio/video information. Multimedia data are typically stored as sequences of bytes with variable lengths, and segments of data are linked together or indexed in a multidimensional way for easy reference. Generalization on multimedia data can be performed by recognition and extraction of the essential features and/or general patterns of such data. There are many ways to extract such information. For an image, the size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions can be extracted by aggregation and/or approximation. For a segment of music, its melody can be summarized based on the approximate patterns that repeatedly occur in the segment, while its style can be summarized based on its tone, tempo, or the major musical instruments played. For an article, its abstract or general organizational structure (e.g., the table of contents, or the subject and index terms that frequently occur in the article) may serve as its generalization. In general, it is a challenging task to generalize spatial and multimedia data in order to extract the interesting knowledge implicitly stored in them. Technologies developed in spatial databases and multimedia databases, such as spatial data accessing and analysis techniques, pattern recognition, image analysis, text analysis, content-based image/text retrieval, and multidimensional indexing methods, should be integrated with data generalization and data mining techniques to achieve satisfactory results. Techniques for mining such data are further discussed in the following sections.

Generalization of object identifiers and class/subclass hierarchies: "How can object identifiers be generalized?" At first glance, it may seem impossible to generalize an object identifier: it remains unchanged even after structural reorganization of the data.
However, since objects in an object-oriented database are organized into classes, which in turn are organized into class/subclass hierarchies, the generalization of an object can be performed by referring to its associated hierarchy. Thus, an object identifier can be generalized as follows. First, the object identifier is generalized to the identifier of the lowest subclass to which the object belongs. The identifier of this subclass can then, in turn, be generalized to a higher-level class/subclass identifier by climbing up the class/subclass hierarchy. Similarly, a class or a subclass can be generalized to its corresponding superclass(es) by climbing up its associated class/subclass hierarchy.

"Can inherited properties of objects be generalized?" Since object-oriented databases are organized into class/subclass hierarchies, some attributes or methods of an object class are not explicitly specified in the class but are inherited from higher-level classes of the object. Some object-oriented database systems allow multiple inheritance, where properties can be inherited from more than one superclass when the class/subclass "hierarchy" is organized in the shape of a lattice. The inherited properties of an object can be derived by query processing in the object-oriented database. From the data generalization point of view, it is unnecessary to distinguish which data are stored within the class and which are inherited from its superclass. As long as the set of relevant data is collected by query processing, the data mining process will treat the inherited data in the same manner as the data stored in the object class, and perform generalization accordingly.

Methods are an important component of object-oriented databases. They can also be inherited by objects, and many behavioural data of objects can be derived by the application of methods. Since a method is usually defined by a computational procedure/function or by a set of deduction rules, it is impossible to perform generalization on the method itself. However, generalization can be performed on the data derived by applying the method. That is, once the set of task-relevant data is derived by application of the method, generalization can then be performed on these data.

Generalization of class composition hierarchies: An attribute of an object may be composed of or described by another object, some of whose attributes may in turn be composed of or described by other objects, thus forming a class composition hierarchy. Generalization on a class composition hierarchy can be viewed as generalization on a set of nested structured data (which are possibly infinite, if the nesting is recursive). In principle, the reference to a composite object may traverse a long sequence of references along the corresponding class composition hierarchy. However, in most cases, the longer the sequence of references traversed, the weaker the semantic linkage between the original object and the referenced composite object. For example, an attribute vehicles owned of an object class student could refer to another object class car, which may contain an attribute auto dealer, which may refer to attributes describing the dealer's manager and children. Obviously, it is unlikely that any interesting general regularities exist between a student and her car dealer's manager's children.
Therefore, generalization on a class of objects should be performed on the descriptive attribute values and methods of the class, with limited reference to its closely related components via its closely related linkages in the class composition hierarchy. That is, in order to discover interesting knowledge, generalization should be performed on the objects in the class composition hierarchy that are closely related in semantics to the currently focused class(es), but not on those that have only remote and rather weak semantic linkages.

Construction and mining of object cubes: In an object database, data generalization and multidimensional analysis are not applied to individual objects but to classes of objects. Since a set of objects in a class may share many attributes and methods, and the generalization of each attribute and method may apply a sequence of generalization operators, the major issue becomes how to make the generalization processes cooperate among different attributes and methods in the class(es). "So, how can class-based generalization be performed for a large set of objects?" For class-based generalization, the attribute-oriented induction method for mining characteristics of relational databases can be extended to mine data characteristics in object databases. Consider that a generalization-based data mining process can be viewed as the application of a sequence of class-based generalization operators on different attributes. Generalization can continue until the resulting class contains a small number of generalized objects that can be summarized as a concise, generalized rule in high-level terms. For efficient implementation, the generalization of multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a manner similar to that for relational data cubes.

Notice that, from the application point of view, it is not always desirable to generalize a set of values to single-valued data. Consider the attribute keyword, which may contain a set of keywords describing a book. It does not make much sense to generalize this set of keywords to a single value. In this context, it is difficult to construct an object cube containing the keyword dimension. We will address some progress in this direction in the next section when discussing spatial data cube construction. However, it remains a challenging research issue to develop techniques for handling set-valued data effectively in object cube construction and object-based multidimensional analysis.

Generalization-based mining of plan databases by divide-and-conquer: To show how generalization can play an important role in mining complex databases, we examine a case of mining significant patterns of successful actions in a plan database using a divide-and-conquer strategy. A plan consists of a variable sequence of actions. A plan database, or simply a plan base, is a large collection of plans. Plan mining is the task of mining significant patterns or knowledge from a plan base. Plan mining can be used to discover travel patterns of business passengers in an air flight database or to find significant patterns in the sequences of actions involved in the repair of automobiles.
Plan mining is different from sequential pattern mining, where a large number of frequently occurring sequences are mined at a very detailed level. Instead, plan mining is the extraction of important or significant generalized (sequential) patterns from a plan base. Let's examine the plan mining process using an air travel example.

Example 4: An air flight plan base. Suppose that the air travel plan base shown in Table 1 stores customer flight sequences, where each record corresponds to an action in a sequential database, and a sequence of records sharing the same plan number is considered as one plan with a sequence of actions. The columns departure and arrival specify the codes of the airports involved. Table 2 stores information about each airport. There could be many patterns mined from a plan base like Table 1. For example, we may discover that most flights from cities in the Atlantic United States to Midwestern cities have a stopover at ORD in Chicago, which could be because ORD is the principal hub for several major airlines. Notice that the airports that act as airline hubs (such as LAX in Los Angeles, ORD in Chicago, and JFK in New York) can easily be derived from Table 2 based on airport size. However, there could be hundreds of hubs in a travel database. Indiscriminate mining may result in a large number of "rules" that lack substantial support, without providing a clear overall picture.

Figure 2: A multidimensional view of a database (plan base).

Multidimensional Analysis of Multimedia Data: "Can we construct a data cube for multimedia data analysis?" To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape. Let's examine a multimedia data mining system prototype called MultiMediaMiner, which extends the DBMiner system by handling multimedia data. The example database tested in the MultiMediaMiner system is constructed as follows. Each image contains two descriptors: a feature descriptor and a layout descriptor. The original image is not stored directly in the database; only its descriptors are stored. The description information encompasses fields such as image file name, image URL, image type (e.g., gif, tiff, jpeg, mpeg, bmp, avi), a list of all known Web pages referring to the image (i.e., parent URLs), a list of keywords, and a thumbnail used by the user interface for image and video browsing. The feature descriptor is a set of vectors for each visual characteristic. The main vectors include a color vector (a color histogram quantized over the color space, e.g., 8x8x8 for RGB), an MFC (Most Frequent Color) vector, and an MFO (Most Frequent Orientation) vector. The MFC and MFO vectors contain five color centroids and five edge-orientation centroids for the five most frequent colors and the five most frequent orientations, respectively. The edge orientations used are 0, 22.5, 45, 67.5, 90 degrees, and so on. The layout descriptor contains a color layout vector and an edge layout vector. Regardless of their original size, all images are assigned an 8x8 grid. The most frequent color for each of the 64 cells is stored in the color layout vector, and the number of edges for each orientation in each of the cells is stored in the edge layout vector. Other grid sizes, such as 4x4, 2x2, and 1x1, can easily be derived.
The Image Excavator component of MultiMediaMiner uses image contextual information, such as HTML tags in Web pages, to derive keywords. By traversing on-line directory structures, such as the Yahoo! directory, it is possible to create hierarchies of keywords mapped onto the directories in which the image was found. These graphs are used as concept hierarchies for the dimension keyword in the multimedia data cube.

"What kind of dimensions can a multimedia data cube have?" A multimedia data cube can have many dimensions. The following are some examples: the size of the image or video in bytes; the width and height of the frames (or pictures), constituting two dimensions; the date on which the image or video was created (or last modified); the format type of the image or video; the frame sequence duration in seconds; the image or video Internet domain; the Internet domain of pages referencing the image or video (parent URL); the keywords; a color dimension; an edge-orientation dimension; and so on. Concept hierarchies for many numerical dimensions may be automatically defined. For other dimensions, such as Internet domains or color, predefined hierarchies may be used. The construction of a multimedia data cube will facilitate multidimensional analysis of multimedia data primarily based on visual content, and the mining of multiple kinds of knowledge, including summarization, comparison, classification, association, and clustering. The Classifier module of MultiMediaMiner and its output are presented in Figure 3.

Figure 3: An output of the Classifier module of MultiMediaMiner.

The multimedia data cube seems to be an interesting model for multidimensional analysis of multimedia data. However, we should note that it is difficult to implement a data cube efficiently given a large number of dimensions. This curse of dimensionality is especially serious in the case of multimedia data cubes. We may like to model color, orientation, texture, keywords, and so on, as multiple dimensions in a multimedia data cube. However, many of these attributes are set-oriented instead of single-valued. For example, one image may correspond to a set of keywords. It may contain a set of objects, each associated with a set of colors. If we use each keyword as a dimension or each detailed color as a dimension in the design of the data cube, it will create a huge number of dimensions. On the other hand, not doing so may lead to the modelling of an image at a rather rough, limited, and imprecise scale. More research is needed on how to design a multimedia data cube that strikes a balance between efficiency and the power of representation.

"So, how should we go about mining a plan base?" We would like to find a small number of general (sequential) patterns that cover a substantial portion of the plans, and then we can divide our search efforts based on such mined sequences. The key to mining such patterns is to generalize the plans in the plan base to a sufficiently high level. A multidimensional database model, such as the one shown in Figure 2 for the air flight plan base, can be used to facilitate such plan generalization. Since low-level information may never share enough commonality to form succinct plans, we should do the following: (1) generalize the plan base in different directions using the multidimensional model; (2) observe when the generalized plans share common, interesting, sequential patterns with substantial support; and (3) derive high-level, concise plans. Let's examine this plan base.
Let's now examine this plan base. By combining tuples with the same plan number, the sequences of actions (shown in terms of airport codes) may appear as follows:

ALB - JFK - ORD - LAX - SAN
SPI - ORD - JFK - SYR
...

Table 3: Multidimensional generalization of the plan base
Table 4: Merging consecutive, identical actions in the generalized plans

These sequences may look very different. However, they can be generalized in multiple dimensions. When they are generalized based on the airport size dimension, we observe some interesting sequential patterns, like S-L-L-S, where L represents a large airport (i.e., a hub) and S represents a relatively small regional airport, as shown in Table 3.

The generalization of a large number of air travel plans may lead to some rather general but highly regular patterns. This is often the case if the merge and optional operators are applied to the generalized sequences: the former merges (and collapses) consecutive identical symbols into one, using the transitive closure notation "+" to represent a sequence of actions of the same type, whereas the latter uses the notation "[ ]" to indicate that the object or action inside the square brackets is optional. Table 4 shows the result of applying the merge operator to the plans of Table 3. By merging and collapsing similar actions, we can derive generalized sequential patterns, such as the pattern shown below:

[S] - L+ - [S]    [98.5%]

The pattern states that 98.5% of travel plans follow the pattern [S] - L+ - [S], where [S] indicates that action S is optional and L+ indicates one or more repetitions of L. In other words, a travel pattern consists of flying first from possibly a small airport, hopping through one or more large airports, and finally reaching a large (or possibly a small) airport.

After a sequential pattern is found with sufficient support, it can be used to partition the plan base. We can then mine each partition to find common characteristics. For example, from a partitioned plan base, we may find

flight(x, y) ^ airport_size(x, S) ^ airport_size(y, L) => region(x) = region(y)    [75%],

which means that for a direct flight from a small airport x to a large airport y, there is a 75% probability that x and y belong to the same region.

This example demonstrates a divide-and-conquer strategy, which first finds interesting, high-level, concise sequences of plans by multidimensional generalization of a plan base, and then partitions the plan base based on the mined patterns to discover the corresponding characteristics of the subplan bases. This mining approach can be applied to many other applications. For example, in Weblog mining, we can study general access patterns from the Web to identify popular Web portals and common paths before digging into detailed subordinate patterns.

The plan mining technique can be further developed in several aspects. For instance, a minimum support threshold similar to that in association rule mining can be used to determine the level of generalization and ensure that a pattern covers a sufficient number of cases. Additional operators in plan mining can be explored, such as less_than. Other variations include extracting associations from subsequences, or mining sequence patterns involving multidimensional attributes, for example, patterns involving both airport size and location. Such dimension-combined mining also requires the generalization of each dimension to a high level before examination of the combined sequence patterns.
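The following is a minimal sketch of the merge operator and of support counting for a pattern like [S] - L+ - [S]. It continues the illustrative S/L sequences from the previous sketch; the use of a regular expression to test pattern membership is a simplifying assumption, not the operator's formal definition.

import re

def merge(generalized_plan):
    """Collapse runs of identical symbols: S L L L S -> ['S', 'L+', 'S']."""
    merged = []
    for symbol in generalized_plan:
        if merged and merged[-1].rstrip("+") == symbol:
            merged[-1] = symbol + "+"          # consecutive repetition becomes "+"
        else:
            merged.append(symbol)
    return merged

def matches(plan, pattern=r"^(S)?L+(S)?$"):
    """Check a generalized plan against [S] - L+ - [S]."""
    return re.match(pattern, "".join(plan)) is not None

plans = [["S", "L", "L", "L", "S"], ["S", "L", "L", "S"], ["L", "L"], ["S", "S", "L"]]
print([merge(p) for p in plans])
support = sum(matches(p) for p in plans) / len(plans)
print(f"support of [S]-L+-[S]: {support:.1%}")

The optional operator corresponds to the "?" in the regular expression: the leading and trailing S may or may not be present.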
16.6 Mining Associations in Multimedia Data

"What kinds of associations can be mined in multimedia data?" Association rules involving multimedia objects can be mined in image and video databases. At least three categories can be observed:

Associations between image content and non-image content features: A rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this category, since it links the image content to the keyword sky.

Associations among image contents that are not related to spatial relationships: A rule like "If a picture contains two blue squares, then it is likely to contain one red circle as well" belongs to this category, since the associations concern only image contents.

Associations among image contents related to spatial relationships: A rule that relates the co-occurrence of image objects to their spatial arrangement (for example, one object lying between or beneath others) belongs to this category, since it involves spatial relationships among the objects.

To mine associations among multimedia objects, we can treat each image as a transaction and find frequently occurring patterns among different images.

"What are the differences between mining association rules in multimedia databases versus in transaction databases?" There are some subtle differences. First, an image may contain multiple objects, each with many features such as color, shape, texture, keyword, and spatial location, so there could be many possible associations. In many cases, a feature may be considered the same in two images at a certain level of resolution, but different at a finer resolution level. Therefore, it is essential to adopt a progressive resolution refinement approach. That is, we can first mine frequently occurring patterns at a relatively rough resolution level, and then focus only on those that have passed the minimum support threshold when mining at a finer resolution level. This is because patterns that are not frequent at a rough level cannot be frequent at finer resolution levels. Such a multiresolution mining strategy substantially reduces the overall data mining cost without loss of the quality and completeness of the data mining results, and it leads to an efficient methodology for mining frequent itemsets and associations in large multimedia databases. A small sketch of this idea is given at the end of this section.

Second, because a picture containing multiple recurrent objects is an important feature in image analysis, recurrence of the same objects should not be ignored in association analysis. For example, a picture containing two golden circles is treated quite differently from one containing only one. This is quite different from a transaction database, where the fact that a person buys one gallon of milk or two may often be treated the same as "buys milk." Therefore, the definition of a multimedia association and its measurements, such as support and confidence, should be adjusted accordingly.

Third, there often exist important spatial relationships among multimedia objects, such as above, beneath, between, nearby, left-of, and so on. These features are very useful for exploring object associations and correlations. Spatial relationships, together with other content-based multimedia features such as color, shape, texture, and keywords, may form interesting associations. Thus, spatial data mining methods and properties of topological spatial relationships become important for multimedia mining.
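As a small end-of-section sketch of progressive resolution refinement (the color codes, data, and threshold are illustrative, not a complete association miner), frequent colors are first found at a coarse quantization, and only those passing the minimum support threshold are refined at the finer 512-color resolution:

from collections import Counter

# Each "image" is a set of fine-resolution color codes 0..511 (8x8x8 quantization).
images = [
    {37, 101, 290},
    {37, 100, 411},
    {36, 101, 205},
    {500, 501},
]
MIN_SUP = 2          # absolute minimum support (number of images)

def coarsen(code, factor=8):
    """Map a fine color code to its coarse bucket (e.g., 512 colors -> 64)."""
    return code // factor

# Pass 1: count support of coarse color buckets (one count per image).
coarse_counts = Counter(c for img in images for c in {coarsen(x) for x in img})
frequent_coarse = {c for c, n in coarse_counts.items() if n >= MIN_SUP}

# Pass 2: refine only fine colors whose coarse bucket survived pass 1,
# since a color infrequent at the coarse level cannot be frequent at the fine level.
fine_counts = Counter(c for img in images for c in img if coarsen(c) in frequent_coarse)
frequent_fine = {c for c, n in fine_counts.items() if n >= MIN_SUP}

print("frequent coarse buckets:", frequent_coarse)
print("frequent fine colors:", frequent_fine)

The same two-pass idea extends to patterns combining several features, because any pattern that fails the support threshold at the rough level also fails it at the finer levels.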
16.7 Summary

A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text mark-ups, and linkages. In multimedia documents, knowledge discovery deals with non-structured information.

There are two forms of feature extraction: description-based and content-based. Correspondingly, we consider two main families of multimedia indexing and retrieval systems: description-based retrieval systems and content-based retrieval systems.

To facilitate the multidimensional analysis of large multimedia databases, multimedia data cubes can be designed and constructed in a manner similar to that for traditional data cubes built from relational data. A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.

Association rules involving multimedia objects can be mined in image and video databases. At least three categories can be observed: associations between image content and non-image content features, associations among image contents that are not related to spatial relationships, and associations among image contents related to spatial relationships.

16.8 Keywords

Multimedia database, Multimedia data mining, Description-based retrieval, Content-based retrieval, Color histogram-based signature, Multifeature composed signature, Wavelet-based signature, Mining associations in multimedia data.

16.9 Exercises

1. What is multimedia data?
2. Explain multimedia data mining.
3. How is feature extraction done in the case of text?
4. How is feature extraction done in the case of images?
5. What features are used for audio classification?
6. Explain briefly data pre-processing for multimedia data.
7. What are the two types of retrieval for multimedia data?
8. Explain multidimensional analysis of multimedia data.
9. What are the two types of descriptors of an image?
10. Explain mining associations in multimedia data.

16.10 References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann, 2006.
2. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2005 (ISBN: 0321321367).
3. G. K. Gupta, Introduction to Data Mining with Case Studies, Third Edition, Eastern Economy Edition (PHI, New Delhi), 2009.
4. Arun K. Pujari, Data Mining Techniques, Second Edition, Universities Press, 2009.