CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION

Knowledge Discovery in Databases (KDD) is the process of automatically discovering previously unknown patterns, rules, and other regularities implicitly present in large volumes of data[1]. Data Mining (DM) denotes the discovery of patterns in a data set that has been prepared in a specific way. DM is often used as a synonym for KDD; strictly speaking, however, DM is just the central phase of the entire KDD process. The idea of automatic knowledge discovery in large databases is first presented informally, by describing some practical needs of users of modern database systems. The scope of KDD and DM is then briefly outlined in terms of a classification of KDD/DM problems, and of the common ground between KDD and several other scientific and technical disciplines whose well-developed methodologies and techniques are used in the field.

1.2 DATA MINING AND WAREHOUSING CONCEPTS

The past couple of decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate[2]. It has been estimated that the amount of information in the world doubles every 20 months, and that the size and number of databases are increasing even faster[10]. Many examples can be cited: point-of-sale data in retail, policy and claim data in insurance, medical history data in health care, and financial data in banking and securities are some of the types of data being collected.

Data storage became easier as large amounts of computing power became available at low cost; with the cost of processing power and storage falling, data became cheap. New machine learning methods for knowledge representation, based on logic programming and the like, were also introduced alongside traditional statistical analysis of data. These new methods tend to be computationally intensive, hence the demand for more processing power.

It was recognized that information is at the heart of business operations and that decision makers could use the stored data to gain valuable insight into the business[5]. Database management systems gave access to the data stored, but this was only a small part of what could be gained from the data. Traditional on-line transaction processing (OLTP) systems are good at putting data into databases quickly, safely, and efficiently, but are not good at delivering meaningful analysis in return[6]. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored to derive knowledge about the business. Data mining, also called data archaeology, data dredging, or data harvesting, is the process of extracting hidden knowledge from large volumes of raw data and using it to make crucial business decisions[11]. This is where data mining, or Knowledge Discovery in Databases (KDD), has obvious benefits for any enterprise.

1.3 DATA MINING DEFINITIONS

The term data mining has been stretched beyond its limits to apply to any form of data analysis. Some of the numerous definitions of data mining, or knowledge discovery in databases, are:

Extraction of interesting information or patterns from data in large databases is known as data mining[13].

According to William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus, "Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"[9].
This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.

According to Marcel Holsheimer and Arno Siebes, "Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database"[12].

Data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value as no direct use can be made of it; it is the hidden information in the data that is useful"[12].

Data mining is concerned with the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. It is the computer that is responsible for finding the patterns, by identifying the underlying rules and features in the data. The idea is that it is possible to strike gold in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one had noticed them before[2].

Data mining analysis tends to work from the data up, and the best techniques are those developed with an orientation towards large volumes of data, making use of as much of the collected data as possible to arrive at reliable conclusions and decisions. The analysis process starts with a set of data and uses a methodology to develop an optimal representation of the structure of the data, during which time knowledge is acquired. Once knowledge has been acquired, it can be extended to larger sets of data on the assumption that the larger data set has a structure similar to the sample data. Again, this is analogous to a mining operation where large amounts of low-grade material are sifted through in order to find something of value.

1.4 DATA MINING PROCESS

Data mining operations require a systematic approach. The process of data mining is generally specified in the form of an ordered list, but the process is not linear; at times, one may need to step back and rework a previously performed step[5]. The general phases in the data mining process to extract knowledge are:

1. Problem definition: This phase is to understand the problem and the domain environment in which the problem occurs. The problem must be clearly defined before proceeding further. The problem definition specifies the limits within which the problem needs to be solved, as well as the cost limitations.

2. Creating a database for data mining: This phase is to create a database where the data to be mined are stored for knowledge acquisition. Creating this database does not require a specialized database management system; existing storage holding large amounts of data can be used. The creation of the data mining database consumes about 50% to 90% of the overall data mining effort.
3. Exploring the database: This phase is to select and examine important data sets of the data mining database in order to determine their feasibility for solving the problem. Exploring the database is a time-consuming process and requires a good user interface and a computer system with good processing speed.

4. Preparation for creating a data mining model: This phase is to select the variables that will act as predictors. New variables are also built from the existing variables, and the ranges of variables are defined in order to support imprecise information.

5. Building a data mining model: This phase is to create multiple data mining models and to select the best of them. Building a data mining model is an iterative process; at times we need to go back to the problem definition phase in order to change the problem definition itself. The data mining model selected can be a decision tree, an artificial neural network, or an association rule model.

6. Evaluating the data mining model: This phase is to evaluate the accuracy of the selected data mining model. In data mining, the evaluation parameter is data accuracy, used to test the working of the model, because the information generated in the simulated environment varies from that of the external environment. The errors that occur during the evaluation phase need to be recorded, and the cost and time involved in rectifying them need to be estimated. External validation also needs to be performed in order to check whether the selected model performs correctly when provided with real-world values.

7. Deploying the data mining model: This phase is to deploy the built and evaluated data mining model in the external working environment. A monitoring system should monitor the working of the model and generate reports about its performance. The information in these reports helps enhance the selected data mining model.

1.5 KNOWLEDGE DISCOVERY IN DATABASES (KDD)

With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and interpretation of such data, and for the extraction of interesting knowledge that could help in decision-making. Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases[6]. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. Figure 1.1 shows data mining as a step in an iterative knowledge discovery process.

Figure 1.1: Knowledge discovery process

The Knowledge Discovery in Databases process comprises several steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

1. Data cleaning: Also known as data cleansing, this is the phase in which noisy data and irrelevant data are removed from the collection.

2. Data integration: At this stage, multiple data sources, often heterogeneous, may be combined into a common source.

3. Data selection: At this step, the data relevant to the analysis are decided on and retrieved from the data collection.

4. Data transformation: Also known as data consolidation, this is the phase in which the selected data are transformed into forms appropriate for the mining procedure.
5. Data mining: This is the crucial step in which clever techniques are applied to extract potentially useful patterns.

6. Pattern evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.

7. Knowledge representation: This is the final phase, in which the discovered knowledge is visually presented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

It is common to combine some of these steps. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse[7]. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data. (A small preprocessing sketch illustrating the cleaning, selection, and transformation steps appears at the end of this section.)

KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to pinpoint exactly where the value resides. The term is, however, a misnomer: since mining for gold in rocks is usually called "gold mining" and not "rock mining", data mining should by analogy have been called "knowledge mining" instead[14]. Nevertheless, data mining became the accepted customary term, and very rapidly overshadowed more general terms such as knowledge discovery in databases (KDD) that describe a more complete process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.
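The following is a minimal sketch, in Python with pandas, of how the cleaning, selection, and transformation steps described above might look in practice. The column names (age, income, label) and the small inline data set are hypothetical, chosen only for illustration; a real KDD pipeline would draw on integrated sources such as a data warehouse, and the integration step is omitted here.

```python
import pandas as pd

# Hypothetical raw collection: some noisy records and a missing value.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age":    [34, -1, 51, 28, None],   # -1 and None are noise/missing
    "income": [42000, 38000, 95000, 51000, 47000],
    "label":  ["safe", "risky", "safe", "safe", "risky"],
})

# 1. Data cleaning: drop records with missing or invalid values.
cleaned = raw[raw["age"].notna() & (raw["age"] > 0)]

# 3. Data selection: keep only the attributes relevant to the analysis.
selected = cleaned[["age", "income", "label"]]

# 4. Data transformation: rescale numeric attributes into [0, 1]
#    so the mining step treats them on a comparable footing.
transformed = selected.copy()
for col in ["age", "income"]:
    lo, hi = transformed[col].min(), transformed[col].max()
    transformed[col] = (transformed[col] - lo) / (hi - lo)

print(transformed)
```

The output of such a pre-processing stage is the prepared data set on which the data mining step proper (step 5 above) would then operate.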
1.6 DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES

The terms knowledge discovery in databases (KDD) and data mining are often used interchangeably. In fact, many other names have been given to this process of discovering useful (hidden) patterns in data: knowledge extraction, information discovery, exploratory data analysis, information harvesting, and unsupervised pattern recognition[12]. Over the last few years, KDD has come to refer to a process consisting of many steps, while data mining is only one of these steps.

Definition 1.1: Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data[12].

Definition 1.2: Data mining is the use of algorithms to extract the information and patterns derived by the KDD process[12].

The KDD process is often said to be nontrivial; however, we take the larger view that KDD is an all-encompassing concept. A traditional SQL database query can be viewed as the data mining part of a KDD process. Indeed, this may be viewed as somewhat simple and trivial, but it was not so 30 years ago. If we were to advance 30 years into the future, we might find that processes thought of today as nontrivial and complex are viewed as equally simple.

The definition of KDD includes the keyword useful. Although some definitions have included the term "potentially useful", we believe that if the information found in the process is not useful, then it really is not information. Of course, the idea of being useful is relative and depends on the individuals involved. KDD is a process that involves many different steps. The input to this process is the data, and the output is the useful information desired by the users. However, the objective may be unclear or inexact.

1.7 PROCESS MODELS OF DATA MINING

We need to follow a systematic approach to data mining for meaningful retrieval of data from large data banks. Several process models that provide systematic steps for data mining have been proposed by various individuals and organizations. The four most popular process models of data mining are[4]:

1. 5 A's process model
2. CRISP-DM process model
3. SEMMA process model
4. Six-Sigma process model

1.7.1 5 A's Process Model

The 5 A's process model has been proposed and used by SPSS Inc., Chicago, USA. The 5 A's in this process model stand for Assess, Access, Analyse, Act, and Automate[9]. SPSS uses this model as a preparatory step towards data mining and does not prescribe further steps for performing the various data mining tasks. After initially applying the 5 A's process model, SPSS uses the CRISP-DM process model, discussed in the next section, to analyse data in a data bank.

Figure 1.2: The 5 A's process model

The 5 A's process model of data mining generally begins by first assessing the problem at hand. The next logical step is to access, or accumulate, data related to the problem. After that, the accumulated data are analysed from different angles using various data mining techniques[8]. Meaningful information is then extracted from the analysed data, and the result is acted upon in solving the problem at hand. Finally, the process is automated by building software that uses the techniques applied in the earlier steps. Figure 1.2 shows the life cycle of the 5 A's process model.

1.7.2 CRISP-DM Process Model

The CRISP-DM process model has been proposed by a group of vendors: NCR Systems Engineering Copenhagen (Denmark), Daimler-Benz AG (Germany), SPSS/Integral Solutions Ltd. (United Kingdom), and OHRA (The Netherlands)[10]. In this process model, CRISP-DM stands for Cross-Industry Standard Process for Data Mining.

Figure 1.3: CRISP-DM Process Model

The CRISP-DM process model provides several data mining techniques that can be used and applied to specific datasets. Moreover, a single data mining technique may also be used for different types of datasets. The CRISP-DM process model is therefore not a strictly top-down process; one may jump from one phase of the model to another before completing a full cycle of the process. The life cycle of the CRISP-DM process model consists of six phases:

1. Understanding the business: This phase is to understand the objectives and requirements of the business problem and to generate a data mining definition for the business problem.

2. Understanding the data: This phase is to analyze the data collected in the first phase and to study its characteristics and matching patterns in order to propose a hypothesis for solving the problem.

3. Preparing the data: This phase is to create the final datasets that are input to the various modeling tools. The raw data items are first transformed and cleaned to generate datasets in the form of tables, records, and fields.
4. Modeling: This phase is to select and apply different modeling techniques of data mining, input the datasets prepared in the previous phase to these modeling techniques, and analyze the generated output.

5. Evaluation: This phase is to evaluate the model, or set of models, generated in the previous phase for better analysis of the refined data.

6. Deployment: This phase is to organize and implement the knowledge gained from the evaluation phase in such a way that it is easy for the end users to comprehend.

1.7.3 SEMMA Process Model

The SEMMA process model has been proposed and used by SAS Institute Inc. In this process model, SEMMA stands for Sample, Explore, Modify, Model, and Assess[6]. Figure 1.4 shows the life cycle of the SEMMA process model.

Figure 1.4: SEMMA Process Model

The life cycle of the SEMMA process model consists of five phases:

1. Sample: This phase is to extract a portion of a large data bank such that meaningful information can be retrieved from the extracted portion. Selecting a portion of a large data bank significantly reduces the amount of time required for processing.

2. Explore: This phase is to explore and refine the sampled data using various statistical data mining techniques in order to search for unusual trends and irregularities in the sample. For example, an online trading organization can use the technique of clustering to find a group of consumers that have similar ordering patterns.

3. Modify: This phase is to modify the explored data by creating, selecting, and transforming the predictive variables for the selection of a prospective data mining model. Depending on the problem at hand, one may need to add new predictive variables or delete existing ones to narrow down the search for a useful solution.

4. Model: This phase is to select a data mining model that automatically searches for a combination of data that can be used to predict the required result for the problem. Some of the modeling techniques that can be used are neural networks and statistical models.

5. Assess: This phase is to assess the usefulness and reliability of the data generated by the model selected in the previous phase and to estimate its performance. The selected model can be assessed by applying the sample data collected in the sample phase and checking the output.

1.7.4 Six-Sigma Process Model

Six-Sigma is a data-driven process model that eliminates defects, waste, and quality-control problems that commonly occur in a production environment. The model was pioneered by Motorola and popularised by General Electric (GE)[8]. Six-Sigma is very popular in various American industries due to its easy implementation, and it is likely to be implemented worldwide. This process model is based on various statistical techniques, the use of various types of data analysis techniques, and the systematic training of all the employees of an organization. The Six-Sigma process model postulates a sequence of five stages called DMAIC, which stands for Define, Measure, Analyse, Improve and Control. Figure 1.5 shows the five phases in the life cycle of the Six-Sigma process model.

Figure 1.5: Six-Sigma Process Model

The life cycle of the Six-Sigma process model consists of five phases:

1. Define: This phase is to define the goals of a project along with its limitations, and to identify the issues that need to be addressed in order to achieve the defined goal.
2. Measure: This phase is to collect information about the current process by which the work is done and to try to identify the basics of the problem.

3. Analyse: This phase is to identify the root causes of the problem at hand and to verify them.

4. Improve: This phase is to implement solutions that address the root causes identified in the analysis phase.

5. Control: This phase is to monitor the outcome of all the previous phases and to suggest improvement measures for each of them.

1.8 DATA MINING FUNCTIONALITIES

The kinds of patterns that can be discovered depend upon the data mining tasks employed. There are two types of data mining tasks: descriptive data mining tasks, which describe the general properties of the existing data, and predictive data mining tasks, which attempt to make predictions based on inference from the available data[2]. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

1. Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. With a data cube containing a summarization of the data, simple OLAP operations fit the purpose of data characterization.

2. Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class. The techniques used for data discrimination are very similar to those used for data characterization, with the exception that data discrimination results include comparative measures.

3. Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the video store manager to know what movies are often rented together, or whether there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P -> Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, "game") AND Age(X, "13-19") -> Buys(X, "pop") [s=2%, c=55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop. (A small sketch computing support and confidence appears at the end of this section.)

4. Classification: Classification analysis is the organization of data into given classes.
Also known as supervised classification, classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set in which all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, which is then used to classify new objects. For example, after starting a credit policy, the video store managers could analyze customer behaviour with respect to credit and label the customers who received credit with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to accept or reject credit requests in the future.

5. Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification: once a classification model is built from a training set, the class label of an object can be foreseen from the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecasting of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.

6. Clustering: Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).

7. Outlier analysis: Outliers are data elements that cannot be grouped into a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and so their analysis can be very valuable.

8. Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kind of patterns they can discover, or need to discover, from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.
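As a concrete illustration of the support and confidence measures defined in the association analysis item above, the following is a minimal Python sketch over a handful of hypothetical video store transactions. The transactions and item names are invented for illustration; a production system would run a dedicated algorithm such as Apriori over a full transactional database rather than scan transactions directly.

```python
# Hypothetical transactions: each set holds the items of one store visit.
transactions = [
    {"game", "pop"},
    {"game", "pop", "popcorn"},
    {"game"},
    {"drama_movie", "popcorn"},
    {"game", "pop"},
    {"drama_movie"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent): support of the union of the
    two item sets divided by the support of the antecedent alone."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: {game} -> {pop} [s, c]
s = support({"game", "pop"}, transactions)
c = confidence({"game"}, {"pop"}, transactions)
print(f"game -> pop  [s={s:.0%}, c={c:.0%}]")   # s=50%, c=75%
```

Here {game, pop} appears in 3 of the 6 transactions (s = 50%), while {game} appears in 4, so the rule holds in 3 of the 4 transactions containing a game rental (c = 75%), matching the [s, c] notation used above.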
1.9 CATEGORIES OF DATA MINING SYSTEMS

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following[4]:

1. Classification according to the type of data source mined: This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web data, etc.

2. Classification according to the data model drawn on: This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.

3. Classification according to the kind of knowledge discovered: This classification categorizes data mining systems based on the kind of knowledge discovered, or on the data mining functionalities offered, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive, offering several data mining functionalities together.

4. Classification according to the mining techniques used: Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, and database-oriented or data-warehouse-oriented approaches. The classification can also take into account the degree of user interaction involved in the data mining process, distinguishing query-driven systems, interactive exploratory systems, and autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and would offer different degrees of user interaction.

1.10 DATA MINING ISSUES

There are many important implementation issues associated with data mining[5]:

1. Human interaction: Since data mining problems are often not precisely stated, interfaces may be needed with both domain and technical experts. Technical experts are used to formulate the queries and assist in interpreting the results, while domain users are needed to identify training data and desired results.

2. Overfitting: When a model is generated for a given database state, it is desirable that the model also fit future database states. Overfitting occurs when the model does not fit future states. This may be caused by assumptions made about the data, or simply by the small size of the training database. For example, a classification model for an employee database may be developed to classify employees as short, medium, or tall. If the training database is quite small, the model might erroneously indicate that a short person is anyone under five feet eight inches, because there is only one entry in the training database below that height; many future employees would then be erroneously classified as short. Overfitting can arise under other circumstances as well, even when the data are not changing. (A minimal sketch illustrating this effect appears at the end of this section.)

3. Outliers: There are often many data entries that do not fit nicely into the derived model. This becomes even more of an issue with very large databases. If a model is developed that includes these outliers, then the model may not behave well for data that are not outliers.
4. Interpretation of results: Currently, data mining output may require experts to correctly interpret the results, which might otherwise be meaningless to the average database user.

5. Visualization of results: To easily view and understand the output of data mining algorithms, visualization of the results is helpful.

6. Large datasets: The massive datasets associated with data mining create problems when applying algorithms designed for small datasets. Many modeling applications scale exponentially with dataset size and are thus too inefficient for larger datasets. Sampling and parallelization are effective tools for attacking this scalability problem.

7. High dimensionality: A conventional database schema may be composed of many different attributes, and not all of them may be needed to solve a given data mining problem. This problem is sometimes referred to as the curse of dimensionality: there are many attributes (dimensions) involved, and it is difficult to determine which ones should be used. One solution is to reduce the number of attributes, which is known as dimensionality reduction.

8. Multimedia data: Most previous data mining algorithms are targeted at traditional data types (numeric, character, text, etc.). The use of multimedia data, such as that found in GIS databases, complicates or invalidates many proposed algorithms.

9. Missing data: During the preprocessing phase of KDD, missing data may be replaced with estimates. This and other approaches to handling missing data can lead to invalid results in the data mining step.

10. Irrelevant data: Some attributes in the database might not be of interest to the data mining task being performed.

11. Noisy data: Some attribute values might be invalid or incorrect. These values are often corrected before running a data mining application.

12. Changing data: Databases cannot be assumed to be static, yet most data mining algorithms do assume a static database. This requires that the algorithm be completely rerun whenever the database changes.

13. Integration: The KDD process is not currently integrated into normal data processing activities. KDD requests may be treated as special, unusual, or one-time needs, which makes them inefficient, ineffective, and not general enough to be used on an ongoing basis. Integration of data mining functions into traditional DBMS systems is certainly a desirable goal.

14. Application: Determining the intended use for the information obtained from the data mining function is a challenge. Indeed, how business executives can effectively use the output is sometimes considered the more difficult part, not the running of the algorithms themselves.

These issues should be addressed by data mining algorithms and products.
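To make the overfitting issue (item 2 above) concrete, the following is a minimal Python sketch, using only the standard library, of a 1-nearest-neighbour classifier that memorizes a tiny, hypothetical training set of employee heights in inches. The heights and labels are invented for illustration. The model fits its own training data perfectly, yet generalizes badly, precisely because the training database is too small to represent future database states.

```python
# Tiny, hypothetical training set: (height in inches, label).
# Only one "short" example exists, at 67 inches (just under 5'8").
train = [(67, "short"), (68, "medium"), (70, "medium"), (74, "tall")]

def classify(height):
    """1-nearest-neighbour: return the label of the closest training height."""
    nearest = min(train, key=lambda pair: abs(pair[0] - height))
    return nearest[1]

# The model fits its training data perfectly...
assert all(classify(h) == label for h, label in train)

# ...but because the single 67-inch record is the nearest neighbour of every
# height below about 5'8", all such future employees are labeled "short",
# even ones a richer training database might well label "medium".
print(classify(64))  # "short"
print(classify(66))  # "short" -- driven entirely by one training record
```

The cure is the same as suggested in the text: a larger, more representative training database, or external validation against data the model has never seen.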