Dr. Bjarne Berg DRAFT

DATA MINING, KDD and BUSINESS INTELLIGENCE

Data mining is often considered to be “a blend of statistics, artificial intelligence and data base research” and until the early 1990s the area was often considered to be “a dirty word in Statistics” (Daryl Pregibon, 1997). Others have defined data mining as a “process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” (Gartner Group) or the “extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data.” (Clifton).

In general, data mining falls into two broad categories. The first is the use of data mining to examine and describe large volumes of data or to draw inferences about a population based on a sample. This is known as descriptive data mining. The second is the use of data mining to make predictions about future outcomes based on findings in the data sets. This is known as predictive data mining.

These two categories employ several techniques. First, data mining can be used for deviation detection, which finds unusual observations in large sets of data. This is usually done with a larger data set as a reference to determine what constitutes an unusual value. Another technique is time change analysis, which tracks aggregate behavioral changes in previously unidentified groups (the analysis may establish those groups as well). Additional methods include analysis of expected and observed values (often done using chi-square analysis); clustering of data to determine classes that share some behavioral traits or states; concept descriptions to find shared characteristics of individuals in a larger group; and the use of data mining to describe how classes differ.
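Two of the techniques above, deviation detection against a reference data set and the comparison of expected and observed values, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not prescribe a particular rule, so the z-score threshold, the function names and all of the sample figures are hypothetical.

```python
from statistics import mean, stdev


def find_deviations(reference, candidates, threshold=3.0):
    """Flag candidates that lie more than `threshold` standard deviations
    from the mean of the reference data (a simple z-score rule; the
    threshold of 3.0 is an assumption, not something the text specifies)."""
    mu = mean(reference)
    sigma = stdev(reference)  # sample standard deviation of the reference set
    return [x for x in candidates if abs(x - mu) / sigma > threshold]


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical reference set: typical daily order counts.
reference_orders = [98, 102, 101, 99, 100, 103, 97, 100]
# Hypothetical new observations to screen for unusual values.
new_orders = [101, 99, 140, 100, 62]
unusual = find_deviations(reference_orders, new_orders)

# Hypothetical customer counts per segment, compared with the counts
# expected if customers were spread evenly across the three segments.
observed_counts = [30, 50, 40]
expected_counts = [40, 40, 40]
chi2 = chi_square_statistic(observed_counts, expected_counts)

print("Unusual observations:", unusual)
print("Chi-square statistic:", chi2)
```

The sketch stops at the test statistic. Judging whether it is significant would additionally require the degrees of freedom and a chi-square distribution table or a statistics library.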
Others use data mining for data dependency analysis to find permanent or semi-permanent relationships within the data. As this illustrates, data mining is a large field with many different uses and techniques. In general, however, these are grouped into six major approaches: pattern and association detection, summarization, clustering, deviation detection, sequence analysis and classification. Data mining is not used as a tool for descriptive statistics per se. It may instead use tools such as neural networks to create predictions that are accurate but that do not explicitly describe the relationships among the variables used to build the models (it is this usage that statisticians ‘frown’ upon).

Knowledge Discovery in Databases (KDD) is viewed as a ‘higher level’ usage of analysis tools for databases. While data mining may be one of those tools, it is by no means the only one. The goal of KDD is to organize high volumes of data into meaningful information. Sometimes this may be as simple as aggregation and simple descriptive statistics, such as establishing the mean, mode or median of a population. Other times it may include many of the data mining techniques, as well as visual representations and illustrations that make the data accessible to the user communities. So while data mining explores high volumes of data, the focus of KDD is data reduction and the transformation of variables into meaningful information that can be made available to decision makers. The key is therefore to “find useful features, dimensionality/variable reduction, invariant representation” and to perform pattern evaluation and knowledge presentation through data visualization, often by reducing repeated patterns. A separate area of KDD focuses on how to make the new knowledge actionable for a wider audience and examines how the newly acquired knowledge can be used.
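The simplest form of KDD mentioned above, aggregation combined with descriptive statistics, can be sketched as follows. The records, the grouping key and the `summarize` helper are hypothetical and serve only to show how many raw rows are reduced to a handful of figures that a decision maker can read.

```python
from collections import defaultdict
from statistics import mean, median, mode

# Hypothetical raw records: (region, purchase amount).
purchases = [
    ("North", 20), ("North", 35), ("North", 20),
    ("South", 50), ("South", 10), ("South", 50), ("South", 30),
]


def summarize(records):
    """Group values by key and reduce each group to a few summary figures."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "mode": mode(values),
        }
        for key, values in groups.items()
    }


summary = summarize(purchases)
for region, stats in summary.items():
    print(region, stats)
```

Seven rows become two summary records here. The same reduction applied to millions of rows is what turns a high volume of data into something that can be presented or visualized.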
Business Intelligence (BI) is a concept that is even broader than KDD and data mining. BI includes custom applications, on-line analytical processing (OLAP), data marts and Managed Query Environments (MQE), as well as data mining and visualization tools. A comprehensive knowledge management architecture is illustrated in Figure 1.

Figure 1: The Knowledge Management Architecture. Bjarne Berg, PricewaterhouseCoopers, 1999.
[Figure not reproduced. Labels that survive from the diagram: Metadata; Source Data; General Ledger; Invoicing Systems; Purchasing Systems; Other Internal Systems; External Data Sources; Data Extraction, Integration and Cleansing Processes; Extract; Transform; Translate; Attribute; Calculate; Derive; Summarize; Synchronize; Summation; Operational Data Store; Data Warehouse; Functional Area; Purchasing; Marketing and Sales; Corporate Information; Product Line; Location; Data Marts; Segmented Data Subsets; Summarized Data; Business Intelligence; Custom Developed Applications; Data Mining; Statistical Packages; Query Access Tools & OLAP; Data Resource Management and Quality Assurance.]

The core concept is that data mining can be used as a business intelligence tool and can be part of a larger knowledge management architecture, but it is not a required component. KDD is more of a framework for discovering, categorizing and organizing knowledge in a format and context that users can comprehend. It also devotes a substantial part of its effort to creating new knowledge by exploring vast amounts of data. The tools for KDD can therefore be statistical packages for monitoring and alerting, data mining applications, or simply custom-developed applications built to access operational data stores (ODS) and data warehouses. Finally, it is worth noting that data mining tends to focus more narrowly on examining data sets to answer specific questions using statistical and/or inferential techniques.