Dr. Bjarne Berg
DRAFT
DATA MINING, KDD and BUSINESS INTELLIGENCE
Data Mining is often considered to be “a blend of statistics, artificial intelligence and data base
research” and until the early 1990s the area was often considered to be “a dirty word in Statistics” (Daryl
Pregibon, 1997). Others have defined data mining as a “process of discovering meaningful new correlations,
patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition
technologies as well as statistical and mathematical techniques” (Gartner Group) or the “extraction of
interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from
huge amount of data” (Clifton).
In general, data mining falls into two broad categories. The first is the use of data mining to
examine and describe large volumes of data or to draw inferences about a population based on a sample.
This is known as descriptive data mining. The other category is the use of data mining to make predictions
about future outcomes based on findings in the data sets. This is known as predictive data mining. These
two categories employ several techniques. First, data mining can be used for deviation detection to find
unusual observations in large sets of data. This is usually done with a larger data set as a reference to
determine what constitutes unusual values. Another technique is time change analysis, which tracks aggregate
behavioral changes in previously unidentified groups (the groups themselves may also be established as part
of the analysis). Additional methods include analysis of expected and observed values (often done using
chi-square analysis); clustering of data to determine classes that share common behavioral traits or states;
concept descriptions to find shared characteristics of individuals in a larger group; and the use of data
mining to describe how classes differ. Others use data mining for data dependency analysis to find permanent
or semi-permanent relationships within the data.
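Two of the techniques named above can be sketched in a few lines of Python. This is only an illustration
under stated assumptions, not a method prescribed here: the function names, the sample figures and the
threshold of three standard deviations are all hypothetical, and a simple distance-from-the-mean rule is
just one of many ways to define "unusual". The chi-square function computes the usual statistic, the sum
of (observed - expected)^2 / expected over all categories.

```python
from statistics import mean, stdev


def find_deviations(reference, observations, threshold=3.0):
    """Return the observations that lie more than `threshold` standard
    deviations away from the mean of the larger reference data set."""
    center = mean(reference)
    spread = stdev(reference)  # sample standard deviation of the reference set
    if spread == 0:
        raise ValueError("reference data has no variation")
    return [x for x in observations if abs(x - center) / spread > threshold]


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected across categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical figures, chosen only to illustrate the calls.
reference_values = [98, 101, 100, 102, 99, 100, 101, 99]
new_values = [100, 97, 140, 101]
unusual = find_deviations(reference_values, new_values)

observed_counts = [18, 22, 20]
expected_counts = [20, 20, 20]
chi_square = chi_square_statistic(observed_counts, expected_counts)
```

A large chi-square value, judged against the chi-square distribution with the appropriate degrees of
freedom, suggests that the observed counts differ from what was expected by more than chance would explain.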
As this illustrates, data mining is a large field with many different uses and techniques. In general,
however, these are grouped into six major approaches: pattern and association detection, summarization,
clustering, deviation detection, sequence analysis and classification. Data mining is not used as a tool
for descriptive statistics per se, but may use tools such as neural networks to create predictions that
are accurate, but which do not explicitly describe the relationships among the variables that were used to
create the models (it is this usage that statisticians ‘frown’ upon).
Knowledge Discovery in Databases (KDD) is viewed as a ‘higher level’ usage of analysis tools for
databases. While data mining may be one of those tools, it is far from the only one. The goal of KDD is
to organize high volumes of data into meaningful information. Sometimes this may be as simple as
aggregation and simple descriptive statistics such as establishing the mean, mode or median of a
population. Other times it may include many of the data mining techniques as well as visual representation
and illustrations to make data accessible to the user communities. So while data mining explores high
volumes of data, the focus of KDD is data reduction and the transformation of variables into meaningful
information that can be made available to decision makers. The key is therefore to “find useful features,
dimensionality/variable reduction, invariant representation” and to perform pattern evaluation and
knowledge presentation through data visualization and often through reduction of repeated patterns. A
separate area of KDD focuses on how to make the new knowledge actionable to a wider audience and
on examining how the newly acquired knowledge can be used.
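The simple end of the KDD spectrum described above, aggregation plus descriptive statistics, might look
like the following sketch. The record layout (a region name paired with an order amount), the sample
values and the function name are assumptions made for illustration only. The point is the data reduction:
many detail records are collapsed into a handful of figures a decision maker can read.

```python
from collections import defaultdict
from statistics import mean, median, mode


def summarize_by_group(records):
    """Reduce (group, value) records to count, mean, median and mode per group."""
    grouped = defaultdict(list)
    for group, value in records:
        grouped[group].append(value)
    return {
        group: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "mode": mode(values),  # most frequent value in the group
        }
        for group, values in grouped.items()
    }


# Hypothetical detail records: (region, order amount).
orders = [
    ("East", 100), ("East", 80), ("East", 120), ("East", 100),
    ("West", 200), ("West", 50), ("West", 200),
]
summary = summarize_by_group(orders)
```

Seven detail rows become two summary rows here. The same pattern applies when millions of rows are
reduced to a small report.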
Business Intelligence (BI) is a concept that is even broader in nature than KDD and data mining. BI
includes custom applications, on-line analytical processing (OLAP), data marts and Managed
Query Environments (MQE), as well as data mining and visualization tools. A comprehensive knowledge
management architecture is illustrated in Figure 1.
Figure 1: The Knowledge Management Architecture. Bjarne Berg, PricewaterhouseCoopers, 1999.
[Diagram not reproduced. Labels recoverable from the figure: Source Data (General Ledger, Invoicing Systems,
Purchasing Systems, Other Internal Systems, External Data Sources); Extract and Transform; Data Extraction,
Integration and Cleansing Processes; Operational Data Store; Data Warehouse; Data Marts; Segmented Data
Subsets and Summarized Data; Business Intelligence (Custom Developed Applications, Data Mining, Statistical
Packages, Query Access Tools & OLAP); Metadata; Data Resource Management and Quality Assurance.]
The core concept is that data mining can be used as a business intelligence tool and can be part of
a larger knowledge management architecture, but is not a required component. KDD is more of a framework
for discovering, categorizing and organizing knowledge in a format and context that can be comprehended
by users. It also devotes a substantial part of its effort to creating new knowledge by exploring vast
amounts of data. The tools for KDD can therefore be statistical packages for monitoring and alerting, data
mining applications or simply custom developed applications that are built to access operational data stores
(ODS) and data warehouses. Finally, it is worth noting that data mining tends to have a closer focus on
examining data sets to answer specific questions using statistical and/or inferential techniques.