Forac Summer School, University Laval, Quebec, Canada, May-June 2004
Data Mining
Elena Irina Neaga
Forac Research Consortium
Laval University
Québec City, Canada
E-mail: [email protected]
Motivation: Why Data Mining?
Current State-of-the-Art
General Applications
Industry and Business Application Areas
Selected Algorithms and Methods
Distributed Data Mining using Intelligent Agents
Commercial Software Systems
Methodologies, Projects and Standards
Main References and Web-Resources
PolyAnalystTM software demonstration
“Knowledge is Power” Francis Bacon
The conventional model to turn data into information
and further to knowledge and probably wisdom is
defined as follows:
data ==> information ==> knowledge ==> wisdom
Knowledge discovery (KD) and data mining (DM) are
interdisciplinary areas based on statistical analysis,
database approaches and artificial intelligence (AI),
especially machine learning.
KD and DM incorporate complex algorithms from
statistics and AI, including imaginative and intuitive
techniques.
Data, Information, Knowledge
” Yesterday’s Data are today’s Information, and tomorrow’s Knowledge.” I. Spiegler
Spiegler, I. - Technology and knowledge: bridging a ‘‘generating’’ gap,
Information & Management 40 (2003) 533–539, Elsevier Science B.V.
Data is a collection of unanalyzed observations of
worldly events.
Information is a summary and communication of
the main components and relationships contained within
the data and presented within a specific context.
Knowledge is an interrelated collection of
procedures for acting toward particular results in the
world with associated references for when each is
applicable along with its range of effectiveness.
©Pyle, D. “Business Modeling and Data Mining”, Morgan Kaufmann, 2003.
”We are drowning in information, but starved for knowledge.”
John Naisbitt
Nowadays the amount of data generated by
several applications has dramatically increased, and this
data is a valuable source for the discovery of new
information and knowledge.
Also, the eruption of data has caused a
comparable explosion in the need to analyze it, which is
now possible thanks to increases in computational power;
such analyses might at one time have been too
computationally expensive.
Motivation (continued)
“ In an economy where the only certainty is uncertainty, the one
sure source of lasting competitive advantage is knowledge.“
Ikujiro Nonaka
•Organizations have huge databases containing vast
amounts of data which could be a source of new
information and knowledge;
•Business and marketing databases potentially
constitute a valuable resource for business and market
intelligence applications;
•Enterprises also rely on vast amounts of data and
information that is located in large databases. The value
of this information can be increased if additional
knowledge can be gained from it.
Knowledge Discovery from Databases (KDD) is the
nontrivial process of identifying valid, previously unknown,
potentially useful and ultimately understandable patterns in
data [Fayyad et al.,1996].
The whole KDD process includes, but is not
limited to, the following steps:
•data selection;
•data cleaning;
•data preprocessing, including reduction and transformation;
•data mining for identifying interesting patterns in datasets;
•data interpretation and evaluation.
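The steps above can be sketched as a chain of small functions over a toy dataset; all field names, values and the "pattern" found here are invented for illustration only:

```python
# Illustrative KDD pipeline: each step from the list above becomes a small
# function, applied in sequence to a toy dataset of (age, income) records.

def select(records):
    # data selection: keep only the fields relevant to the analysis
    return [(r["age"], r["income"]) for r in records]

def clean(rows):
    # data cleaning: drop records with missing values
    return [(a, i) for a, i in rows if a is not None and i is not None]

def preprocess(rows):
    # reduction/transformation: normalize income to thousands
    return [(a, i / 1000.0) for a, i in rows]

def mine(rows):
    # data mining: discover a trivial "pattern" -- mean income by age decade
    groups = {}
    for age, income in rows:
        groups.setdefault(age // 10 * 10, []).append(income)
    return {g: sum(v) / len(v) for g, v in groups.items()}

def interpret(patterns):
    # interpretation/evaluation: present the pattern in readable form
    return {f"age {g}-{g + 9}": f"{avg:.1f}k" for g, avg in patterns.items()}

raw = [{"age": 34, "income": 52000}, {"age": 37, "income": None},
       {"age": 41, "income": 61000}, {"age": 45, "income": 58000}]
result = interpret(mine(preprocess(clean(select(raw)))))
print(result)
```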
Patterns in the context of knowledge discovery and
data mining are defined as similar structures in a file or a
database that are relevant and repetitive.
A model is an abstraction that captures the essential
and global aspects of complex real-world systems
and/or sub-systems. The model may include the definition
of an information structure in order to store, process,
analyze and use the associated data.
In the context of DM the distinction between
pattern and model is arbitrary [Hand, 1998].
Discovery vs. Invention
•Discovery Science (DS) - the 5th conference was held
at Lubeck, Germany, in 2002.
•Knowledge is a topic which belongs to science and philosophy.
•Francis Bacon (1610) stated that knowledge is obtained
from experience, and that Nature is ruled by laws and
theories which scientists have the task of discovering
and describing by models. According to him, science is an
inductive process.
•On the other hand, science may be defined as a process of
inventing theories which are checked against experience.
This view emerged in the 19th century with the invention of
non-Euclidean geometry and the theories of relativity.
•This is still an open debate!
KDD vs. KM
KM supports knowledge creation; KDD leads to new
knowledge extracted from data.
KM typically deals with the managerial procedures for
producing and using knowledge within an organization,
such as individual and collective learning and knowledge transfer.
KDD is focused on the automated or semi-automated
knowledge generation from rough data based on
machine learning.
The difficulty of the formulation of distinct definitions
for KDD and KM is due to the paradox that knowledge
resides in the human’s mind, but it may be captured,
generated, stored, processed and reported using
information technologies.
Polanyi (1962, 1966) defines two types of
knowledge generally accepted in the field of KM,
which some KDD approaches also attempt to consider:
Tacit knowledge: implicit, mental models, and
experiences of individuals.
Explicit knowledge: formal models, rules, and
procedures.
An open debate may be related to the human
knowledge and computer knowledge approaches such
as knowledge discovery, knowledge engineering
(acquisition, knowledge based/expert systems) and
some areas of knowledge management.
Knowledge about the past, which is stable,
voluminous and accurate;
Knowledge about the present, which is unstable,
compact and may be inaccurate;
Knowledge about the future, which is
uncertain and speculative.
DM vs. Operations Research
Combining OR and data mining may be very useful in decision-making.
A discovered pattern is interesting only to the extent to which it
can be used in the decision-making process of an enterprise.
Generally OR deals with searching for the best solutions to
decision problems using mathematical techniques.
Optimization Solvers may be complemented and refined with
data mining algorithms.
Optimization algorithms are applied to data imported from
a DBMS and/or the Internet, but they may also process data
from a data warehouse and/or the patterns discovered in data.
The potential of applying DM and neural networks to OR is
illustrated by SAS/Operations Research and SAS/Enterprise Miner,
which may be used in the same environment.
Related Definitions
KD and DM are defined in several ways, but from the
perspective of computer science the best known definitions are:
•The process of searching for, retrieving, or visualizing
valuable information and new knowledge in large volumes of data.
•The exploration and analysis, by automatic or
semi-automatic means, of large quantities of data usually stored in
large databases.
•Dealing with the discovery of new correlations, hidden
knowledge, unexpected information, patterns and new rules from
large databases.
It is also possible to consider DM more as a set of organized
activities than as methods on their own, because the main algorithms
are employed from close areas such as statistics and/or artificial
intelligence.
Related Definitions (continued)
DM is the key element, or the core, of the whole process of
Knowledge Discovery in Databases (KDD), dealing with several
processing techniques for data, especially data in large
databases and data warehouses. A data warehouse is a central store of
data extracted from operational data.
Cristofor (2002) clearly specifies that there is no restriction
to the types of data that can be used as input for DM. The input data
can be a relational or object-oriented database, a data warehouse, a
web server log or a text file. DM is associated with large amounts of
data, but for research and testing applications, the test data sets are
of a limited length, and are usually flat files.
Related Definitions (continued)
Several research projects are inter or cross-disciplinary
with respect to data mining as well as to business, finance,
marketing and other areas. These approaches define data
mining as follows [Berry, Linoff, 2000], [Berson et al., 2000],
[Helberg, 2002], [Pyle, 2003]:
The process of utilizing the results of data exploration to
adjust or enhance business strategies and performance. The
information produced by DM engines requires intelligent review
by human experts.
A technique which helps uncover trends in time to make
the knowledge actionable.
Within every organization there is an amount of data which,
through KD and DM, can describe the past performance of the
organization.
Related Definitions (continued)
DM finds patterns and relationships in data by using
sophisticated techniques to build models.
A model is an abstract representation of reality
which is useful for understanding and analyzing it in
order to make decisions. There are two main kinds of models in
data mining:
Predictive models can be used to forecast
explicit values, based on patterns determined from known
results. They could predict financial trends, market
evolution, customer behaviour etc.
Descriptive models describe patterns in existing
data, and are generally used to create meaningful
groups or summaries of the data.
General Applications
Marketing (Direct Marketing, Market Basket Analysis)
Banking and Finance
Environmental and Molecular Sciences
Computer/Digital Art
Analysis of transactional data stored in a supermarket
database in order to improve the way in which
products are arranged on shelves.
Exploring a supermarket database in order to
determine patterns in the way people buy: which
products people buy together, and at what time.
Predicting customer demand for a specific product.
Data analysis of a promotional campaign, e.g. who is
most likely to reply to a direct-mail promotion.
Industry and
Business Application Areas
•Customer Relationship Management;
•Supply Chain Management;
•Enterprise Resource Planning;
•E-Business and E-Commerce;
•Demand Management (forecasting);
Improving the quality of products and services;
Improving business performance;
Improving the position on the market;
Improving customer satisfaction and building
customer loyalty (fidéliser les clients);
Data Mining Processing
[Figure: the data mining processing cycle - data management, data aggregation and integration, data modeling, prediction and forecasting based on new information, and presentation and interpretation of the results.]
© 1999 Michel Jambu – Introduction au data mining Analyse intelligente des données
DM using a Data Warehouse
[Figure: data mining applied to a data warehouse produces new information, knowledge and patterns.]
Integrated DM, DW and OLAP
[Figure: data mining, DSS and OLAP tools integrated over extended enterprise databases.]
A data warehouse (DW) is defined as the extraction
and integration of data from multiple sources and legacy
systems in an effective and efficient manner.
Usually a DW is obtained from operational data, and
the information in a DW is subject-oriented, non-volatile,
integrated and time dependent [Adriaans, Zantinge, 1996].
A DW contains large datasets which are organized
using the metadata concept, which describes the properties
and characteristics of the data and information stored in a central
repository. Metadata has become a topic in its own right,
dealing with the intensive study of data and its structure.
Data marts are subsets of data focused on
selected subjects; e.g. a marketing data mart may include
customer, product and sales information.
Data Mining vs. Statistics
“To statisticians, data mining conveys the sense of naive hope vainly struggling against the cold realities of chance.” D.J. Hand
DM and Statistics do not overlap, and the main
differences are presented below [Pyle, 1999]:
Statistics assumes a pattern and its algorithms attempt to
prove it; DM describes a kind of pattern and its algorithms find it;
DM processes data which is usually given as a database or
a large flat file; Statistics are applied to small and clean datasets;
The objective of DM is to find patterns, knowledge and
valuable new information in data and through statistical analysis
data is processed according to a defined objective of analysis;
Statistics consider data variation, but this is not
considered in DM;
In DM, residual data is useful and is processed; in
statistics it is removed from the original data set.
Data Mining vs. Statistics (continued)
DM is very much an inductive process, as opposed to the
hypothetico-deductive approach often seen as the paradigm for
how modern science progresses [Hand, 1998].
Statistics deals primarily with primary data analysis; DM is
entirely concerned with secondary data analysis.
Classical statistics deals with numeric data. DM is
applied to image data, audio data, text data, and geographical
data. Mining the web has become a distinct topic.
However, DM applied to several real-world
problems, such as supply chain optimization,
process and quality control, may not provide
solutions beyond the use of statistics, probability
theory, computational intelligence (ANN, fuzzy logic)
and operations research.
Several DM algorithms have their roots in
statistical analysis.
DM is not new as it joins several mathematical
and artificial intelligence problem solving techniques
and methods usually applied to large amounts of
historical data.
Selected Algorithms
and Methods
” It is by intuition that we discover and by logic we prove.”
Henri Poincaré
•Regression;
•Classification;
•Clustering;
•Association Rules;
•Sequential Analysis/Pattern Finding;
•Combined Methods;
•On-Line Analytical Processing (OLAP);
Linear and non-linear regression are widely used for
correlating data.
Statistical regression requires the specification of a
function over which the data is fitted.
In order to specify the function it is necessary to know the forms
of equations governing the correlation for a data set [Wang, 1999]. Even
though regression is considered to be a statistical technique, the
distinction is arbitrary because DM deals with predictive modeling, and
regression does exactly the same [Berry, Linoff, 2000].
There are many applications of regression, for example
predicting customer demand for a new product as a function of
advertising expenditure, or predicting time series where the input
variables can be time-lagged versions of the prediction variable.
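As a hedged illustration of regression as predictive modeling, the sketch below fits a straight line demand = a + b × spend by ordinary least squares; the observations are invented:

```python
# Minimal least-squares sketch: fit demand = a + b * advertising_spend
# from a handful of hypothetical (spend, demand) observations.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope b = cov(x, y) / var(x); intercept a = mean_y - b * mean_x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

spend = [1.0, 2.0, 3.0, 4.0]      # advertising spend (invented units)
demand = [3.1, 4.9, 7.2, 8.8]     # observed demand (invented)

a, b = fit_line(spend, demand)
predicted = a + b * 5.0           # extrapolate to spend = 5
print(a, b, predicted)
```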
Classification, also known as segmentation, is the
process of examining known groups of data to determine
which characteristics can be used to identify (or predict)
group membership.
Examples of classification include the
classification of trends in financial markets, grouping
customers based on their past transactions and predicting
their response to a particular product promotion [Fayyad et
al., 1996], [Helberg, 2002].
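One minimal way to illustrate classification is a nearest-neighbour sketch that assigns a new customer to the segment of the most similar past customer; the features and segment labels below are entirely invented:

```python
# Sketch of classification by nearest neighbour: assign a new customer to
# the group of the most similar past customer.

def classify(new_point, training):
    # training: list of (feature_vector, label); return the label of the
    # closest training point by squared Euclidean distance
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(training, key=lambda t: dist2(new_point, t[0]))
    return label

# hypothetical past customers: (avg_basket, visits_per_month) -> segment
past = [((10.0, 1.0), "occasional"),
        ((12.0, 2.0), "occasional"),
        ((55.0, 8.0), "loyal"),
        ((60.0, 9.0), "loyal")]

print(classify((50.0, 7.0), past))   # closest to the "loyal" group
```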
Association Rules
Association Rules were introduced by R.
Agrawal, T. Imielinski and A. Swami in 1993, and the most used
algorithm, Apriori, was proposed in 1994 by R. Agrawal and R. Srikant.
The basic idea of association rules is to search the
data for patterns of the following form:
IF (some conditions are true) THEN
(some other conditions are probably true)
Each condition extracted from data is called an
association rule, or simply a rule.
Association Rules generate rule-based models.
Association Rules (continued)
Association Rules have two main characteristics associated
with them that measure their value:
Coverage describes how much evidence is in the
training data set to back up the rule. It usually ranges between 0
and 1 (0% and 100%).
Confidence describes how likely the rule is to give a
correct prediction. It is also in the range between 0 and 1 (0%
and 100%).
In addition, the algorithm of association rules uses the support
of a rule which is the number of records or transactions which
confirm the rule [Cristofor, 2002].
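These measures can be computed directly; the sketch below evaluates the support and confidence of a hypothetical rule {bread} => {butter} over a toy transaction list:

```python
# Computing support and confidence for a candidate rule {bread} -> {butter}
# over a toy list of market-basket transactions (items invented).

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support_count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

antecedent, consequent = {"bread"}, {"butter"}
support = support_count(antecedent | consequent) / len(transactions)
confidence = support_count(antecedent | consequent) / support_count(antecedent)
print(support, confidence)
```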
Association Rules
Let I = {i1, i2, …, im} be a set of items;
Let D be a database, usually of transactions, where each transaction T ⊆ I;
For a given itemset X ⊆ I (a non-empty set of items) and a given transaction T, if
X ⊆ T then T contains X;
The support count σ(X) of an itemset X is also defined, and X is a large itemset
with respect to support s if σ(X) ≥ s × |D|, where |D| is the number of transactions
in D.
An Association Rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I
and X ∩ Y = ∅.
The Association Rule X ⇒ Y has confidence c if the ratio of σ(X ∪ Y) over
σ(X) equals c. The rule X ⇒ Y has support s in D if σ(X ∪ Y) = s × |D|.
Thus, if s is the given support, mining association rules means finding the set
L = {X | X ⊆ I ∧ σ(X) ≥ s × |D|}.
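The definitions above suggest a straightforward level-wise search for the large itemsets; the following is an unoptimized, illustrative sketch rather than a full Apriori implementation:

```python
# A minimal sketch of Apriori-style frequent itemset mining: find every
# itemset X whose support count reaches s * |D|, growing candidate itemsets
# one level (itemset size) at a time.

def apriori(D, min_count):
    # min_count is the absolute support threshold s * |D|
    items = sorted({i for T in D for i in T})
    current = [frozenset([i]) for i in items]
    large, k = [], 1
    while current:
        # keep candidates whose support count reaches the threshold
        frequent = [X for X in current
                    if sum(1 for T in D if X <= T) >= min_count]
        large.extend(frequent)
        k += 1
        # next level: size-k unions of this level's frequent itemsets
        current = sorted({a | b for a in frequent for b in frequent
                          if len(a | b) == k}, key=sorted)
    return large

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_sets = apriori(D, 3)   # itemsets in at least 3 of the 5 transactions
print([sorted(X) for X in frequent_sets])
```

A real Apriori implementation would also prune candidates having an infrequent subset before counting; this sketch omits that step for brevity.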
Clustering, like segmentation, identifies groups of
similar cases, but it does not predict outcomes or target
categories [Helberg, 2002].
Clustering algorithms are also called
unsupervised classification; they organize a group of
physical and abstract objects into classes of similar objects.
Clustering analysis supports the construction of
meaningful partitions of a large set of objects based on the
divide-and-conquer methodology, which decomposes a large-scale
system into smaller components to simplify design and analysis.
An example relates to identifying customers that would
make good targets for a new product marketing promotion.
Clustering (continued)
The clustering methods are divided into:
Hierarchical clustering which represents the
combination of cases and clusters that are similar to each
other, one pair at a time.
K-Means clustering which is based on the
assumption that the data falls into a known number (K) of
clusters. This method starts by defining initial profiles called
cluster centers, for the K clusters, sometimes using random
values for the clustering characteristics or sometimes using
dissimilar cases from the data set.
K-Means Clustering
In the K-Means algorithm, each object xi is
assigned to a cluster j according to its distance
d(xi, mj) from a value mj representing the cluster
itself; mj is called the representative of the cluster.
Given a set of objects D = {x1, . . . , xn}, a
clustering problem is to find a partition
C = {C1, . . . , Ck} of D such that:
1. Each Ci is associated to a representative mi;
2. xi ∈ Cj if d(xi, mj) ≤ d(xi, ml) for 1 ≤ l ≤ k, j ≠ l;
3. The partition C minimizes:
Σ(i=1..k) Σ(xj ∈ Ci) d²(xj, mi)
Sequential Patterns
Sequential pattern mining is also known as
sequential analysis. The main goal of this
algorithm is to find all sequential patterns
with a pre-defined minimum support,
represented by a data sequence. The input
data is represented by a list of sequential
transactions, often with an associated
transaction-time.
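Counting the support of a candidate sequential pattern can be sketched as a subsequence test over each customer's time-ordered transactions; the customers and items below are invented:

```python
# Sketch of sequential-pattern support counting: a pattern <a, b> is
# supported by a transaction sequence if "a" occurs in some transaction
# and "b" in a strictly later one.

def supports(sequence, pattern):
    # greedy scan: match pattern elements against successive transactions
    pos = 0
    for transaction in sequence:
        if pos < len(pattern) and pattern[pos] in transaction:
            pos += 1
    return pos == len(pattern)

# each customer = a time-ordered list of transactions (itemsets invented)
customers = [
    [{"a"}, {"b", "c"}, {"d"}],   # "a" before "b": supports <a, b>
    [{"a", "b"}, {"c"}],          # "a" and "b" together: does not
    [{"b"}, {"a"}],               # "b" before "a": does not
]

pattern = ["a", "b"]
support = sum(supports(seq, pattern) for seq in customers)
print(support)
```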
Combined Methods
•Combination of different algorithms for the knowledge
extraction process, based on rules with neural
networks (NN) and Case-Based Reasoning (CBR);
CBR represents the process of acquiring knowledge
represented by cases, using reasoning by analogy.
NNs are computer models based on the architecture of
the human brain which consists of multiple simple
processing units connected by adaptive weights.
•Combination of clustering and neural networks (NN);
•Combination of classification and NN.
Combined Methods
[Figure: knowledge extraction by neural networks and rules, based on an NN model.]
On-line Analytical Processing
OLAP and DM are considered to be two
complementary techniques for analyzing large
amounts of data in databases and/or data
warehousing environments.
OLAP is a way of performing multi-dimensional
analysis on relational databases.
DM is more powerful than OLAP: beyond the
multi-dimensional processing of a database,
new knowledge and hidden information can be
extracted through DM.
A multi-dimensional representation related to a
product family is shown in the next slide / figure.
OLAP (continued)
[Figure: an example multi-dimensional cube cell - City = London, Company = xx, Product = yy, Industry = Food, Profit = 56%.]
Distributed Data Mining using
Intelligent Agents
Intelligent Agents support the distributed and
collaborative KD&DM systems:
Each agent is responsible for a different step in the
KD&DM process such as pre-processing, DM, and
evaluation of the results;
Some agents, specialized in a pre-determined task, could
use the services of other agents, e.g. a classification agent
uses the services of a pre-processing agent;
The agents usually interact via a communication
language or messages;
The cooperative DM agents run concurrently and they
could be driven by an agent manager;
The mining agent systems could be flexibly integrated
with other agent systems.
• It is an extension of Microsoft ExcelTM;
• It can help to quickly start the DM on
spreadsheets and Excel files;
• It has extensive coverage of statistical and
machine-learning techniques: classification, prediction,
affinity analysis, data exploration and reduction.
SAS Enterprise MinerTM
It is supported by SEMMA (sampling, exploration, modification,
modeling and assessment) methodology;
• It combines data warehousing, data mining and OLAP
• It defines a comprehensive solution that addresses the whole
KDD process;
• It integrates advanced models and algorithms including clustering,
decision trees, neural networks, memory-based reasoning, linear and
logistic regression and associations;
• It also provides powerful statistical analysis capabilities;
• It uses advanced modeling techniques;
• It generates code in the SAS internal language as well as C and Java;
•It has a component for text mining.
SAS Enterprise MinerTM
It has been successfully used for a wide range of CRM
and e-commerce applications such as:
direct mail, telephone, e-mail and Internet-delivered
campaigns and promotions;
customer profiling;
identifying the most profitable customers and the underlying
reasons for their loyalty;
identifying fraudulent behaviour in an e-commerce site.
It is very easy to use because of its GUI:
business analysts with little statistical expertise can
quickly and easily navigate through the SEMMA process,
while data mining experts can explore the analytical
process in depth.
SPSS ClementineTM
•It is a DM workbench that enables users to quickly develop
predictive models and deploy them into business operations to
improve decision making;
•It delivers the maximum return on investment in the minimum
amount of time;
•It supports the entire DM process to shorten time-to-solution;
•It is designed around the de facto industry standard
methodology, the CRoss-Industry Standard Process for Data Mining
(CRISP-DM);
•It uses Clementine Application Templates (CATs), which follow
the CRISP-DM methodology and draw on previous real-world
application experience, so that a new project benefits from a
proven methodology and best practices.
SEMMA Methodology
SEMMA (Sample, Explore, Modify, Model, Assess) methodology
was developed by SAS Institute Inc. and is applied successfully with
SAS Enterprise MinerTM.
The steps of this methodology are as follows:
•Sample the data by extracting a portion of a large data set that
contains enough significant information but is small enough to be
manipulated quickly.
•Explore the data by searching for unanticipated trends and anomalies in
order to gain an understanding of the ideas and trends in the data set.
•Modify the data by creating, selecting and transforming the variables to
focus the model selection process.
•Model the data by allowing the system to search automatically for a
combination of data that reliably predicts a desired outcome.
•Assess the data by evaluating the usefulness and reliability of the findings
from the data mining process.
Projects and Standards
for DM
Overview of Main Projects and Standards
Projects & Standards (continued)
•ISO: SQL/MM is a collection of SQL user-defined types and routines to
define and apply DM models.
•DM Group: Predictive Model Markup Language (PMML) is an
open standard based on XML specification for exchanging DM models
between applications.
•OMG: Common Warehouse Metamodel (CWM) is a Unified
Modeling Language/XML specification for DM metadata.
•Microsoft: OLE DB for DM is a major step toward the standardization
of DM primitives, and it defines a DM object model for relational
databases.
•Oracle9i DM is an extension to Oracle9i Database Enterprise Edition
that embeds DM algorithms for classifications, predictions and association
rules. All models and functions are accessible through Java-based
Application Programming Interfaces called Java Data Mining (JDM).
Projects & Standards (continued)
•CRISP-DM is a project which has also defined and validated a
standard DM process that is applicable in diverse industry sectors,
and it attempts to make any DM project faster, cheaper and more
reliable.
•SolEuNet has the main aim of applying DM and Decision Support
(DS) systems in order to enhance efficiency, effectiveness and quality
of operations in business and industry. A virtual enterprise model has
been proposed as a dynamic problem-solving link between advanced
DM and DS systems.
•Kensington Enterprise DM (Imperial College, Dept. of
Computing, London, UK) project has developed Kensington
Discovery Edition (KDE) which is an enterprise-wide platform that
supports entire processes of KD, including dynamic information
integration, knowledge discovery and management.
Cross-Industry Standard Process for DM
CRISP-DM Project Description [Helberg, 2002]
Defining a DM Project
"Make it as simple as possible, but no simpler.“ Albert Einstein
[Figure: Essential Activities in a Data Mining Project - data identification and experimental design, data pre-processing, evaluation of results, and factors affecting the adoption of DW and DM.]
Main References
Adriaans, P., Zantinge, D. “Data Mining”, Addison-Wesley, 1996.
Berry, M., Linoff, G.S. “Mastering Data Mining: The Art and Science of
Customer Relationship Management”, John Wiley & Sons Inc., 2000.
Berson et al. “Building Data Mining Applications for CRM”, McGraw-Hill,
USA, 2000.
Bramer, M.A.(editor)”Knowledge Discovery and Data Mining”, IEE, 1999.
Chen, Z., “An integrated architecture for OLAP and data mining” in
Knowledge Discovery and Data Mining, Bramer, M.A.(editor), IEE, 1999.
Cristofor, L. “Mining Rules in Single-table and Multiple-table Databases”,
PhD Thesis, CS Dept. of Univ. of Massachusetts, Boston, USA, 2002.
Goglin, J.F., “La construction du datawarehouse du datamart au dataweb“,
2e édition revue, Hermes Science Publication, Paris, 1998, 2001.
Main References (continued)
Fayyad et al. (eds) “Advances in Knowledge Discovery and Data Mining”, AAAI
Press/The MIT Press, 1996.
Han, J., Kamber, M. “Data Mining: Concepts and Techniques”, Morgan Kaufmann,
Hand, D.J., “Data Mining: Statistics and More”, The American Statistician, Vol. 52,
No. 2, 1998.
Helberg, C. “Data Mining with Confidence”, 2nd edition, SPSS Inc., 2002.
Jambu, M., “ Introduction au data mining - Analyse intelligente des données“, 1999
Eyrolles, Paris.
Lange, S., Satoh, K., Smith, C.H. (eds.) “Discovery Science: 5th International
Conference, DS2002, Lubeck, Germany, Proceedings“, Berlin: Springer-Verlag, 2002.
Klosgen, W., Zytkow, J.M. (editors) “Handbook of Data Mining and Knowledge
Discovery “, Oxford University Press, 2002.
Pyle, D.“Data Preparation for Data Mining” Morgan Kaufmann, 1999.
Pyle, D. “Business Modeling and Data Mining”, Morgan Kaufmann, 2003.
Web-Resources
” Discovery consists of seeing what everybody has seen and
thinking what nobody has thought.”
Albert von Szent-Gyorgyi