Amer Kanj
Data Mining
For Business Professionals
Data Mining Overview
Types of Data Mining
Why use Data Mining
How do we Mine Data
Models of Data Mining
Data Mining deals with large volumes of
data stored in DBMS
It is the process of analyzing large
databases to find useful patterns
Data Mining is the process of automating
information discovery
It automates the process of discovering
useful trends and patterns
The fundamental assumption of Data
Mining is that large data may contain
recurring hidden patterns
A Data Mining tool does not require any
It tries to discover relationships and
hidden patterns that may not always be
Types of Data Mining
Business professionals look for Data Mining
approaches that meet their needs.
They require Data Mining to:
Be understandable
Have good performance
Be accurate
They define three fundamental approaches to
Data Mining:
Classification Studies
Clustering Studies
Visualization Studies
Classification Studies
Classification studies = Supervised learning
Very common in business world.
A telecommunication company’s analyst wants to:
 Understand why some customers remain loyal while others leave
 Predict which customers the company is likely to lose to competitors
Classification Studies (cont)
So he can:
Construct a model derived from historical data of loyal
customers versus customers who have left
A good model enables him to better understand his
customers and to predict which customers will stay and
which will leave
A study will identify an overall goal and
the data to be used
Classification Rules
Classification rules help assign new objects to a set of classes
 Given a new automobile insurance applicant, should
he/she be classified as low risk, medium risk or high risk?
Classification rules for above example could use a
variety of knowledge, such as educational level of
applicant, salary of applicant, age of applicant, etc…
 ∀ person p, p.degree = masters and p.income > 75,000
⇒ p.credit = excellent
 ∀ person p, p.degree = bachelors and
(p.income >= 25,000 and p.income <= 75,000)
⇒ p.credit = good
Classification rules can be compactly shown as a
decision tree
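The two example rules can be sketched in Python as nested checks, which is also how a small decision tree reads. This is only a minimal sketch: the parameter names `degree` and `income` and the fallback label `unknown` (for applicants that no rule covers) are assumptions made for illustration, not part of the slides.

```python
def classify(degree, income):
    """Apply the two example rules; names and fallback label are assumptions."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return "unknown"  # no rule covers this applicant


print(classify("masters", 90_000))
print(classify("bachelors", 40_000))
```

A real rule set would need rules (or a default class) for every combination of inputs; the sketch makes that gap visible through the `unknown` label.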
Clustering Studies
Clustering Studies = Unsupervised Learning
A method of grouping rows of data that share
similar trends and patterns
We have no dependent variable
Clustering can also be based on historical
patterns, but the outcome (loyal or lost) is not
supplied with the training data
Clustering techniques try to look for similarities
within a data set and group similar rows
together into clusters or segments
Customers are clustered into four
Cluster 1
Income: High
Children: 1
Car: Luxury
Cluster 2
Income: High
Children: 0
Car: Compact
Cluster 4
Income: Medium
Children: 2
Car: Sedan and Car: Truck
Income: Medium
Children: 3
Cluster 3
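One common way to group similar rows without a dependent variable is k-means. The slides do not name a specific clustering algorithm, so the sketch below is an assumption chosen for illustration, and the customer rows (income in thousands, number of children) are hypothetical.

```python
import math


def kmeans(points, k, iterations=10):
    """Tiny k-means: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each row to its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters


# Hypothetical rows: (income in thousands, number of children)
customers = [(120, 1), (50, 3), (115, 0), (55, 2), (125, 1), (48, 3)]
for group in kmeans(customers, k=2):
    print(group)
```

Because income is on a much larger scale than the number of children, it dominates the distance here; in practice the columns would usually be rescaled first.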
Visualization Studies
It is simply the graphical presentation of data
Microsoft Excel has graphing and
mapping capabilities in its product
Representing data graphically often
brings out points that you would not
normally see
Why use Data Mining
Direct Marketing
Trend Analysis
Fraud Detection
Forecasting in Financial Markets
Direct Marketing
The ability to predict who is most likely or
most desirable to buy certain product can
save companies immense amounts in
marketing expenditures
Trend Analysis
Understanding trends in the marketplace
is a strategic advantage, because it is
useful in reducing costs and time to market
Fraud Detection
Data Mining techniques can model which
insurance claims, cellular phone calls, or
credit card purchases are likely to be fraudulent
Forecasting in Financial Markets
Data Mining is used extensively to model
financial markets
How Do We Mine Data
There are five steps to Data Mining:
 Data Manipulating
 Defining a study
 Reading the data and building a model
 Understanding the model
 Prediction
Data Preparation
Data preparation is considered the
heart of the Data Mining process
Data usually accumulates in a transactional
database, where actual records of
transactions are stored
Data preparation requires that the data from
distributed databases be pooled together and
cleansed of redundant, inconsistent,
incomplete, irrelevant, and otherwise
inappropriate data
Data Preparation (Cont)
Data Cleaning:
 A column containing a list of soft drinks may have the
values “Pepsi” , “Pepsi Cola”, and “Cola”.
 The values refer to the same drink, but are not known to
the computer as the same.
Missing Values:
 Some Data Mining approaches require rows of data to
be complete in order to mine the data
 If too many values are missing in a data set, it becomes
hard to gather any useful information from this data or to
make predictions from it
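The cleaning and missing-value steps above can be sketched as follows. The `CANONICAL` mapping, the column names, and the choice to drop incomplete rows (rather than fill them in) are all assumptions made for illustration.

```python
# Hypothetical mapping of spelling variants to one canonical value.
CANONICAL = {"pepsi": "Pepsi", "pepsi cola": "Pepsi", "cola": "Pepsi"}


def clean(rows):
    """Normalize the drink name and drop rows with a missing value."""
    cleaned = []
    for row in rows:
        if any(value is None or value == "" for value in row.values()):
            continue  # incomplete row: skip it
        drink = row["drink"].strip().lower()
        cleaned.append({**row, "drink": CANONICAL.get(drink, row["drink"])})
    return cleaned


rows = [
    {"customer": "A", "drink": "Pepsi Cola"},
    {"customer": "B", "drink": "Cola"},
    {"customer": "C", "drink": None},
]
print(clean(rows))
```

Dropping rows is the simplest policy; when too many values are missing, it discards most of the data set, which is the problem the slide describes.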
Data Preparation (Cont)
Data Derivation:
 If there are columns called maximum$-2002 and
maximum$-2003 to describe the dollars spent in
2002 and 2003
 Then an interesting derivation is $-difference, which
is the change in amount of money spent between
2002 and 2003
Merging Data:
 Data is usually stored in the form of tables
 Merging data in a relational system can be
achieved in a number of ways:
1. Merging tables through a view (Query Tools)
2. An SQL statement, or
3. An export of data into a flat file
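As a rough illustration of merging through an SQL statement and deriving a difference column, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names (`max_2002`, `max_2003`), and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE spending (customer_id INTEGER, max_2002 REAL, max_2003 REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO spending VALUES (1, 100.0, 150.0), (2, 80.0, 60.0);
""")

# Merge the two tables with a join and derive the year-over-year difference.
merged = conn.execute("""
    SELECT c.name, s.max_2003 - s.max_2002 AS difference
    FROM customers AS c
    JOIN spending AS s ON s.customer_id = c.id
    ORDER BY c.id
""").fetchall()
print(merged)
```

The same `SELECT` could be saved as a view, or its result exported to a flat file, which correspond to the other two merging options listed above.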
Defining a Study
Differs between Supervised (Classification)
and Unsupervised (Clustering) learning
For Supervised learning:
 Involves articulating a goal
 Specifying the data fields that are used in the study
For Unsupervised learning:
 The goal is to group similar types of data, usually used in
many activities, or
 To identify exceptions in a data set, which is useful in
discovering fraudulent or incorrect data
Read the data and build a Model
A data mining product reads a data set
and constructs a model
A model will summarize large amounts of
data by accumulating indicators
Such indicators include:
 Frequencies: Show how often a certain value occurs
 Weights (or impacts): Indicate how well some inputs indicate the
occurrence of an output
 Conjunctions: Sometimes inputs have more weight together
than apart
 Differentiation: Indicates how much more important an input
criterion is to one outcome than another
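Two of these indicators are easy to sketch. The slides do not give formulas, so treating a "weight" as the share of rows with a given input value that end in a given outcome is an assumption, and the rows below are hypothetical.

```python
from collections import Counter

# Hypothetical rows: (contract type, outcome)
rows = [
    ("monthly", "lost"), ("monthly", "lost"), ("monthly", "loyal"),
    ("yearly", "loyal"), ("yearly", "loyal"), ("yearly", "lost"),
    ("yearly", "loyal"),
]

# Frequencies: how often each input value occurs.
frequencies = Counter(contract for contract, _ in rows)


def weight(value, outcome):
    """Share of rows with this input value that ended in the given outcome."""
    matching = [o for v, o in rows if v == value]
    return matching.count(outcome) / len(matching)


print(frequencies)
print(weight("monthly", "lost"), weight("yearly", "lost"))
```

Comparing the two weights for the same outcome gives a crude view of differentiation: here a monthly contract points to a lost customer more strongly than a yearly one does.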
Understanding the Model
Model understanding takes different
forms based on the type of model
used to represent the data
We will discuss Data Mining
Models later…
Prediction
Prediction is the process of choosing the
best possible outcomes based on
historical data
Predictive data mining methods fall into
three broad categories:
 Mathematical methods
 Logic methods
 Distance methods
Prediction (Cont)
Mathematical method:
 Linear math solution
 Non-linear math solution
Logic methods:
 Quite different from what math methods produce
 Logical methods often produce tree-like solutions
 Best known logical solutions are decision trees, and
decision rules.
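A linear math solution can be as simple as fitting a straight line to historical data and extending it. The sketch below uses ordinary least squares as one example of such a method; the data (months as a customer against monthly spend) is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: months as a customer vs. monthly spend
months = [1, 2, 3, 4]
spend = [12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(months, spend)
print(slope * 5 + intercept)  # predicted spend for month 5
```

A logic method applied to the same data would instead produce branching conditions, such as a tree of if/else tests on the inputs.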
Prediction (Cont)
Distance methods:
 A representative sample of cases is kept on file
 These cases will be used as a benchmark for
classifying new cases
 Features of the new case are measured against
features of the benchmark cases for proximity
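The distance approach described above resembles nearest-neighbour classification. In the sketch below, the benchmark cases, the two features (age, years as a customer), and the use of Euclidean distance are assumptions made for illustration.

```python
import math

# Hypothetical benchmark cases kept on file: (features, label)
benchmark = [
    ((25, 1), "lost"),    # (age, years as customer)
    ((30, 2), "lost"),
    ((45, 10), "loyal"),
    ((50, 12), "loyal"),
]


def classify_by_distance(case):
    """Return the label of the benchmark case closest to the new case."""
    _, label = min(benchmark, key=lambda item: math.dist(item[0], case))
    return label


print(classify_by_distance((28, 1)))
print(classify_by_distance((48, 9)))
```

Using only the single closest case keeps the sketch short; a vote among several nearby cases is a common variation.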
Prediction (Cont)
Here are a few interesting predictive
 Understanding why a prediction is made: some models
will provide the reasons why a prediction is made
 Margin of victory: if the best case prediction has a score
of 100 and the challenger prediction has a score of 50,
then the margin of victory is 50%. If the prediction has a
score of 100 and the challenger has 99, then the margin
of victory would be 1%. Generally, the higher the margin
of victory, the more likely the prediction is to be true
Prediction (Cont)
 Scenario playing: Some prediction models have
the ability to change parameters to see how
predictions change
 Understanding prediction affinities: Set two
variables constant and see what the other
predictions would look like
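The margin of victory examples given earlier (100 against 50, and 100 against 99) can be reproduced with a one-line calculation. Dividing by the best score is an assumption: the slides only show cases where the best score is 100, so the normalization is not confirmed by the source.

```python
def margin_of_victory(best_score, challenger_score):
    """Gap between best and challenger, as a percentage of the best score."""
    return 100 * (best_score - challenger_score) / best_score


print(margin_of_victory(100, 50))
print(margin_of_victory(100, 99))
```

A small margin signals that the model found the two outcomes almost equally plausible, so the prediction deserves less trust.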
Data Mining Models
Decision Trees
Genetic Algorithms
Neural Nets
Agent Network Technology
Hybrid Models
Data mining Models (Cont)
Decision Trees:
 Creating a tree-like structure to describe a data set
 The greatest benefit to decision tree approaches is their
Genetic Algorithms:
 Are a method of combinatorial optimization based on
processes in biological evolution
Data Mining Models (Cont)
Neural Nets:
 Are used extensively in the business world as predictive
 Neural Nets are widely used in the financial market to
model fraud in credit cards and monetary transactions
Agent Network Technology:
 This modeling method treats all data elements as agents
that are connected to each other in a significant way
Data Mining Models (Cont)
Hybrid Models:
 Vendor Tools that make use of more than one
approach are referred to as hybrid systems
 Being a hybrid system does not always imply that
the tool uses a hybrid algorithm
 For example, Thinking Machines, with their Darwin
product, makes use of several different mining
algorithms. While the algorithms themselves are not
hybrid, the product uses the algorithms in
Data Mining Models (Cont)
Statistics:
 Used to create a model of data sets
 Uses probability, data analysis, and
statistical inference
Thank You For