Data Mining and OLAP
Stages of Data Mining and OLAP
(with thanks to Janet Francis)
CREATE THE DIFFERENCE
Aims
• This lecture aims to cover
– The nature of data mining
– Stages of Data Mining
– OLAP
What is Data Mining
• The term Data Mining is used because mining for
valuable data in a large database is similar to mining for
a valuable ore in a huge mountain.
– In a mining operation large amounts of low grade materials are
sifted through in order to find something of value.
– In its computing counterpart large volumes of data are searched
in an attempt to find something worthwhile.
A useful Scenario
• The following scenario will be used in this lecture in
order to make the processes seem more relevant.
– Beaconside JAMS PLC (BJP) supplies jams/cake fillings/sauces
to confectioners and bakers. Customers include large
multinationals and small specialist outlets.
– BJP has centres in England, Scotland and Spain
– BJP is not a manufacturing organisation: it is a retailer, which
means that it buys from a distributor and sells on to customers.
Human vs Data Mining
• Human
– Usually takes the form of hypothesis verification
• The analyst has a theory: we sell more high margin goods pro rata
to specialist outlets than to multinationals, and specialist outlets are
more profitable
• The analyst gathers the necessary data and proves, disproves, or
amends and re-tests the hypothesis
• Data Mining
– Can perform hypothesis verification on vast quantities of data
• Data Mining allows the user to discover patterns that the user did not
know existed!
Types of Data Mining
• Directed
• Undirected
Directed Data Mining
• A top-down approach – used when there is some idea of
what is being looked for (some direction for the search)
or an idea of what might be predicted
• The goal is to create a predictive model or set of models
from the existing data which can then be used to predict
future trends.
• For example - which customers are most likely to be
interested in a new type of cake filling
Undirected Data Mining
• A bottom-up approach - the data itself determines the
relationships – for example using clustering. If patterns
are found it is for the user to determine whether the
patterns are useful or not.
• The goal is to find patterns in the existing data.
– Human interaction is necessary because only people can
determine what significance, if any, the patterns have
• This type of data mining is one of the key steps in Knowledge
Discovery in Databases (KDD).
• Necessary to know how the model works and how it comes up
with the answer in order to decide if patterns are valid
• Example: People who are over 5ft tall with brown hair
like Blackcurrant Jam
Approaches to Data Mining
• Descriptive
– Describes the current data in terms of rules or patterns
• Predictive
– Identify a set of rules/model which can be used to predict
currently unknown values
Descriptive Data Mining uses
• Market Basket Analysis
• Clustering
• Classification
Descriptive Data Mining uses:
Market Basket Analysis
• Identifies relationships between data – for example,
patterns in transaction purchases
• A rule (or rules) can be developed. The support for the rule
depends on the frequency of the occurrence, and
a confidence value can be calculated and
expressed as a ratio
• Finding such association rules is known as market basket analysis
• For example: People who buy Blackcurrant Jam also
buy Redcurrant Jelly
– Beer and Nappies?
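The support and confidence of a rule such as "people who buy Blackcurrant Jam also buy Redcurrant Jelly" can be illustrated with a minimal sketch. The transactions below are invented for illustration, and the helper names `support` and `confidence` are assumptions, not part of the lecture.

```python
# Hypothetical transactions; the product names echo the slide example.
transactions = [
    {"Blackcurrant Jam", "Redcurrant Jelly"},
    {"Blackcurrant Jam", "Redcurrant Jelly", "Strawberry Jam"},
    {"Blackcurrant Jam"},
    {"Strawberry Jam"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


# Rule: Blackcurrant Jam -> Redcurrant Jelly
rule_support = support({"Blackcurrant Jam", "Redcurrant Jelly"}, transactions)
rule_confidence = confidence({"Blackcurrant Jam"}, {"Redcurrant Jelly"}, transactions)
print(rule_support, rule_confidence)
```

Here two of the four baskets contain both products, and two of the three baskets with Blackcurrant Jam also contain Redcurrant Jelly, which is where the two ratios come from.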
Example
• BJP analysts discovered that sales of Strawberry Jam
increased:
– When the customer was offered a small pot of Blackcurrant Jam
free with the purchase
– With the height of the person buying the product
• How commercially useful is this information?
• Just because there is a correlation does not mean
it is useful
Descriptive Data Mining uses:
Clustering
• Identifies the natural groupings within data – e.g. customers
may be classified into groups. This is known as customer
segmentation and is useful in Customer Relationship
Management (CRM)
• Data items within groups should be as similar as possible to
each other and as different as possible to other groups
• Need to determine parameters which will result in realistic
clusters
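As a rough illustration of grouping similar items, the sketch below runs plain k-means on a single numeric attribute. The lecture does not name a clustering algorithm, so k-means is an assumption here, as are the order values and the starting centres (which play the role of the parameters that must be chosen).

```python
def k_means_1d(values, centres, iterations=10):
    """Plain k-means on one numeric attribute, starting from the given centres."""
    centres = list(centres)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        centres = [
            sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
    return centres, clusters


# Hypothetical monthly order values for eight customers.
orders = [120, 150, 130, 900, 950, 1000, 140, 980]
centres, clusters = k_means_1d(orders, centres=[100, 500])
print(centres, clusters)
```

With these invented values the customers separate into a low-spend and a high-spend group; a different number of centres or different starting points could give different, possibly less realistic, clusters.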
Example
• BJP has identified clusters of customers who buy
only jam, customers who buy only cake fillings,
customers who buy both
• How would this be commercially useful?
Descriptive Data Mining uses:
Classification
• Data of interest is sorted into predefined classes
• BJP classifies customers as
– Multinational;
– UK based;
– independent chain;
– single outlet
Predictive Data Mining Use
• Customers in the single outlet category typically order
jams and sauces but not cake fillings
• A new client is placed in the single outlet category – it is
possible to predict likely ordering patterns
Stages in Data-Mining
1. Preparation of data
– This stage involves selection and preparation of input data from a
variety of sources
• Data integration
• Data cleansing
• Data warehousing (this usually includes the above)
2. Mining stage
– This stage involves producing useful predictive models (OLAP)
3. Interpretation and Evaluation – Knowledge Discovery
– The final stage involves deploying the models and applying them to
new data in order to generate predictions or new knowledge.
1. Preparation of Data
• Input data must be in or converted to electronic form. It
could come from a variety of different sources such as:
– Operational Databases (sales, finance etc.)
– Commercial Databases (demographics)
– Internet documents
– Spreadsheets or other “office” documents
• The input data must be integrated and cleansed.
• Note – much of the preparation is complete in a data
warehouse
Data Integration
• Data from different sources must be integrated to
provide homogeneity
– Involves de-normalisation of databases
– Dates and times must be of the same format.
– Records must be of the same type
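One small part of integration, bringing dates into a single format, might look like the sketch below. The list of source formats and the sample values are assumptions made for illustration; real sources would need their own list.

```python
from datetime import datetime

# Hypothetical formats that the contributing sources might use.
KNOWN_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"]


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, trying each known source format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"Unrecognised date format: {text!r}")


# The same day as it might be written in three different source systems.
raw_dates = ["03/02/2009", "2009-02-03", "3 February 2009"]
integrated = [to_iso_date(d) for d in raw_dates]
print(integrated)
```

Note that the order of `KNOWN_FORMATS` encodes a decision: a value such as 03/02/2009 is read as day/month/year here, which would be wrong for a source that writes month first.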
Data Cleansing
• Once integrated, the data must be cleansed to resolve the following
issues
– Duplicate data
• Need to delete
– Missing values (unrecorded or really missing?)
• Unrecorded - might not have been required in one or more of the
contributing data sets. Could be added if based on other values, e.g. post
code.
• Really missing - could actually denote a missing value, e.g. an unpaid bill.
• Need to decide how missing values will be represented.
– Irrelevant values
• Need expert to identify sets and delete
– Inaccurate data
• can identify anomalies by using graphs and clusters. Values outside the
normal expected range can be investigated.
– Old data
• Need to delete
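Three of the cleansing checks listed above (duplicates, missing values, and values outside the expected range) can be sketched as follows. The records, field names, post codes and the 0 to 1000 range are all invented for illustration.

```python
# Hypothetical integrated customer records; None marks a value that is absent.
records = [
    {"customer": "A", "post_code": "ST16 1AA", "order_value": 120.0},
    {"customer": "A", "post_code": "ST16 1AA", "order_value": 120.0},  # duplicate
    {"customer": "B", "post_code": None, "order_value": 135.0},  # missing value
    {"customer": "C", "post_code": "EH1 1AA", "order_value": 9999.0},  # suspicious
]


def remove_duplicates(rows):
    """Keep the first copy of each record, preserving the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def out_of_range(rows, field, low, high):
    """Return records whose field lies outside the expected range, for investigation."""
    return [r for r in rows if r[field] is not None and not (low <= r[field] <= high)]


cleaned = remove_duplicates(records)
missing_post_code = [r["customer"] for r in cleaned if r["post_code"] is None]
to_investigate = [r["customer"] for r in out_of_range(cleaned, "order_value", 0, 1000)]
print(len(cleaned), missing_post_code, to_investigate)
```

The sketch only flags the missing and out-of-range values; deciding whether a value is unrecorded, really missing, or simply wrong still needs someone who knows the data.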
What are demographic overlays?
• Most customer databases include post codes.
• Various data is collected via census and based on post codes eg.
– Gender Distribution
– Age distribution
• Other data is known about areas eg
– Proximity to the coast
– Major employers
– Proximity to National parks
• This data could be used in conjunction with customer data to
predict trends. Eg
– If a product sells well in one area close to the coast with a higher than
average percentage of old ladies, then it might be worth marketing that
product in other such areas.
2. Mining stage
A Typical Data Set
• Customer names in a certain post code area
• It is known that in this area 75% of the population is considered
Rich and 75% is male
Histograms
[Figure: 1 dimensional and 2 dimensional histograms. Axes: Number; age bands 21-30, 31-40, 41-50, 51-60 and 61-70; series Poor/Rich and Male/Female; a percentage scale from 0% to 100%.]
Into the 3rd Dimension
[Figure: table combining the two attributes, Female (F)/Male (M) against Poor/Rich.]
• Even with just two attributes each with two values the table is
more difficult to understand.
• What if there were 16 attributes each with multiple values?
• The number of 2d histograms which could be potentially
useful would be over 100.
• This structure is known as an OLAP cube.
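The "over 100" figure for 16 attributes can be checked with a short calculation, assuming each 2d histogram is built from one unordered pair of distinct attributes (an assumption; the slide does not say how the histograms are counted).

```python
from math import comb

attributes = 16
# Number of ways to choose 2 distinct attributes out of 16, ignoring order.
pairs = comb(attributes, 2)
print(pairs)
```

Under that assumption there are 16 × 15 / 2 = 120 possible attribute pairs.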
On-line Analytical Processing (OLAP)
• OLAP functionality is characterised by dynamic multi-dimensional
analysis of consolidated enterprise data:
– Slice: A slice is a subset of a multi-dimensional array corresponding to a
single value for one or more members of the dimensions not in the
subset.
– Dice: The dice operation is a slice on more than two dimensions of a
data cube (or more than two consecutive slices).
– Drill Down/Up: Drilling down or up is a specific analytical technique
whereby the user navigates among levels of data ranging from the
most summarized (up) to the most detailed (down).
– Roll-up: A roll-up involves computing all of the data relationships for
one or more dimensions. To do this, a computational relationship or
formula might be defined.
– Pivot: To change the dimensional orientation of a report or page display
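The operations above can be sketched on a tiny cube held as a Python dictionary. The sales figures and the function names are hypothetical, and summation stands in for whatever formula a real roll-up would define.

```python
from collections import defaultdict

# Hypothetical sales cube: (centre, product, quarter) -> units sold.
cube = {
    ("England", "Jam", "Q1"): 10,
    ("England", "Jam", "Q2"): 12,
    ("England", "Sauce", "Q1"): 5,
    ("Scotland", "Jam", "Q1"): 7,
    ("Scotland", "Sauce", "Q2"): 3,
    ("Spain", "Jam", "Q2"): 4,
}
DIMENSIONS = ("centre", "product", "quarter")


def slice_cube(cube, dimension, value):
    """Slice: fix one dimension to a single value and drop it from the keys."""
    axis = DIMENSIONS.index(dimension)
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}


def roll_up(cube, keep):
    """Roll-up: sum over every dimension that is not listed in keep."""
    axes = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[a] for a in axes)] += value
    return dict(totals)


def pivot(table):
    """Pivot: swap the two dimensions of a two-dimensional table."""
    return {(b, a): v for (a, b), v in table.items()}


q1 = slice_cube(cube, "quarter", "Q1")    # centre x product for Q1 only
by_centre = roll_up(cube, ["centre"])     # totals per centre
q1_pivoted = pivot(q1)                    # product x centre for Q1
print(q1, by_centre, q1_pivoted)
```

Drilling down would be the reverse of the roll-up: moving from `by_centre` back to the per-product or per-quarter cells that make up each total.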
OLAP
• Uses various algorithms – examples are:
a. Decision trees
b. K-Nearest neighbour
Decision trees
• The Decision Tree is one of the most popular classification
algorithms in current use in Data Mining
– A decision tree takes as input an object or situation described by a
set of properties, and outputs a yes/no decision.
– Algorithm is recursive partitioning – divide and conquer.
– Internal nodes denote a test.
– A branch represents the outcome of a test.
– The leaf nodes represent the class.
• The algorithm is simple but extremely powerful.
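How a finished tree turns a set of properties into a decision can be shown with a minimal sketch. The tree below is hand-built rather than learned, and its tests and leaf labels are hypothetical.

```python
# A node is either a class label (leaf) or a dict holding a test and its branches.
# The tests and leaf labels here are hypothetical.
tree = {
    "test": "has_wings",
    "yes": "not in class",
    "no": {
        "test": "has_whiskers",
        "yes": "in class",
        "no": "not in class",
    },
}


def classify(node, sample):
    """Follow the branch chosen by each test until a leaf is reached."""
    while isinstance(node, dict):
        outcome = "yes" if sample[node["test"]] else "no"
        node = node[outcome]
    return node


rabbit = {"has_wings": False, "has_whiskers": True}
sparrow = {"has_wings": True, "has_whiskers": False}
print(classify(tree, rabbit), classify(tree, sparrow))
```

Each internal node applies one test, each branch corresponds to one outcome, and the walk stops at a leaf that carries the decision.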
Decision tree example
[Figure: decision tree testing whether a candidate for the class label (Rabbit) belongs to the class. Internal nodes are the tests Wings?, Swims?, Legs?, Whiskers? and Eats Meat?, each with Y/N branches; a leaf node reads "Not in class!". Rules recorded for the candidate: Rabbit does not have wings, Rabbit does not swim, Rabbit has legs, Rabbit has whiskers, Rabbit does not eat meat.]
Need to decide
• Which attributes to select in order to identify the class of
the sample as quickly as possible?
• When to stop?
– No remaining attributes to test or when the class is determined
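One common way to choose which attribute to test first is information gain, the criterion used by ID3-style algorithms. The lecture does not say which criterion it has in mind, so this is an assumption, and the samples below are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(samples, attribute, target):
    """Reduction in entropy of target obtained by splitting samples on attribute."""
    before = entropy([s[target] for s in samples])
    after = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        after += len(subset) / len(samples) * entropy(subset)
    return before - after


# Hypothetical samples: does the customer order cake fillings?
samples = [
    {"category": "single outlet", "centre": "England", "orders_fillings": "no"},
    {"category": "single outlet", "centre": "Spain", "orders_fillings": "no"},
    {"category": "multinational", "centre": "England", "orders_fillings": "yes"},
    {"category": "multinational", "centre": "Spain", "orders_fillings": "yes"},
]
gains = {a: information_gain(samples, a, "orders_fillings") for a in ("category", "centre")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data the customer category separates the classes completely while the centre tells us nothing, so testing the category first determines the class in a single step.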
K-Nearest neighbour
• k-nearest neighbour algorithm (k-NN) is a popular method for
classification
– feature space is a multidimensional space where each pattern sample is
represented as a point whose dimension is determined by the number of
features used to describe the patterns.
– Firstly the training samples and their class labels are plotted in the
multidimensional feature space. The space is then partitioned into regions
by class labels of the training samples. The training phase of the algorithm
consists simply of plotting the points in the feature space.
– In the actual classification phase, the same features as before are
computed for a test sample. Distances from the new point to all stored
points are computed and k closest samples are selected. The test sample
is assigned to the class whose label is the most frequent among the k
nearest training samples.
• The algorithm is easy to implement, but it is computationally intensive,
especially when the size of the training set grows.
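The classification phase described above can be written in a few lines. This is a minimal sketch that assumes Euclidean distance and a simple majority vote; the training points and class labels are invented.

```python
from collections import Counter
from math import dist


def knn_classify(training, test_point, k):
    """Assign test_point the most frequent label among its k nearest training samples."""
    # "Training" is just storing the points; all the work happens at query time.
    nearest = sorted(training, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training samples: (features, class label).
training = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.5), "red"),
    ((6.0, 6.0), "blue"),
    ((6.5, 7.0), "blue"),
    ((7.0, 6.5), "blue"),
]
print(knn_classify(training, (2.0, 2.0), k=3))
print(knn_classify(training, (6.0, 6.5), k=3))
```

Every query computes the distance to every stored point, which is why the cost grows with the size of the training set.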
K-Nearest neighbour Example
This is simplistic – usually 16 or more attributes are used.
The small coloured dots are the training samples
Each colour represents a different class label
The large black dots are test samples
When k > 5, boundaries are less distinct in most cases
Need to decide
• A value for k
• The best choice of k depends upon the data
– generally, larger values of k reduce the effect of noise on the
classification, but make boundaries between classes less distinct
3. Interpretation and Evaluation
• Uses of Data Mining in business
– Market segmentation
• Identify the common characteristics of customers who buy certain products from a
company.
– Customer churn
• Predict which customers are likely to leave your company and go to a competitor.
– Fraud detection
• Identify which transactions are most likely to be fraudulent.
– Direct marketing
• Identify which prospects should be included in a mailing list to obtain the highest
response rate.
– Supermarket basket analysis
• Understand what products or services are commonly purchased together.
– Trend analysis
• Reveal the difference between a typical customer this month and last. Allows
organisations to map trends and