Chapter 26: Data Mining
Prepared by Assoc. Professor Bela Stantic
Definition
Data mining is the exploration and analysis
of large quantities of data in order to
discover valid, novel, potentially useful,
and ultimately understandable patterns in
data.
Example pattern: 62% of customers who bought milk bought cheese
as well
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition.
Definition (Cont.)
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and
comprehend the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
• Volume and dimensionality of the data
• High data growth rate
Availability of:
• Data
• Storage
• Computational power
• Off-the-shelf software
• Expertise
Sources of Data
• Supermarket scanners, POS data
• Credit card transactions
• Direct mail response
• Call center records
• ATM machines
• Demographic data
• Sensor networks
• Cameras
• Web server logs
• Customer web site trails
Why Use Data Mining Today?
Competitive pressure!
“The secret of success is to know something that
nobody else knows.”
Aristotle Onassis
• Competition on service, not only on price
(Banks, phone companies, hotel chains, rental
car companies)
• Personalization
• The real-time enterprise
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes
Data Mining Step in Detail
2.1 Data preprocessing
• Data selection: Identify target datasets and
relevant fields
• Data cleaning
• Remove noise and outliers
• Data transformation
• Create common units
• Generate new fields
2.2 Data mining model construction
2.3 Model evaluation – present to the end user in
understandable form (visually)
Preprocessing and Mining
[Figure: Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge]
What is a Data Mining Model?
A data mining model is a description of a
specific aspect of a dataset. It produces
output values for an assigned set of input
values.
Examples:
• Linear regression model
• Classification model
• Clustering
Data Mining: Types of Data
• Relational data and transactional data
• Spatial and temporal data, spatio-temporal
observations
• Time-series data
• Text
• Images, video
• Mixtures of data
• Sequence data
• Features from processing other data sources
Types of Variables
• Numerical: Domain is ordered and can be
represented on the real line (e.g., age, income)
• Nominal or categorical: Domain is a finite set
without any natural ordering (e.g., occupation,
marital status, race)
• Ordinal: Domain is ordered, but absolute
differences between values are unknown (e.g.,
preference scale, severity of an injury)
Applications of Frequent Itemsets
• Market Basket Analysis
• Association Rules
• Classification (especially: text)
• Seeds for construction of Bayesian Networks
• Web log analysis
• Collaborative filtering
Frequent Itemset
• Itemset: a set of items purchased, e.g. {pen}, {pen,
milk}, …
• Support of an itemset: the fraction of
transactions in the database that contain all the
items in the itemset.
• Frequent itemset: an itemset whose support is higher
than the user-specified minimum support.
• The a priori property: every subset of a
frequent itemset is also a frequent itemset.
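The definitions above can be sketched in a few lines of Python. This is only an illustration: it uses the four example transactions (TIDs 111 to 114) from the Purchases table in these slides, and the names TRANSACTIONS, support, and is_frequent are hypothetical helpers, not part of the textbook. The threshold test uses >= as in the later example slides, which is an assumption about how ties are treated.

```python
# Items per transaction, keyed by TID (taken from the example Purchases table).
TRANSACTIONS = {
    111: {"pen", "ink", "milk", "juice"},
    112: {"pen", "ink", "milk"},
    113: {"pen", "milk"},
    114: {"pen", "ink", "juice"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def is_frequent(itemset, minsup, transactions=TRANSACTIONS):
    """Assumption: an itemset counts as frequent when support >= minsup."""
    return support(itemset, transactions) >= minsup


print(support({"pen", "milk"}))       # 3 of the 4 transactions contain both
print(is_frequent({"juice"}, 0.7))    # juice appears in only 2 of 4
```

The a priori property follows directly from this definition: any transaction that contains an itemset also contains each of its subsets, so a subset can never have lower support than the full itemset.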
Frequent Itemset – refined algorithm
• Minimum support: 70%
• Level 1: the itemsets {pen}, {milk}, and {ink} are
found to be frequent.
• Level 2: by the a priori property, candidate
two-item sets can be built only from frequent
items: {pen, milk}, {pen, ink}, and {ink, milk}.
Of these, {pen, milk} and {pen, ink} are frequent.
• Level 3 is not required: {ink, milk} is not
frequent, so the itemset {pen, ink, milk} cannot be
frequent either.
TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4
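The level-by-level search described on this slide can be sketched as follows. This is a minimal illustration of the idea rather than the textbook's exact algorithm: the function name frequent_itemsets is hypothetical, "frequent" is taken to mean support >= minsup (an assumption), and support is recounted by scanning all transactions each time.

```python
from itertools import combinations

# Items per transaction from the example table.
TRANSACTIONS = [
    {"pen", "ink", "milk", "juice"},  # TID 111
    {"pen", "ink", "milk"},           # TID 112
    {"pen", "milk"},                  # TID 113
    {"pen", "ink", "juice"},          # TID 114
]


def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Level-wise search: level k is built only from frequent (k-1)-itemsets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= minsup]
    result = []
    k = 1
    while level:
        result.extend(level)
        k += 1
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets with exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (a priori property): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c, transactions) >= minsup]
    return result


for itemset in frequent_itemsets(TRANSACTIONS, 0.7):
    print(sorted(itemset))
```

With a minimum support of 70%, the candidate {pen, ink, milk} is discarded in the prune step without counting it, because its subset {ink, milk} is not frequent.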
Iceberg queries
• The logic of the refined frequent itemset algorithm also applies to
iceberg queries.
• Consider the example:
SELECT custID, Item, SUM(qty)
FROM Purchase
GROUP BY custID, Item
HAVING SUM(qty) > 5
A (custID, Item) group can satisfy the HAVING condition only if the
customer's total quantity and the item's total quantity both exceed 5
(quantities are non-negative). The query therefore performs better if
we first find only the customers and the items that satisfy the criterion:
SELECT custID, SUM(qty)
FROM Purchase
GROUP BY custID
HAVING SUM(qty) > 5
and
SELECT Item, SUM(qty)
FROM Purchase
GROUP BY Item
HAVING SUM(qty) > 5
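To see the pruning idea at work, the sketch below loads the example purchases into an in-memory SQLite database and compares the plain iceberg query with a version restricted to candidate customers and items. It is an illustration under assumptions: the table layout, the column names other than custID, Item, and qty, and the way the two filters are combined with IN subqueries are choices made here, not something the slides prescribe.

```python
import sqlite3

# (tid, custID, date, Item, qty) rows from the example table.
ROWS = [
    (111, 201, "5/1/99", "Pen", 2), (111, 201, "5/1/99", "Ink", 1),
    (111, 201, "5/1/99", "Milk", 3), (111, 201, "5/1/99", "Juice", 6),
    (112, 105, "6/3/99", "Pen", 1), (112, 105, "6/3/99", "Ink", 1),
    (112, 105, "6/3/99", "Milk", 1),
    (113, 106, "6/5/99", "Pen", 1), (113, 106, "6/5/99", "Milk", 1),
    (114, 201, "7/1/99", "Pen", 2), (114, 201, "7/1/99", "Ink", 2),
    (114, 201, "7/1/99", "Juice", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Purchase (tid INT, custID INT, date TEXT, Item TEXT, qty INT)"
)
conn.executemany("INSERT INTO Purchase VALUES (?, ?, ?, ?, ?)", ROWS)

# The original iceberg query: aggregate every (custID, Item) group.
NAIVE = """
SELECT custID, Item, SUM(qty) FROM Purchase
GROUP BY custID, Item
HAVING SUM(qty) > 5
"""

# Pruned variant: only groups whose customer AND item pass on their own
# can pass together (valid because qty is never negative).
PRUNED = """
SELECT custID, Item, SUM(qty) FROM Purchase
WHERE custID IN (SELECT custID FROM Purchase
                 GROUP BY custID HAVING SUM(qty) > 5)
  AND Item IN (SELECT Item FROM Purchase
               GROUP BY Item HAVING SUM(qty) > 5)
GROUP BY custID, Item
HAVING SUM(qty) > 5
"""

naive_result = sorted(conn.execute(NAIVE).fetchall())
pruned_result = sorted(conn.execute(PRUNED).fetchall())
print(naive_result)
print(pruned_result)
```

On this tiny table both queries return the same single group. The benefit of the pruned form shows up on large tables, where most customers and items fail the single-attribute test and never need to be grouped as pairs.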
Association Analysis
• Consider a shopping cart filled with several items
• Market basket analysis tries to answer the
following questions:
• Who makes purchases?
• What do customers buy together?
• In what order do customers purchase items?
• When do customers purchase the most and
what?
Market Basket Analysis
Given:
• A database of
customer transactions
• Each transaction is a
set of items
• Example:
Transaction with TID
111 contains items
{Pen, Ink, Milk, Juice}
Market Basket Analysis (Contd.)
• Co-occurrences
• 80% of all customers purchase items X, Y and
Z together.
• Association rules
• 60% of all customers who purchase X and Y
also buy Z.
• Sequential patterns
• 60% of customers who first buy X also
purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association rules using two
interestingness measures:
• Support of a rule:
• X => Y has support s if P(XY) = s
• Represents the percentage of transactions that contain all the
items in X and in Y
• Confidence of a rule:
• X => Y has confidence c if P(Y | X) = sup(XY) / sup(X) = c
• Confidence for a rule X => Y is the percentage of the transactions
containing all items in X that also contain all items in Y
We can also define
• Support of an itemset (a co-occurrence) XY:
• XY has support s if P(XY) = s
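The two measures can be computed directly from their definitions. The sketch below does so for the four example transactions in these slides; the helper names support, rule_support, and rule_confidence are hypothetical and chosen only for this illustration.

```python
# Items per transaction from the example Purchases table.
TRANSACTIONS = [
    {"pen", "ink", "milk", "juice"},  # TID 111
    {"pen", "ink", "milk"},           # TID 112
    {"pen", "milk"},                  # TID 113
    {"pen", "ink", "juice"},          # TID 114
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in TRANSACTIONS if itemset <= t) / len(TRANSACTIONS)


def rule_support(lhs, rhs):
    """Support of LHS => RHS: support of the combined itemset."""
    return support(set(lhs) | set(rhs))


def rule_confidence(lhs, rhs):
    """sup(LHS and RHS) / sup(LHS); assumes LHS occurs at least once."""
    return rule_support(lhs, rhs) / support(lhs)


for lhs, rhs in [({"pen"}, {"milk"}), ({"ink"}, {"pen"})]:
    print(sorted(lhs), "=>", sorted(rhs),
          rule_support(lhs, rhs), rule_confidence(lhs, rhs))
```

Note that support is symmetric in the two sides of a rule while confidence is not, so {ink} => {pen} and {pen} => {ink} share the same support but have different confidence.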
Example
• {Pen} => {Milk}
Support: 75%
Confidence: 75%
• {Ink} => {Pen}
Support: 75%
Confidence: 100%
Example
• Find all itemsets with
support >= 75%?
Example
• Can you find all
association rules with
support >= 50%?
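One way to approach the question above is a brute-force enumeration: list every itemset with at least two items, keep those that meet the support threshold, and split each into all possible left-hand and right-hand sides. The sketch below does this for the four example transactions. The function name association_rules and the optional minconf parameter are assumptions made for illustration, and exhaustive enumeration is only practical for a handful of items.

```python
from itertools import combinations

# Items per transaction from the example Purchases table.
TRANSACTIONS = [
    {"pen", "ink", "milk", "juice"},  # TID 111
    {"pen", "ink", "milk"},           # TID 112
    {"pen", "milk"},                  # TID 113
    {"pen", "ink", "juice"},          # TID 114
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t) / len(TRANSACTIONS)


def association_rules(minsup, minconf=0.0):
    """Return (lhs, rhs, support, confidence) for every qualifying rule."""
    items = sorted(set().union(*TRANSACTIONS))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset)
            if sup == 0 or sup < minsup:
                continue  # rule support equals the itemset's support
            # Every non-empty proper subset can serve as a left-hand side.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    lhs = frozenset(lhs)
                    conf = sup / support(lhs)
                    if conf >= minconf:
                        rules.append((lhs, itemset - lhs, sup, conf))
    return rules


rules = association_rules(minsup=0.5)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), sup, round(conf, 2))
```

Passing a minconf value as well, for example association_rules(0.5, minconf=1.0), narrows the list to rules that hold in every transaction containing the left-hand side.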
Market Basket Analysis: Applications
• Sample Applications
• Direct marketing
• Fraud detection for medical insurance
• Floor/shelf planning
• Web site layout
• Cross-selling
Association Rules and ISA Hierarchies
• An ISA hierarchy, or category hierarchy, can be
imposed on the set of items: for example, Pen
and Ink belong to Stationery, while Juice and
Milk belong to Beverages.
• Applying association rules over the hierarchy
allows us to detect relationships between items
at different levels of the hierarchy.
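A simple way to mine across hierarchy levels is to extend each transaction with the categories of its items and then compute support as usual. The sketch below illustrates this with the Stationery and Beverages categories named above; the CATEGORY mapping and the generalize helper are hypothetical names introduced for this example.

```python
# Category of each item, following the hierarchy described above.
CATEGORY = {
    "pen": "stationery", "ink": "stationery",
    "milk": "beverages", "juice": "beverages",
}

# Items per transaction from the example Purchases table.
TRANSACTIONS = [
    {"pen", "ink", "milk", "juice"},  # TID 111
    {"pen", "ink", "milk"},           # TID 112
    {"pen", "milk"},                  # TID 113
    {"pen", "ink", "juice"},          # TID 114
]


def generalize(transaction):
    """Add each item's category so itemsets can mix items and categories."""
    return transaction | {CATEGORY[item] for item in transaction}


EXTENDED = [generalize(t) for t in TRANSACTIONS]


def support(itemset, transactions=EXTENDED):
    """Fraction of transactions that contain every element of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


print(support({"ink", "juice"}))      # item-level itemset
print(support({"ink", "beverages"}))  # item combined with a category
```

Replacing an item by its category can only keep or raise support, which is why relationships that are too rare at the item level may become visible at a higher level of the hierarchy.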
Generalised Association Rules
“On a day when a pen is purchased, it is likely that
milk is also purchased”
• If we use the date field as the grouping attribute, we can consider a
more general problem called calendric market basket analysis.
• Example calendars: every Thursday, the first Sunday of every month,
the first Monday of every semester, etc.
The use of Assoc. Rules for prediction
• Association rules are widely used for prediction;
however, such predictive use is not justified
without additional analysis and domain knowledge.