Data Mining – Day 2
Fabiano Dalpiaz
Department of Information and Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Database and Business Intelligence
Academic Year 2007-2008
Knowledge Discovery (KDD) Process
[Diagram: the KDD pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation. The earlier steps were presented yesterday; Data Mining and Pattern Evaluation are today's topics.]
Outline
Data Mining techniques
  Frequent patterns, association rules
  • Support and confidence
  Classification and prediction
  • Decision trees
  • Bayesian classifiers
  • Support Vector Machines
  • Lazy learning
  Cluster Analysis
Visualization of the results
Summary
Data Mining techniques
Frequent pattern analysis
What is it?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Frequent pattern analysis: searching for frequent patterns
Motivation: finding inherent regularities in data
• Which products are bought together? Yesterday's wine and spaghetti example
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
Applications
• Basket data analysis
• Cross-marketing
• Catalog design
• Sales campaign analysis
Basic Concepts: Frequent Patterns and Association Rules (1)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Itemsets (= transactions in this example)

Goal: find all rules of type X → Y between items in an itemset, with minimum:
Support s: the probability that an itemset contains both X and Y
Confidence c: the conditional probability that an itemset containing X also contains Y
Basic Concepts: Frequent Patterns and Association Rules (2)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: support s = 50%, confidence c = 50%

Support is used to define frequent patterns (sets of products appearing in at least s% of the itemsets):
{Wine} in itemsets 1, 2, 3 (support = 60%)
{Bread} in itemsets 1, 4, 5 (support = 60%)
{Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
{Cheese} in itemsets 3, 4, 5 (support = 60%)
{Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
Basic Concepts: Frequent Patterns and Association Rules (3)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: support s = 50%, confidence c = 50%

Confidence defines association rules: X → Y rules over frequent patterns whose confidence is greater than c

Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?

Association rules:
Wine → Spaghetti (support = 60%, confidence = 100%)
Spaghetti → Wine (support = 60%, confidence = 75%)
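To make the two measures concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the support and confidence values of the wine and spaghetti example above:

```python
# The five transactions from the slide.
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Wine", "Spaghetti"}))        # 0.6
print(confidence({"Wine"}, {"Spaghetti"}))   # 1.0
print(confidence({"Spaghetti"}, {"Wine"}))   # 0.75
```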
Advanced concepts in Association Rules discovery
Algorithms must address scalability problems
Apriori principle: if an itemset is infrequent, its supersets should not be generated or tested!
Advanced problems
Boolean vs. quantitative associations
age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "car") [s = 1%, c = 75%]
Single-level vs. multiple-level analysis
What brands of wine are associated with what brands of spaghetti?
Are support and confidence clear?
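The Apriori pruning principle stated above can be illustrated with a toy level-wise implementation in Python; this is an illustrative sketch, not the slides' algorithm, and it omits the full subset-pruning step of the real Apriori:

```python
def apriori(transactions, min_support):
    """Toy level-wise Apriori: returns all frequent itemsets as frozensets."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: size-k candidates are built only from frequent (k-1)-itemsets,
        # so supersets of infrequent itemsets are never generated or tested.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= current
        k += 1
    return frequent

transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]
print(apriori(transactions, 0.5))  # the five frequent itemsets from the earlier slide
```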
Another example for association rules

Transaction-id | Items bought
1 | Margherita, Beer, Coke
2 | Margherita, Beer
3 | Quattro stagioni, Coke
4 | Margherita, Coke

Suppose: support s = 40%, confidence c = 70%

Frequent itemsets:
{Margherita} = 75%
{Beer} = 50%
{Coke} = 75%
{Margherita, Beer} = 50%
{Margherita, Coke} = 50%

Association rules:
Beer → Margherita (support = 50%, confidence = 100%)
Classification vs. Prediction
Classification
  Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  The characterization is a model
  The model can be applied to classify new data (predict the class they should belong to)
Prediction
  Models continuous-valued functions, i.e., predicts unknown or missing values
Applications
  Credit approval, target marketing, fraud detection
Classification: the process
1. Model construction
  The class label attribute defines the class each item should belong to
  The set of items used for model construction is called the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
  Estimate the accuracy of the model
  • Accuracy measured on the training set alone tends to be over-optimistic
  • A better estimate is obtained on a test set that is independent of the training set
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
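As an illustration of the two phases (model construction on a training set, accuracy estimation on held-out data), here is a minimal sketch using scikit-learn; the synthetic dataset and the classifier parameters are placeholder choices, not taken from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out part of the data so accuracy is measured on unseen tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)       # model construction
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # model usage / evaluation
```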
Classification: the process
Model construction

Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

A classification algorithm builds the classifier (model) from the training data, for example:
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Classification: the process
Model usage

Classifier: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Evaluating generated models
Accuracy
  classifier accuracy: predicting class labels
  predictor accuracy: guessing values of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness
  handling noise and missing values
Scalability
  efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Classification techniques
Decision Trees (1)
[Decision tree diagram for an investment type choice: internal nodes test Income > 20K€, Age > 60, and Married?; the leaves assign Low risk, Mid risk, or High risk.]
Classification techniques
Decision Trees (2)
How are the attributes in decision trees selected?
Two well-known measures are used
• Information gain selects the attribute that is most informative for distinguishing the items among the classes
• It is biased towards attributes with a large set of values
• Gain ratio addresses this limitation of information gain
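To make attribute selection concrete, the following minimal Python sketch (not from the slides; the tiny dataset is invented) computes information gain as the reduction in class entropy obtained by splitting on an attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = entropy(labels)
    n = len(rows)
    remainder = 0.0
    for value in set(row[attribute_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Invented toy data: (income_high, married) -> risk class
rows = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
labels = ["high", "mid", "low", "low"]
print(information_gain(rows, labels, 0))  # gain of splitting on income: 1.0
print(information_gain(rows, labels, 1))  # gain of splitting on marital status: 0.5
```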
Classification techniques
Bayesian classifiers (2)
Bayesian classification
  A statistical classification technique
  • Predicts class membership probabilities
  Founded on Bayes' theorem:
  P(H | X) = P(X | H) · P(H) / P(X)
  • What if X = "Red and rounded" and H = "Apple"?
Performance
  • The simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
Incremental
  • Each training example can increase or decrease the probability that a hypothesis is correct
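As a worked illustration of the theorem, the sketch below plugs invented numbers into the apple example; every probability here is a placeholder, not data from the slides:

```python
# Hypothesis H = "the fruit is an apple", evidence X = "red and rounded".
# All probabilities below are invented for illustration.
p_h = 0.30          # prior P(H): fraction of fruits that are apples
p_x_given_h = 0.80  # likelihood P(X|H): apples that are red and rounded
p_x = 0.40          # evidence P(X): fruits that are red and rounded overall

# Bayes' theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6: observing "red and rounded" raises the apple probability from 0.3 to 0.6
```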
5 minutes break!
Classification techniques
Support Vector Machines
One of the most advanced classification techniques
[Two figures: on the left, a separating boundary with a small margin between the classes; on the right, the boundary with the largest margin.]
Support vector machines (SVMs) are able to identify the largest margin, as in the right figure
Classification techniques
SVMs + Kernel Functions
Is data always linearly separable? NO!
Solution: SVMs + kernel functions
[Figure: a data set that cannot be split by a straight line; a plain SVM fails, while an SVM with a kernel function separates the two classes.]
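A minimal scikit-learn sketch (not from the slides) of this idea: on a data set that no straight line can separate, a linear SVM performs near chance while the same SVM with an RBF kernel separates the classes; the dataset and parameters are illustrative choices:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: impossible to separate with a straight line.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf").fit(X, y)

print("linear SVM accuracy:", linear_svm.score(X, y))      # roughly chance level
print("RBF-kernel SVM accuracy:", kernel_svm.score(X, y))  # close to 1.0
```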
Classification techniques
Lazy learning
Lazy learning
Simply stores the training data (with little or no processing) and waits until it is given a test tuple
Less time in training but more time in predicting
Uses a richer hypothesis space (many local linear functions), which can yield higher accuracy
Instance-based learning
Subcategory of lazy learning
Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
An example: k-nearest neighbor approach
Classification techniques
k-nearest neighbor
All instances correspond to points in the n-dimensional space; x is the instance to be classified
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
For discrete-valued class attributes, k-NN returns the most common value among the k training examples nearest to x
[Figure: a green circle (the query point) surrounded by red and blue points.]
Which class should the green circle belong to? It depends on k: with k = 3 it is red, with k = 5 it is blue.
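The following minimal Python sketch (with invented coordinates, not the slides' figure) shows a plain k-NN classifier and how the predicted class of the same query point can flip between k = 3 and k = 5:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Invented 2-D training points: (coordinates, class label)
training = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
    ((2.5, 2.5), "blue"), ((2.6, 2.4), "blue"), ((2.4, 2.6), "blue"),
]

def knn_predict(query, k):
    """Return the majority class among the k training points closest to `query`."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

query = (1.6, 1.6)  # the "green circle"
print(knn_predict(query, k=3))  # "red": the 3 closest points are mostly red
print(knn_predict(query, k=5))  # "blue": the two extra neighbours are blue
```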
Prediction techniques
An overview
Prediction is different from classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
Major method for prediction: regression
  Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
Regression analysis
  Linear and multiple regression
  Non-linear regression
  Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees
No details here
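Although the slides give no details on regression, a least-squares fit on invented data can serve as a minimal illustration of predicting a continuous value:

```python
# Simple (one-variable) least-squares linear regression, implemented directly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # predictor variable (invented)
ys = [1.9, 4.1, 6.2, 7.8, 10.1]  # response variable (invented)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)  # a continuous-valued prediction
```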
What is cluster analysis?
Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
  It belongs to unsupervised learning
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms (day 1 slides)
Examples of cluster analysis
Marketing:
  Help marketers discover distinct groups in their customer bases
Land use:
  Identification of areas of similar land use in an earth observation database
Insurance:
  Identifying groups of motor insurance policy holders with a high average claim cost
City-planning:
  Identifying groups of houses according to their house type, value, and geographical location
Good clustering
A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
It is hard to define "similar enough" or "good enough"
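To connect the distance function d(i, j) with the intra-/inter-cluster similarity goal, here is a small k-means sketch in Python; the 2-D points are invented and k-means is just one clustering method among many, not one the slides prescribe:

```python
from math import dist
from statistics import mean

# Invented 2-D points forming two visible groups.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]

def kmeans(points, k, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = points[:k]  # naive initialisation: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((mean(x for x, _ in cluster),
                                      mean(y for _, y in cluster)))
            else:
                new_centroids.append(old)  # keep the old centroid if its cluster is empty
        centroids = new_centroids
    return clusters

for cluster in kmeans(points, k=2):
    print(cluster)  # points near (1, 1) end up in one cluster, points near (5, 5) in the other
```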
A small example
How to cluster this data?
This process is not easy in practice. Why?
Visualization of the results
Presentation of the results or knowledge obtained from data mining in visual forms
Examples
Scatter plots
Association rules
Decision trees
Clusters
Scatter plots (SAS Enterprise Miner)
[Screenshot]
Association rules (SGI/Mineset)
[Screenshot]
Decision trees (SGI/Mineset)
[Screenshot]
Clusters (IBM Intelligent Miner)
[Screenshot]
Summary
Why Data Mining?
Data Mining and KDD
Data preprocessing
Some scenarios
Classification
Clustering