Data Mining – Day 2
Fabiano Dalpiaz
Department of Information and
Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Database e Business Intelligence
A.A. 2007-2008
Knowledge Discovery (KDD) Process

[Figure: the KDD process chain. Databases feed Data Cleaning and Data Integration, producing a Data Warehouse; Selection yields the Task-relevant Data; Data Mining then extracts patterns, followed by Pattern Evaluation. The steps up to selection were presented yesterday; today covers Data Mining and Pattern Evaluation.]
Outline
- Data Mining techniques
- Frequent patterns, association rules
  • Support and confidence
- Classification and prediction
  • Decision trees
  • Bayesian classifiers
  • Support Vector Machines
  • Lazy learning
- Cluster Analysis
- Visualization of the results
- Summary
Data Mining techniques
Frequent pattern analysis
- What is it?
  • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
  • Frequent pattern analysis: searching for frequent patterns
- Motivation: finding inherent regularities in data
  • Which products are bought together? Yesterday's wine-and-spaghetti example
  • What are the subsequent purchases after buying a PC?
  • Can we automatically classify web documents?
- Applications
  • Basket data analysis
  • Cross-marketing
  • Catalog design
  • Sale campaign analysis
Basic Concepts: Frequent Patterns and Association Rules (1)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

(Itemsets = transactions in this example.)

Goal: find all rules of type X ⇒ Y between items in an itemset, with minimum:
- Support s: the probability that an itemset contains X ∪ Y
- Confidence c: the conditional probability that an itemset containing X also contains Y

(A small computational sketch of both measures follows.)
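These definitions are easy to check in code. Below is a minimal sketch (plain Python; the data layout and function names are ours) that recomputes support and confidence over the five transactions above.

transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    # fraction of transactions that contain every item of `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # conditional probability: support(X ∪ Y) / support(X)
    return support(X | Y) / support(X)

print(support({"Wine", "Spaghetti"}))       # 0.6
print(confidence({"Wine"}, {"Spaghetti"}))  # 1.0
print(confidence({"Spaghetti"}, {"Wine"}))  # 0.75

These values match the supports and confidences reported on the next two slides.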
Basic Concepts: Frequent Patterns and Association Rules (2)

(Same transaction table as in (1). Suppose: support s = 50%, confidence c = 50%.)

Support is used to define frequent patterns: sets of products appearing in at least s% of the itemsets.
{Wine} in itemsets 1, 2, 3 (support = 60%)
{Bread} in itemsets 1, 4, 5 (support = 60%)
{Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
{Cheese} in itemsets 3, 4, 5 (support = 60%)
{Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
Basic Concepts: Frequent Patterns and Association Rules (3)

(Same transaction table as in (1). Suppose: support s = 50%, confidence c = 50%.)

Confidence defines association rules: rules X ⇒ Y built from frequent patterns, whose confidence is at least c.

Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why? (A rule needs a non-empty left- and right-hand side, so frequent patterns with a single item cannot generate any rule.)

Association rules:
Wine ⇒ Spaghetti (support = 60%, confidence = 100%)
Spaghetti ⇒ Wine (support = 60%, confidence = 75%)
Advanced concepts in Association Rules discovery
- Algorithms must face scalability problems
  • Apriori: if an itemset is infrequent, its supersets should not be generated or tested! (A minimal sketch of this pruning follows.)
- Advanced problems
  • Boolean vs. quantitative associations
    age(x, "30..39") and income(x, "42..48K") ⇒ buys(x, "car") [s = 1%, c = 75%]
  • Single-level vs. multiple-level analysis
    Which brands of wine are associated with which brands of spaghetti?

Are support and confidence clear?
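As a concrete illustration of the Apriori pruning idea, here is a minimal level-wise sketch (our own, plain Python): size-k candidates are joined from frequent (k-1)-itemsets, and any candidate with an infrequent subset is discarded before its support is even counted.

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # frequent 1-itemsets
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    result, k = set(freq), 2
    while freq:
        # join step: combine frequent (k-1)-itemsets into size-k candidates
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # count support only for the surviving candidates
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        result |= freq
        k += 1
    return result

On the wine/spaghetti table with min_support = 0.5, this returns the four frequent single items plus {Wine, Spaghetti}, exactly the frequent patterns listed earlier.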
Another example for association rules

Transaction-id | Items bought
1 | Margherita, Beer, Coke
2 | Margherita, Beer
3 | Quattro stagioni, Coke
4 | Margherita, Coke

Suppose: support s = 40%, confidence c = 70%.

Frequent itemsets:
{Margherita} = 75%
{Beer} = 50%
{Coke} = 75%
{Margherita, Beer} = 50%
{Margherita, Coke} = 50%

Association rules:
Beer ⇒ Margherita [s = 50%, c = 100%]
Classification vs. Prediction
- Classification
  • Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  • The characterization is a model
  • The model can be applied to classify new data (predict the class they should belong to)
- Prediction
  • Models continuous-valued functions, i.e., predicts unknown or missing values
- Applications
  • Credit approval, target marketing, fraud detection
Classification: the process
1. Model construction
   • The class label attribute defines the class each item should belong to
   • The set of items used for model construction is called the training set
   • The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
   • Estimate the accuracy of the model
     - on the training set
     - on data beyond the training set (a test set), which gives a more reliable estimate
   • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification: the process (model construction)

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

A classification algorithm builds the classifier (model) from the training data:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Classification: the process (model usage)

The classifier (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') is first evaluated on testing data, then applied to unseen data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured? (A small sketch applying this rule follows.)
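The learned rule is simple enough to run by hand; this small sketch (our own illustration, plain Python) applies it to the testing data and to the unseen tuple.

def tenured(rank, years):
    # the classifier learned during model construction
    return rank == "Professor" or years > 6

test = [("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes")]

for name, rank, years, label in test:
    pred = "yes" if tenured(rank, years) else "no"
    print(name, pred, "(expected:", label + ")")

print("Jeff:", "yes" if tenured("Professor", 4) else "no")

The model misclassifies Merlisa (predicted yes, actual no), so its accuracy on this test set is 75%; for Jeff it predicts tenured = yes.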
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
- Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluating generated models
- Accuracy
  • classifier accuracy: predicting class labels
  • predictor accuracy: guessing values of predicted attributes
- Speed
  • time to construct the model (training time)
  • time to use the model (classification/prediction time)
- Robustness
  • handling noise and missing values
- Scalability
  • efficiency on disk-resident databases
- Interpretability
  • understanding and insight provided by the model
Classification techniques
Decision Trees (1)

[Figure: a decision tree for choosing an investment type. Internal nodes test Income > 20K€, Age > 60, and Married?; each leaf assigns Low, Mid, or High risk.]
Classification techniques
Decision Trees (2)
- How are the attributes in decision trees selected?
  • Two well-known indexes are used (a minimal computation is sketched below)
  • Information gain selects the attribute that is most informative in distinguishing the items between the classes
    - It is biased towards attributes with a large set of values
  • Gain ratio addresses this limitation of information gain
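As a sketch of how information gain could be computed (our own minimal version, reusing the RANK attribute and TENURED label from the model-construction slide): the gain of an attribute is the entropy of the class labels minus the weighted entropy after splitting on that attribute.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # entropy before the split minus weighted entropy after splitting
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

ranks = ["Assistant Prof", "Assistant Prof", "Professor",
         "Associate Prof", "Assistant Prof", "Associate Prof"]
tenured = ["no", "yes", "yes", "yes", "no", "no"]
print(information_gain(ranks, tenured))  # ≈ 0.21 bits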
Classification techniques
Bayesian classifiers
- Bayesian classification
  • A statistical classification technique: it predicts class membership probabilities
  • Founded on Bayes' theorem:

    P(H | X) = P(X | H) · P(H) / P(X)

  • What if X = "red and rounded" and H = "apple"? (A numeric sketch follows.)
  • Performance: the simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
  • Incremental: each training example can increase or decrease the probability that a hypothesis is correct
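A tiny numeric sketch of the theorem on the apple example; all of the probabilities below are invented for illustration only.

# Hypothetical numbers, chosen only to make the arithmetic visible:
p_h = 0.20          # P(H): prior probability that a fruit is an apple
p_x_given_h = 0.80  # P(X|H): probability an apple is red and rounded
p_x = 0.25          # P(X): probability any fruit is red and rounded

p_h_given_x = p_x_given_h * p_h / p_x  # Bayes' theorem
print(p_h_given_x)  # 0.64: observing "red and rounded" raises P(apple) from 0.20 to 0.64

A Naïve Bayes classifier applies exactly this computation once per class, additionally assuming the attributes (color, shape, ...) are independent given the class.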
5 minutes break!
Classification techniques
Support Vector Machines
- One of the most advanced classification techniques

[Figure: two linear separations of the same two-class data set. Left: a separation with a small margin between the classes. Right: the separation with the largest margin.]

- Support vector machines (SVMs) are able to identify the largest-margin separation (the right figure)
Classification techniques
SVMs + Kernel Functions
- Is data always linearly separable?
  • NO!!!
  • Solution: SVMs + kernel functions (a minimal sketch follows)

[Figure: a data set that no straight line can split ("How to split this?"); a plain SVM fails on it, while an SVM with a kernel function separates it.]
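A minimal sketch of this effect (assuming scikit-learn is available; the data set is synthetic): a linear SVM cannot separate two concentric rings, while the same SVM with an RBF kernel can.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: not linearly separable
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear SVM accuracy:", linear.score(X, y))  # well below 1.0
print("SVM + RBF kernel:   ", rbf.score(X, y))     # close to 1.0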
Classification techniques
Lazy learning
- Lazy learning
  • Simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Less time in training, but more time in predicting
  • Uses a richer hypothesis space (many local linear functions), which can yield higher accuracy
- Instance-based learning
  • A subcategory of lazy learning
  • Stores training examples and delays the processing ("lazy evaluation") until a new instance must be classified
  • An example: the k-nearest neighbor approach
Classification techniques
k-nearest neighbor
- All instances correspond to points in the n-dimensional space; x is the instance to be classified
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- For discrete-valued classes, k-NN returns the most common value among the k training examples nearest to x

[Figure: which class should the green circle belong to? It depends on k!
k = 3 → Red
k = 5 → Blue]
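A minimal k-NN sketch (our own, plain Python) reproducing the figure's behaviour: the predicted class is the majority class among the k nearest training points, so the answer changes with k.

from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(training, x, k):
    # training: list of (point, label) pairs; x: the point to classify
    nearest = sorted(training, key=lambda pl: dist(pl[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# toy data mimicking the figure: two reds close to x, three blues farther away
training = [((1, 1), "red"), ((1, 2), "red"),
            ((3, 3), "blue"), ((3, 4), "blue"), ((4, 3), "blue")]
x = (1.5, 1.5)
print(knn_classify(training, x, 3))  # red  (2 red vs 1 blue among the 3 nearest)
print(knn_classify(training, x, 5))  # blue (2 red vs 3 blue among the 5 nearest)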
Prediction techniques
An overview
- Prediction is different from classification
  • Classification predicts categorical class labels
  • Prediction models continuous-valued functions
- Major method for prediction: regression
  • Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  • Linear and multiple regression
  • Non-linear regression
  • Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
- No details here (but see the small sketch below)
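Although the slides skip the details, one-variable linear regression is a two-line illustration (our own sketch; the data points are invented):

import numpy as np

# hypothetical observations, roughly y = 2x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

a, b = np.polyfit(x, y, deg=1)  # least-squares fit of y ≈ a·x + b
print(a, b)                     # slope ≈ 2, intercept ≈ 1
print(a * 5 + b)                # prediction for a new x = 5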
What is Cluster Analysis?
- Cluster: a collection of data objects
  • similar to one another within the same cluster
  • dissimilar to the objects in other clusters
- Cluster analysis
  • Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
  • It belongs to unsupervised learning
- Typical applications
  • As a stand-alone tool to get insight into the data distribution
  • As a preprocessing step for other algorithms (day 1 slides)
Examples of cluster analysis
- Marketing: help marketers discover distinct groups in their customer bases
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
Good clustering
- A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
- It is hard to define "similar enough" or "good enough"
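As one concrete clustering method, here is a minimal k-means sketch (our own, Python with numpy): points are assigned to their nearest centroid, centroids are recomputed as cluster means, and the two steps repeat until the centroids stop moving.

import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of its cluster (assumes no cluster empties)
        new = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

pts = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
labels, centroids = kmeans(pts, k=2)
print(labels)     # two clusters, e.g. [0 0 0 1 1 1]
print(centroids)  # ≈ (0.33, 0.33) and (9.33, 9.33)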
A small example

[Figure: an unlabeled 2-D scatter of points.]

How to cluster this data? This process is not easy in practice. Why?
Visualization of the results
- Presentation of the results or knowledge obtained from data mining in visual forms
- Examples (a small plotting sketch follows)
  • Scatter plots
  • Association rules
  • Decision trees
  • Clusters
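As a small concrete example (our own sketch, assuming matplotlib is available): a scatter plot where the color of each point encodes its cluster label.

import matplotlib.pyplot as plt
import numpy as np

# hypothetical 2-D data with one cluster label per point
pts = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

plt.scatter(pts[:, 0], pts[:, 1], c=labels)  # color encodes the cluster
plt.xlabel("x")
plt.ylabel("y")
plt.title("Clusters shown as a scatter plot")
plt.show()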
Scatter plots (SAS Enterprise Miner)
[Figure: screenshot of a scatter-plot view.]

Association rules (SGI/MineSet)
[Figure: screenshot of an association-rules view.]

Decision trees (SGI/MineSet)
[Figure: screenshot of a decision-tree view.]

Clusters (IBM Intelligent Miner)
[Figure: screenshot of a clustering view.]
Summary
- Why Data Mining?
- Data Mining and KDD
- Data preprocessing
- Some scenarios
- Classification
- Clustering