Data mining and Machine
Learning
Sunita Sarawagi
[email protected]
Data Mining
• Data mining is the process of semi-automatically analyzing large databases to find useful patterns
• Prediction based on past history
  – Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
  – Predict whether a pattern of phone calling card usage is likely to be fraudulent
• Some examples of prediction mechanisms:
  – Classification: given a new item whose class is unknown, predict to which class it belongs
  – Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value
Data Mining (Cont.)
• Descriptive Patterns
  – Associations
    Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too.
    Associations may be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
  – Clusters
    E.g. typhoid cases were clustered in an area surrounding a contaminated well.
    Detection of clusters remains important in detecting epidemics.
Data mining
• Data: of various shapes and sizes
• Patterns/Model: of various shapes and sizes
  – Abstraction of the data into some understandable and useful form
• Basic structure of data
  – Set of instances/objects/cases/rows/points/examples
  – Each instance: a fixed set of attributes/dimensions/columns, either continuous or categorical
• Patterns:
  – Express one attribute as a function of another: classification, regression
  – Group together related instances: clustering, projection, factorization, itemset mining
Classification
• Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Figure: previous customers form labeled data (attributes Age, Salary, Profession, Location and a class label, e.g. customer type); a training phase builds a model/classifier such as the decision rule “Salary > 5 L ∧ Prof. = Exec → good/bad”; in the deployment phase the classifier labels a new customer’s unlabeled data.]
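As a concrete illustration of the training/deployment split above, here is a minimal scikit-learn sketch; the tiny customer table, the feature encoding and the max_depth setting are invented for illustration and are not from the original slides.

# A minimal sketch of the train-then-deploy workflow, using scikit-learn.
# The tiny customer table and feature encoding are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Labeled data: [age, salary (lakhs), is_exec (1/0)] -> "good"/"bad"
X_train = [[25, 3, 0], [40, 8, 1], [35, 6, 1], [50, 2, 0]]
y_train = ["bad", "good", "good", "bad"]

clf = DecisionTreeClassifier(max_depth=2)   # training phase
clf.fit(X_train, y_train)

new_customer = [[30, 7, 1]]                 # deployment phase: unlabeled data
print(clf.predict(new_customer))            # predicted class label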
Applications
• Ad placement in search engines
• Book recommendation
• Citation databases: Google Scholar, CiteSeer
• Resume organization and job matching
• Retail data mining
• Banking: loan/credit card approval
  – predict good customers based on old customers
• Customer relationship management:
  – identify those who are likely to leave for a competitor
• Targeted marketing:
  – identify likely responders to promotions
• Machine translation
• Speech and handwriting recognition
• Fraud detection: telecommunications, financial transactions
  – from an online stream of events, identify fraudulent events
Applications (continued)
• Medicine: disease outcome, effectiveness of treatments
  – analyze patient disease history: find relationships between diseases
• Molecular/Pharmaceutical: identify new drugs
• Scientific data analysis:
  – identify new galaxies by searching for sub-clusters
• Image and vision:
  – Object recognition from images
  – Remove noise from images
  – Identifying scene breaks
The KDD process
• Problem formulation
• Data collection
  – subset data: sampling might hurt if the data is highly skewed
  – feature selection: principal component analysis, heuristic search
• Pre-processing: cleaning
  – name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
• Transformation:
  – map complex objects, e.g. time series data, to features, e.g. frequency
• Choosing the mining task and mining method
• Result evaluation and visualization
• Knowledge discovery is an iterative process
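The cleaning and transformation steps above can be sketched in a few lines of pandas/scikit-learn; the toy customer table below is invented for illustration.

# A minimal sketch of cleaning/transformation steps from the KDD process;
# the toy table is illustrative only.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "income": [30.0, None, 45.0, 30.0, 80.0],
    "age":    [25, 40, 35, 25, 52],
    "spend":  [2.0, 5.0, 3.5, 2.0, 9.0],
})
df = df.drop_duplicates()                                   # duplicate removal
df["income"] = df["income"].fillna(df["income"].median())   # supply missing values

sample = df.sample(frac=1.0, random_state=0)                # (sub)sampling step
reduced = PCA(n_components=2).fit_transform(sample)         # feature transformation
print(reduced)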
Mining products
[Figure: typical product architecture — data is extracted from a data warehouse via ODBC, passed through preprocessing utilities (sampling, attribute transformation), scalable algorithms (association, classification, clustering, sequence mining), mining operations, and visualization tools.]
• Commercial Tools
  – SAS Enterprise Miner
  – SPSS
  – IBM Intelligent Miner
  – Microsoft SQL Server Data Mining services
  – Oracle Data Mining (ODM)
• Free
  – Weka
  – Individual algorithms
Mining operations
• Classification
  – Regression
  – Classification trees
  – Neural networks
  – Bayesian learning
  – Nearest neighbour
  – Radial basis functions
  – Support vector machines
  – Meta learning methods: bagging, boosting
• Clustering
  – hierarchical
  – EM
  – density based
• Sequence mining
  – Time series similarity
  – Temporal patterns
• Itemset mining
  – Association rules
  – Causality
• Sequential classification
  – Graphical models
  – Hidden Markov Models
Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn)
• Regression: linear or any other polynomial
• Decision tree classifier: divide the decision space into piecewise constant regions
• Neural networks: partition by non-linear boundaries
• Probabilistic/generative models
• Lazy learning methods: nearest neighbor
• Support vector machines: a boundary that maximally separates the classes
Decision tree learning
Decision tree classifiers
• Widely used learning method
• Easy to interpret: can be re-represented as if-then-else rules
• Approximates the function by piecewise constant regions
• Does not require any prior knowledge of the data distribution; works well on noisy data
• Has been applied to:
  – classifying medical patients by disease,
  – equipment malfunctions by cause,
  – loan applicants by likelihood of payment,
  – and lots of other applications.
Decision trees
• Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example tree — the root tests Salary < 1 M; one branch tests Prof = teaching (leaves Good / Bad), the other tests Age < 30 (leaves Bad / Good).]
Training Dataset
This follows an example from Quinlan’s ID3

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
Output: A Decision Tree for “buys_computer”
[Figure: root node tests age? — branch “<=30” leads to a test on student? (no → no, yes → yes); branch “31..40” leads directly to yes; branch “>40” leads to a test on credit_rating? (excellent → no, fair → yes).]
Weather Data: Play or not Play?

Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No

Note: Outlook is the weather forecast; no relation to the Microsoft email program.
Example Tree for “Play?”
[Figure: root node tests Outlook — “sunny” leads to a test on Humidity (high → No, normal → Yes); “overcast” leads directly to Yes; “rain” leads to a test on Windy (true → No, false → Yes).]
Topics to be covered
• Tree construction:
  – Basic tree learning algorithm
  – Measures of predictive ability
  – High performance decision tree construction: SPRINT
• Tree pruning:
  – Why prune
  – Methods of pruning
• Other issues:
  – Handling missing data
  – Continuous class labels
  – Effect of training size
Tree learning algorithms
• ID3 (Quinlan 1986)
• Successor C4.5 (Quinlan 1993)
• CART
• SLIQ (Mehta et al.)
• SPRINT (Shafer et al.)
Basic algorithm for tree building
• Greedy top-down construction (a runnable sketch follows below):

  Gen_Tree(node, data)
    if the stopping criterion is met: make node a leaf and stop
    find the best attribute and the best split on that attribute (selection criteria)
    partition the data on the split condition
    for each child j of node: Gen_Tree(node_j, data_j)
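Below is a runnable sketch of this greedy recursion for categorical attributes, using entropy-based selection; it is written from scratch for illustration and the names are my own, not from any particular system.

# A runnable sketch of the greedy Gen_Tree recursion above.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gen_tree(rows, labels, attrs):
    # Stop: make a leaf when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Selection criteria: attribute with the highest information gain.
    def gain(a):
        parts = Counter(r[a] for r in rows)
        rem = sum(cnt / len(rows) *
                  entropy([l for r, l in zip(rows, labels) if r[a] == v])
                  for v, cnt in parts.items())
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    node = {}
    for v in set(r[best] for r in rows):          # partition on the split condition
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node[(best, v)] = gen_tree([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   [a for a in attrs if a != best])
    return node

# Tiny usage example on two rows of the weather data (illustrative subset):
rows = [{"Outlook": "sunny", "Windy": "false"}, {"Outlook": "overcast", "Windy": "false"}]
print(gen_tree(rows, ["No", "Yes"], ["Outlook", "Windy"]))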
Split criteria
• Select the attribute that is best for classification.
• Intuitively, pick the one that best separates instances of different classes.
• Quantifying the intuition: measuring separability
  – First define the impurity of an arbitrary set S consisting of K classes
  – Impurity should be smallest when S consists of only one class, and highest when all classes appear in equal numbers.
  – It should allow computation in multiple stages.
Measures of impurity
• Entropy

  Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i

• Gini

  Gini(S) = 1 - \sum_{i=1}^{k} p_i^2

[Figure: plot of Entropy and Gini as functions of p1 for two classes; both are 0 at p1 = 0 and p1 = 1 and maximal at p1 = 0.5.]

Information gain
• Information gain on partitioning S into r subsets
• Gain = Impurity(S) - sum of the weighted impurity of each subset:

  Gain(S, S_1 .. S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|} Entropy(S_j)
Information gain: example
K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29
|S1| = 70, p1 = 0.8, p2 = 0.2
E(S1) = -0.8 log(0.8) - 0.2 log(0.2) = 0.21
|S2| = 30, p1 = 0.13, p2 = 0.87
E(S2) = -0.13 log(0.13) - 0.87 log(0.87) = 0.16
Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
(logarithms taken to base 10 in this example)
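The following standalone snippet re-computes the worked example with base-10 logarithms; the printed values agree with the slide's figures up to rounding.

# A small check of the impurity formulas and the worked example above.
from math import log10

def entropy(probs):
    return -sum(p * log10(p) for p in probs if p > 0)

E_S  = entropy([0.6, 0.4])
E_S1 = entropy([0.8, 0.2])
E_S2 = entropy([0.13, 0.87])
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)
print(E_S, E_S1, E_S2, gain)
# prints roughly 0.29, 0.22, 0.17 and 0.09; the slide quotes these as 0.29, 0.21, 0.16 and 0.1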
Weather Data: Play or not Play?
(same 14-row weather table as shown earlier)
Which attribute to select?
witten&eibe
Example: attribute “Outlook”
• “Outlook” = “Sunny”:
  info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
• “Outlook” = “Overcast”:
  info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0*log(0) as zero)
• “Outlook” = “Rainy”:
  info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
• Expected information for the attribute:
  info([3,2],[4,0],[3,2]) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits
witten&eibe
Computing the information gain
• Information gain: (information before split) - (information after split)

  gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits

• Information gain for the attributes from the weather data:
  gain("Outlook")     = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity")    = 0.152 bits
  gain("Windy")       = 0.048 bits
witten&eibe
Continuing to split
  gain("Humidity")    = 0.971 bits
  gain("Temperature") = 0.571 bits
  gain("Windy")       = 0.020 bits
witten&eibe

The final decision tree
• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can’t be split any further
witten&eibe
Preventing overfitting
• A tree T overfits if there is another tree T’ that gives higher error on the training data yet lower error on unseen data.
• An overfitted tree does not generalize to unseen instances.
• Happens when the data contains noise or irrelevant attributes and the training set is small.
• Overfitting can reduce accuracy drastically:
  – 10-25%, as reported in Mingers’ 1989 Machine Learning paper
  – Example of over-fitting with binary data.
Training Data vs. Test Data Error Rates
• Compare error rates measured on
  – the learning (training) data
  – a large test set
• Learn R(T) always decreases as the tree grows (Q: Why?)
• Test R(T) first declines, then increases (Q: Why?)
• Overfitting results from too much reliance on learn R(T)
  – Can lead to disasters when applied to new data

No. Terminal Nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91

Digit recognition dataset: CART book
Overfitting example
• Consider the case where a single attribute xj is adequate for classification, but with an error of 20%.
• Consider lots of other noise attributes that enable zero error during training.
• During testing, this detailed tree will have an expected error of (0.8*0.2 + 0.2*0.8) = 32%, whereas the pruned tree with only a single split on xj will have an error of only 20%.
Approaches to prevent overfitting
• Two approaches:
  – Stop growing the tree beyond a certain point
    Tricky, since even when the information gain is zero an attribute might be useful (XOR example)
  – First over-fit, then post-prune (more widely used)
    Tree building divided into phases: a growth phase and a prune phase

Criteria for finding the correct final tree size:
• Three criteria:
  – Cross validation with separate test data
  – Statistical bounds: use all the data for training, but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold)
  – Use some criterion function to choose the best size
    Example: Minimum Description Length (MDL) criterion
Cross validation
• Partition the dataset into two disjoint parts:
  1. Training set, used for building the tree.
  2. Validation set, used for pruning the tree.
  – Rule of thumb: 2/3rds training, 1/3rd validation
• Evaluate the tree on the validation set and, at each leaf and internal node, keep a count of correctly labeled data.
  – Starting bottom-up, prune nodes whose error is less than that of their children.
• What if the training data set size is limited?
  – n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn.
  – Train n classifiers with D - Di as training data and Di as the test set.
  – Pick the average. (how?)
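A minimal n-fold cross validation sketch with scikit-learn follows; the synthetic data and the depth-limited tree are illustrative assumptions.

# 5-fold cross validation of a small decision tree on synthetic data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)  # label depends mostly on feature 0

scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores, scores.mean())   # accuracy on each held-out fold, and the average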
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example (one rule per leaf of the buys_computer tree):
  IF age = “<=30” AND student = “no”             THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes”            THEN buys_computer = “yes”
  IF age = “31…40”                               THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
  IF age = “>40” AND credit_rating = “fair”      THEN buys_computer = “yes”
Rule-based pruning
• Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
• Rule-based: for each leaf of the tree, extract a rule using the conjunction of all tests up to the root.
• On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
• Sort rules by decreasing accuracy.
Regression trees
• Decision trees with continuous class labels: regression trees approximate the function with piecewise constant regions.
• Split criteria for regression trees:
  – Predicted value for a set S = average of all values in S
  – Error: sum of the squared errors of each member of S from the predicted average
  – Pick the split with the smallest average error
• Splits on categorical attributes:
  – Can it be done better than for discrete class labels?
  – Homework.
Other types of trees
• Multi-way trees on low-cardinality categorical data
• Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes]
• Multi-attribute tests on nodes to handle correlated attributes
  – multivariate linear splits [Oblique trees, Murthy 94]
Issues
• Methods of handling missing values
  – assume the majority value
  – take the most probable path
• Allowing varying costs for different attributes
Pros and Cons of decision trees
• Pros
  + Reasonable training time
  + Fast application
  + Easy to interpret
  + Easy to implement
  + Intuitive
• Cons
  – Not effective for very high dimensional data where information about the class is spread in small amounts over many correlated features
    Example: words in text classification
  – Not robust to dropping of important features, even when correlated substitutes exist in the data
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-dimensional space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function can be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: query point xq among positive (+) and negative (-) training points, with the Voronoi cells induced by 1-NN.]
From Jiawei Han's slides
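A from-scratch sketch of the k-NN rule just described (Euclidean distance, majority vote among the k closest training points); the toy points are invented for illustration.

# k-nearest-neighbor classification in a few lines.
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, xq, k=3):
    # Sort training points by Euclidean distance to the query point xq.
    nearest = sorted(zip(train_points, train_labels), key=lambda pl: dist(pl[0], xq))[:k]
    # Return the most common label among the k nearest neighbors.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train_points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
train_labels = ["-", "-", "-", "+", "+", "+"]
print(knn_predict(train_points, train_labels, xq=(7, 7), k=3))   # "+"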
Other lazy learning methods
• Locally weighted regression:
  – learn a new regression equation by weighting each training instance based on its distance from the new instance
• Radial Basis Functions

• Pros
  + Fast training
• Cons
  – Slow during application
  – No feature selection
  – Notion of proximity vague
Bayesian learning
• Assume a probability model on the generation of data.
• Apply Bayes' theorem to find the most likely class:

  c = \arg\max_{c_j} p(c_j | d) = \arg\max_{c_j} \frac{p(d | c_j)\, p(c_j)}{p(d)}

• Naïve Bayes: assume the attributes are conditionally independent given the class value:

  c = \arg\max_{c_j} \frac{p(c_j)}{p(d)} \prod_{i=1}^{n} p(a_i | c_j)

• Easy to learn the probabilities by counting, in one pass over the data
• Useful in some domains, e.g. text
• Numeric attributes must be discretized
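The "learning by counting" idea can be sketched as follows for categorical attributes; the tiny dataset is invented, and the Laplace-style smoothing is an added assumption to avoid zero counts, not something the slide prescribes.

# Naive Bayes by counting, for categorical attributes (illustrative sketch).
from collections import Counter, defaultdict

rows = [({"outlook": "sunny", "windy": "false"}, "no"),
        ({"outlook": "sunny", "windy": "true"},  "no"),
        ({"outlook": "overcast", "windy": "false"}, "yes"),
        ({"outlook": "rain", "windy": "false"}, "yes")]

class_counts = Counter(label for _, label in rows)
attr_counts = defaultdict(Counter)          # (class, attribute) -> value counts
for x, label in rows:
    for a, v in x.items():
        attr_counts[(label, a)][v] += 1

def predict(x):
    def score(c):
        p = class_counts[c] / len(rows)     # prior p(c_j)
        for a, v in x.items():              # product of p(a_i | c_j), smoothed
            cnt = attr_counts[(c, a)]
            p *= (cnt[v] + 1) / (sum(cnt.values()) + len(cnt) + 1)
        return p
    return max(class_counts, key=score)     # arg max over classes (p(d) ignored)

print(predict({"outlook": "sunny", "windy": "false"}))   # prints "no"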
Bayesian belief network
• Find the joint probability over a set of variables, making use of conditional independence whenever known.
[Figure: example network over variables a, b, c, d, e in which a and d are parents of b; variable e is independent of d given b. A conditional probability table for b given a and d is shown, with entries 0.1, 0.2, 0.3, 0.4 and 0.3, 0.2, 0.1, 0.5 over the four parent combinations.]
• Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms
• Learning the structure of the network is harder
Neural networks
• Useful for learning complex data like handwriting, speech and image recognition
[Figure: decision boundaries produced by linear regression, a classification tree, and a neural network on the same data.]
Pros and Cons of Neural Network
• Pros
  + Can learn more complicated class boundaries
  + Fast application
  + Can handle a large number of features
• Cons
  – Slow training time
  – Hard to interpret
  – Hard to implement: trial and error for choosing the number of nodes

Conclusion: use neural nets only if decision trees/NN fail.
Linear discriminants
Problem setting
• Given a binary classification problem with points in d dimensions
• Training data: n vectors with labels, of the form (x1, y1), ..., (xn, yn)
• Each y takes the value 1 or -1
• Goal is to learn a function of the form:
  F(x) = w·x + b = w1x1 + w2x2 + ... + wdxd + b
Linear regression
• Developed for the case of real-valued y
• y = f(x) = w·x + w0
• Rewrite as y = w·x, with each x padded with a 1.
• Error: E(w) = \sum_{i=1}^{n} (y_i - w \cdot x_i)^2
• Minimize the error by differentiating with respect to w
• Minimum reached at w = (X'X)^{-1} X'Y
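A quick numeric illustration of the closed form w = (X'X)^{-1} X'Y, on synthetic data (an illustrative assumption):

# Least squares via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = np.hstack([X, np.ones((50, 1))])          # pad each x with a 1 for the intercept
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=50)   # noisy linear target

w = np.linalg.inv(X.T @ X) @ X.T @ y          # (X'X)^{-1} X'Y
print(w)                                      # close to [2.0, -1.0, 0.5]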
Fisher’s linear discriminant
• Find the hyperplane (w, b) on which the projection of the data is maximally separated.
• Cost function:

  J(w) = \frac{(m_1 - m_2)^2}{p_1 s_1^2 + p_2 s_2^2}

  where mi and si are the mean and standard deviation of the projected points w·x + b for all points x in class i, and pi is the fraction of points in class i.
• The linear discriminant maximizes the above separation when
  – w = S^{-1}(m1 - m2), where m1, m2 are the means of the x values in each class and S is the covariance matrix of the data.
  – b is chosen so that the decision threshold lies at the mid-point of the two projected means, i.e. b = -\frac{1}{2}(m1 - m2)' S^{-1} (m1 + m2).
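A small numpy sketch of the discriminant direction w = S^{-1}(m1 - m2) and the mid-point threshold; the two Gaussian classes are an illustrative assumption.

# Fisher-style linear discriminant on two synthetic classes.
import numpy as np

rng = np.random.default_rng(0)
class1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2))
class2 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(100, 2))

m1, m2 = class1.mean(axis=0), class2.mean(axis=0)
S = np.cov(np.vstack([class1, class2]).T)       # covariance matrix of all the data

w = np.linalg.solve(S, m1 - m2)                 # discriminant direction S^{-1}(m1 - m2)
b = -0.5 * w @ (m1 + m2)                        # threshold at the mid-point of the means

# Classify by the sign of w.x + b: positive for class 1, negative for class 2.
print(np.sign(class1 @ w + b).mean(), np.sign(class2 @ w + b).mean())  # ~ +1 and -1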
Fisher’s discriminant
[Figure: two classes of points (red and black) in the (fi, fj) plane; the discriminant direction maximizes the separation between the averages of the projected red and black points.]
Shortcomings
• Perceptron: ill-posed; several values of w might yield the same zero error on the training data
Support vector machines
• Binary classifier: find the hyperplane providing the maximum margin between vectors of the two classes
[Figure: two classes of points in the (fi, fj) plane separated by a maximum-margin hyperplane.]

Support vector machines
• Separators with a larger margin will have smaller generalization error

Geometry of SVMs
[Figure: separating hyperplane with normal vector w; the margin (distance from the hyperplane to the nearest points) is 1/||w||, the distance of the plane from the origin is -b/||w||, and the distance of a point x from the plane is (w·x + b)/||w||.]
Linear separators
• Most complex real-world applications require more than linear separators.
• One way to get around the problem is to represent the data in a transformed coordinate space in which linear separators can be learnt.
• Example: f(m1, m2, r) = C·m1·m2/r^2 is not linear, but ln f is linear in (ln m1, ln m2, ln r).
Support Vector Machines
• Extendable to:
  – Non-separable problems (Cortes & Vapnik, 1995)
  – Non-linear classifiers (Boser et al., 1992)
• Good generalization performance
  – OCR (Boser et al.)
  – Vision (Poggio et al.)
  – Text classification (Joachims)
• Requires tuning: which kernel, what parameters?
• Several freely available packages: SVMTorch
Locally weighted regression
Learn a new regression equation by weighting each training instance based on its distance from the new instance.

Locally Weighted Regression
• Regression equation:

  \hat{f}(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)

• Find the wi-s so as to minimize the error:

  E(D) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2

• Construct an explicit approximation to f over a local region surrounding the query instance xq.
• The target function f is approximated near xq using the linear function above; minimize the squared error with a distance-decreasing weight K:

  E(x_q) = \frac{1}{2} \sum_{x \in k\text{ nearest neighbors of } x_q} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))
Feature subset selection
• Embedded: selection happens inside the mining algorithm
• Filter: select features in advance
• Wrapper: generate candidate feature subsets and test them on the black-box mining algorithm
  – high cost, but provides better algorithm-dependent features
Meta learning methods
• No single classifier is good in all cases
• Difficult to evaluate the conditions in advance
• Meta learning: combine the effects of the classifiers
  – Voting: sum up the votes of the component classifiers
  – Combiners: learn a new classifier on the outcomes of the previous ones
  – Boosting: staged classifiers
• Disadvantage: interpretation is hard
  – Knowledge probing: learn a single classifier to mimic the meta classifier
Clustering or Unsupervised learning

Applications
• Customer segmentation, e.g. for targeted marketing
  – Group/cluster existing customers based on the time series of their payment history, such that similar customers are in the same cluster.
  – Identify micro-markets and develop policies for each
• Collaborative filtering:
  – group based on common items purchased
• Image tiling
• Text clustering, e.g. scatter/gather
• Compression
Distance functions
• Numeric data: Euclidean, Manhattan distances
  – Minkowski metric: [sum |xi - yi|^m]^(1/m) (see the sketch after this list)
  – Larger m gives higher weight to larger distances
• Categorical data: 0/1 to indicate presence/absence
  – Euclidean distance: gives equal weight to 1-1 and 0-0 matches
  – Hamming distance (# of dissimilarities)
  – Jaccard coefficient: # of matching 1s / (# of 1s) (0-0 matches not counted)
  – Data-dependent measures: similarity of A and B depends on co-occurrence with C
• Combined numeric and categorical data: weighted normalized distance
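A small from-scratch sketch of the Minkowski, Hamming and Jaccard measures listed above:

# Distance measures over numeric and 0/1 categorical vectors.
def minkowski(x, y, m):
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))          # number of dissimilarities

def jaccard(x, y):
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either                               # 0-0 matches are ignored

x, y = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]
print(minkowski(x, y, 2), minkowski(x, y, 1))          # Euclidean and Manhattan
print(hamming(x, y), jaccard(x, y))                    # 2 and 0.5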
Distance functions on high dimensional data
• Example: time series, text, images
• Euclidean measures make all points (nearly) equally far
• Reduce the number of dimensions:
  – choose a subset of the original features using random projections or feature selection techniques
  – transform the original features using statistical methods like Principal Component Analysis
• Define domain-specific similarity measures: e.g. for images, define features like number of objects or color histogram; for time series, define shape-based measures.
• Define non-distance-based (model-based) clustering methods
Clustering methods
• Hierarchical clustering
  – agglomerative vs. divisive
  – single link vs. complete link
• Partitional clustering
  – distance-based: K-means
  – model-based: EM
  – density-based
A Dendrogram Shows How the Clusters are Merged Hierarchically
• Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
[Figure: dendrograms over objects a, b, c, d, e built step by step, read top-down for divisive clustering and bottom-up for agglomerative clustering, together with the pairwise distance matrix and the single-link and complete-link merge orders.]
Pros and Cons
• Single link:
  – chaining effect
  – confused by near overlap
• Complete link:
  – unnecessary splits of elongated point clouds
  – sensitive to outliers
• Several other hierarchical methods are known
Partitional methods: K-means
• Criterion: minimize the sum of squared distances
  – between each point and the centroid of its cluster, or
  – between each pair of points in the cluster
• Algorithm (a runnable sketch follows below):
  – Select an initial partition with K clusters: random, first K, or K well-separated points
  – Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting
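The runnable sketch referred to above: a bare-bones K-means loop (assignment plus centroid update, without the optional merge/split adjustment); the toy points are invented.

# Minimal K-means on 2-D points.
from math import dist

def kmeans(points, k, iters=20):
    centers = points[:k]                         # "first K" initialization
    for _ in range(iters):                       # repeat until (approximately) stable
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to closest center
            clusters[min(range(k), key=lambda i: dist(p, centers[i]))].append(p)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]   # generate new cluster centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, k=2)
print(centers)    # roughly (1.33, 1.33) and (8.33, 8.33)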
Association rules
• Given a set T of groups (baskets) of items
  Example: T = set of baskets of items purchased
    {Milk, cereal}
    {Tea, milk}
    {Tea, rice, bread}
    {cereal}
• Goal: find all rules on itemsets of the form a --> b such that
  – support of a and b > user threshold s
  – conditional probability (confidence) of b given a > user threshold c
• Example: Milk --> bread
• A lot of work has been done on scalable algorithms
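Support and confidence for a rule a --> b can be computed directly from the baskets; the sketch below reuses the toy baskets above with an illustrative rule milk --> cereal.

# Support and confidence of an association rule over a set of baskets.
baskets = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(a, b):
    return support(a | b) / support(a)          # estimate of P(b | a)

a, b = {"milk"}, {"cereal"}
print(support(a | b), confidence(a, b))         # 0.25 and 0.5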
Variants
• High confidence may not imply high correlation
• Use correlations: find the expected support and treat large departures from it as interesting.
  – Brin et al.: a limited attempt.
  – More complete work in the statistical literature on contingency tables.
  – Still too many rules; need to prune...
• Does not imply causality, as in Bayesian networks
Applications of fast itemset counting
Find correlated events:
• Applications in medicine: find redundant tests
• Cross-selling in retail, banking
• Improve the predictive capability of classifiers that assume attribute independence
• New similarity measures for categorical attributes [Mannila et al, KDD 98]
Temporal mining
• Several large data domains are inherently temporal
  – Stock prices
  – Monitoring data: patient monitors, manufacturing processes, performance logs
  – Transaction data
• Lots of prior work from
  – Signal processing
  – Statistics
  – Speech recognition
Temporal mining
• Finding significant patterns along time
• Similarity matches and clustering
• Rules along time series:
  – Drop in kerosene prices --> increase in bronchitis cases
• Classification on time series data:
  – customers with a high variance in balance are likely to default
  – speed fluctuations with significant third-order ARMA coefficients are probably from drunk drivers
• Detecting drift in models along time

Spatial scan statistics
(Paper in reading list)
Mining market
• Around 20 to 30 mining tool vendors
• Major tool players:
  – Clementine
  – IBM’s Intelligent Miner
  – SGI’s MineSet
  – SAS’s Enterprise Miner
  – All pretty much the same set of tools
• Many embedded products:
  – fraud detection
  – electronic commerce applications
  – health care
  – customer relationship management: Epiphany
Summary
• What data mining is, and an overview of the various operations:
  – Classification: regression, nearest neighbour, neural networks, Bayesian
  – Clustering: distance-based (k-means), distribution-based (EM)
  – Itemset counting
• Several operations: the challenge is choosing the right operation for the problem
Resources
• http://www.kdnuggets.com
• SIGKDD: http://www.acm.org/sigkdd