Midterm Review/Coverage
Data mining concepts: what, process, techniques,
problems it can solve (functionality), evaluation
DW and OLAP: concept and functionality, cube
and cuboid, aggregation, OLAP operators (drill
down, …), dimension and hierarchy
Data Preprocessing: concept, missing data, noisy
data, data normalization, data reduction, sampling,
discretization
Data mining primitives: concept and application
Copyright by Jiawei Han, modified
Midterm Review/Coverage
Decision Tree Methods: concept, cross-validation,
tree construction method (gain and gain ratio),
continuous variables, missing values, bias, pruning
method (concept), rules
If you need help…
My office hour (this week only): Wed: 3-5 pm.
TA Office hour:
Huajie Zhang: MC 21, Tuesday, 2-4 pm
Wenxia Jiang, ???, Tuesday, 2-4 pm
Data Mining and Data Warehousing
Introduction
Data warehousing and OLAP
Data preprocessing for mining and warehousing
Concept description: characterization and
discrimination
Classification and prediction
Association analysis
Clustering analysis
Mining complex data and advanced mining
techniques
Trends and research issues
Data Mining and Data Warehousing: Session 5
Classification and Prediction
Session 5. Classification and Prediction
Introduction
Decision Tree Induction
Bayesian Classification
Neural Networks
Other Classification Methods
Prediction Methods
Mining Classification and Prediction
Classification:
Independent variables (description, features) =>
dependent variables (target, class label)
Training set
Prediction:
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
…..
Classification Process (I)

Training data are fed to a classification algorithm, which produces a classifier (model).

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Resulting classifier (model):

IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classification Process (II)

The classifier is applied to testing data to estimate accuracy, and then to unseen data.

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
Based on the training set to classify new data
Unsupervised learning (clustering)
We are given a set of measurements, observations, etc
with the aim of establishing the existence of classes or
clusters in the data
No training data, or the “training data” are not
accompanied by class labels
Evaluating Classification Methods
Predictive accuracy (and more)
Speed and scalability
time to learn
speed of the classifier
Robustness
noise
missing values
Explainability: e.g., decision trees vs. neural networks
Goodness of rules
decision tree size
the compactness of classification rules
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing
use two independent data sets, e.g., training set (2/3), test
set (1/3)
used for data sets with a large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one sub-sample
as test data --- do k times: k-fold cross-validation
for data set with moderate size
Leave-one-out (k = sample size); also bootstrapping
for small data sets
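To make the k-fold procedure concrete, here is a minimal Python sketch; `dataset`, `train`, and `error` are hypothetical placeholders for your own data, learner, and error metric, not anything from the slides:

```python
import random

def k_fold_cv(dataset, k, train, error):
    data = dataset[:]                       # copy so we can shuffle safely
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal subsamples
    errors = []
    for i in range(k):
        test = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)             # learn on k-1 subsamples
        errors.append(error(model, test))   # test on the held-out subsample
    return sum(errors) / k                  # averaged error estimate
```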
Session 5. Classification and Prediction
Introduction
Decision Tree Induction
Bayesian Classification
Neural Networks
Other Classification Methods
Prediction Methods
Training Dataset

An example from Quinlan’s ID3:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree

Outlook
├─ sunny → humidity
│    ├─ high → N
│    └─ normal → P
├─ overcast → P
└─ rain → windy
     ├─ true → N
     └─ false → P
Decision-Tree Classification Methods
The basic top-down decision tree generation
approach usually consists of two phases:
Tree
construction
– At start, all the training examples are at the root.
– Partition examples recursively based on selected
attributes.
Tree
pruning
– Aiming at removing tree branches that may lead to
errors when classifying test data (training data may
contain noise, statistical fluctuations, …)
Primary Issues in Tree Construction
Split criterion: Goodness function
Used to select the attribute to be split at a tree node during
the tree generation phase
Different algorithms may use different goodness functions:
– information gain
– gini index
Branching scheme:
Determining the tree branch to which a sample belongs
binary versus k-ary splitting
When to stop splitting a node further, e.g., using an impurity
measure
Labeling rule: a node is labeled as the class to which most
samples at the node belong
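As an illustration of the construction phase under this scheme, a Python sketch; the goodness function `choose_attribute`, the `(features, label)` example representation, and the `min_size` stopping threshold are assumptions for illustration, not a specific algorithm from the slides:

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute, min_size=2):
    """examples: (features, label) pairs, features a dict keyed by attribute name."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]     # labeling rule
    if len(set(labels)) == 1 or not attributes or len(examples) < min_size:
        return majority                                  # stop: make a leaf
    best = choose_attribute(examples, attributes)        # split criterion
    tree = {"attr": best, "branches": {}, "default": majority}
    by_value = {}
    for features, label in examples:                     # k-ary branching scheme
        by_value.setdefault(features[best], []).append((features, label))
    rest = [a for a in attributes if a != best]
    for value, subset in by_value.items():               # recurse on each partition
        tree["branches"][value] = build_tree(subset, rest, choose_attribute, min_size)
    return tree
```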
Information Gain (ID3/C4.5)
Assume that there are two classes, P and N.
Let the set of examples S contain p elements of class P and
n elements of class N.
The amount of information needed to decide if an
arbitrary example in S belongs to P or N is defined as
I(p, n) = − (p / (p+n)) · log2(p / (p+n)) − (n / (p+n)) · log2(n / (p+n))
Assume that using attribute A as the root in the tree will
partition S into sets {S1, S2, …, Sv}.
If Si contains pi examples of P and ni examples of N, the
information needed to classify objects in all subtrees Si :
E(A) = Σi=1..v ((pi + ni) / (p + n)) · I(pi, ni)
Information Gain -- Example
The attribute A is selected such that the information gain
gain(A) = I(p, n) - E(A)
is maximal, that is, E(A) is minimal, since I(p, n) is the same
for all attributes at a node.
In the given sample data, attribute outlook is chosen to split
at the root :
gain(outlook) = 0.246
gain(temperature) = 0.029
gain(humidity) = 0.151
gain(windy) = 0.048
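These numbers can be reproduced with a short Python sketch of I(p, n), E(A), and gain(A) over the 14-example dataset above (the printed values match the slide up to rounding):

```python
import math
from collections import Counter

# The 14 weather examples: (outlook, temperature, humidity, windy, class).
data = [
    ("sunny","hot","high","false","N"),    ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"),  ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"),("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"),  ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"),("rain","mild","high","true","N"),
]

def info(labels):
    """I(p, n): entropy of a class distribution."""
    total = len(labels)
    return -sum(c/total * math.log2(c/total) for c in Counter(labels).values())

def gain(attr):
    """gain(A) = I(p, n) - E(A) for attribute column attr."""
    labels = [row[-1] for row in data]
    subsets = {}
    for row in data:                       # partition class labels by attribute value
        subsets.setdefault(row[attr], []).append(row[-1])
    expected = sum(len(s)/len(data) * info(s) for s in subsets.values())
    return info(labels) - expected

for i, name in enumerate(["outlook", "temperature", "humidity", "windy"]):
    print(f"gain({name}) = {gain(i):.3f}")
# gain(outlook) = 0.247, temperature = 0.029, humidity = 0.152, windy = 0.048
```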
Improved Measures for Selecting Attributes
Info gain naturally favors attributes with many values.
One alternative measure: gain ratio (Quinlan’86), which
penalizes attributes with many values.
SplitInfo(S, A) = − Σi (|Si| / |S|) · log2(|Si| / |S|)

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
Problem: the denominator can be zero, or close to zero, which
makes GainRatio very large.
There are many other measures. Mingers’91 provides an
experimental analysis of the effectiveness of several selection
measures over a variety of problems.
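A sketch of SplitInfo and GainRatio in Python; the `(attribute value, class label)` pair representation is an assumption for illustration, and the guard for the near-zero denominator noted above simply treats such an attribute as useless:

```python
import math
from collections import Counter

def entropy(counts):
    n = sum(counts)
    return -sum(c/n * math.log2(c/n) for c in counts if c)

def gain_ratio(pairs):
    """pairs: (attribute value, class label) for the examples at a node."""
    labels = Counter(label for _, label in pairs)
    by_val = {}
    for value, label in pairs:
        by_val.setdefault(value, Counter())[label] += 1
    n = len(pairs)
    expected = sum(sum(c.values())/n * entropy(list(c.values()))
                   for c in by_val.values())
    gain = entropy(list(labels.values())) - expected
    split_info = entropy([sum(c.values()) for c in by_val.values()])
    # Guard the zero denominator (single-valued attribute) discussed above.
    return gain / split_info if split_info > 0 else 0.0

pairs = [("sunny","N"), ("sunny","N"), ("overcast","P"), ("rain","P")]
print(gain_ratio(pairs))  # ~0.667
```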
Gini Index
If a data set T contains examples from n classes, the gini index
gini(T) is defined as

gini(T) = 1 − Σj=1..n pj²

where pj is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1
and N2 respectively (N = N1 + N2), the gini index of the split
data is defined as

ginisplit(T) = (N1 / N) · gini(T1) + (N2 / N) · gini(T2)
The attribute that provides the smallest ginisplit(T) is chosen to split
the node (this requires enumerating all possible splitting points for
each attribute).
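A minimal Python sketch of gini(T) and ginisplit(T) for a binary split:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1 - sum((c/n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Size-weighted gini of a two-way split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels)/n * gini(left_labels)
            + len(right_labels)/n * gini(right_labels))

# A pure split has gini_split = 0:
print(gini_split(["P", "P"], ["N", "N", "N"]))  # 0.0
```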
Continuous Attributes in Decision-Tree Induction
Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals.
Temperature   40   48   60   72   80   90
Play tennis   No   No   Yes  Yes  Yes  No
Sort the examples according to the continuous attribute A, then
identify adjacent examples that differ in their target
classification, generate a set of candidate thresholds midway,
and select the one with the maximum gain.
Compare the best numerical split with discrete variable split
Extensible to split continuous attributes into multiple intervals.
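A self-contained sketch of the candidate-threshold search on the temperature example above, scored with information gain as the slide suggests; the midway candidates here are 54 and 85:

```python
import math
from collections import Counter

values = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

def entropy(ls):
    n = len(ls)
    return -sum(c/n * math.log2(c/n) for c in Counter(ls).values())

best = None
for i in range(len(values) - 1):
    if labels[i] != labels[i + 1]:                   # class changes: candidate here
        t = (values[i] + values[i + 1]) / 2          # midway threshold: 54 or 85
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        g = entropy(labels) - e                      # information gain of this split
        if best is None or g > best[1]:
            best = (t, g)

print(best)  # (54.0, ...): the higher-gain candidate of 54 and 85
```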
Dealing with Missing Values in C4.5
During tree construction:
gain = gain of known cases * prob of known
Partition of training cases: weigh according to partition
During tree testing
Weigh according to partition
Take a weighted sum over the different predictions and choose
the most likely one.
Better than assigning the most common value
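A sketch of the gain adjustment during construction: scale the gain computed on known cases by the probability that the attribute is known. `gain_on_known` is a hypothetical placeholder for whatever gain function is applied to the known cases:

```python
def gain_with_missing(rows, attr, gain_on_known):
    """gain = gain of known cases * prob of known (C4.5-style sketch)."""
    known = [r for r in rows if r[attr] is not None]  # cases with a known value
    if not known:
        return 0.0
    # gain_on_known: hypothetical gain function evaluated on known cases only.
    return (len(known) / len(rows)) * gain_on_known(known, attr)
```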
Search and Bias in Decision-Tree Induction
Inductive bias (preference bias): the induction prefers
smaller trees, and
trees that place high-information-gain attributes close to the root
no representation bias
Search strategy: hill-climbing without backtracking; can get
stuck in a local optimum
Why prefer small hypotheses?
Occam’s razor: prefer the simplest hypothesis
Decision separators formed by decision trees
Avoid Overfitting in Classification
A tree generated may overfit the training examples due to
noise or a training set that is too small.
Two approaches to avoid overfitting:
(Stop earlier): Stop growing the tree earlier.
(Post-prune): Allow overfit and then post-prune the tree.
Approaches to determine the correct final tree size:
Separate training and testing sets or use cross-validation.
Use all the data for training, but apply a statistical test to
estimate whether expanding or pruning a node is likely to
improve performance over the entire distribution.
Use minimum description length (MDL) principle: halting
growth of the tree when the encoding is minimized.
Rule post-pruning (C4.5): converting to rules, then pruning.
Tree Pruning
A decision tree constructed using the training data may have
too many branches/leaf nodes.
Caused by noise, overfitting
May result in poor accuracy for unseen samples
Prune the tree: merge a subtree into a leaf node.
Using a set of data different from the training data.
At a tree node, if the accuracy without splitting is higher
than the accuracy with splitting, replace the subtree with
a leaf node, label it using the majority class.
Issues:
Obtaining the testing data
Criteria other than accuracy (e.g., minimum description
length)
Pruning Criterion
Use a separate set of examples to evaluate the utility of post-pruning nodes from the tree
CART uses cost-complexity pruning
Apply a statistical test to estimate whether expanding (or
pruning) a particular node improves accuracy
C4.5 uses pessimistic pruning
– error rate: upper limit of the binomial distribution
Minimum Description Length
SLIQ and SPRINT use MDL pruning
C4.5: Popular Decision Tree Learner
Ross Quinlan, a machine learning researcher
Improved (commercial) version: See5/C5
Free demo version (max 200 cases)
http://www.rulequest.com/
Try it out!
Also Cognos Scenario Demo
C4.5rules: Rules from the Tree, and Pruning
For each leaf, form a rule by collecting all
attribute-value pairs from the path from root
The rule set is equivalent to the tree
Pruning:
For each condition in a rule, delete it if the rule does better when
tested on a cross-validation set, or by estimated error on the same
training set
May greatly simplify the rule set
Difference between rule set and tree
Format of the rule vs tree
Unique decision vs possible conflicts that need resolution
More to Think about…
Different cost for errors
Classification errors
Tree that minimizes the total loss
Different cost in testing attributes
Tree that minimizes the cost of testing/using the tree
Nodes that take combination of attributes
Logical combinations
Linear combination
Boosting Techniques (I)
Boosting increases classification accuracy.
It can be applied to decision trees or Bayesian
classifiers
Learn a series of classifiers
Combine classifiers by (weighted) voting
Boosting requires only linear time and a constant
increase in space
Bagging
Draw a random sample with replacement, of size N
Build a decision tree from the sample
Repeat 10-20 times
To predict: all trees predict, and take the voted
prediction
Improves unstable classifiers
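A minimal bagging sketch in Python; `train_tree` and `tree_predict` are placeholders for any base learner:

```python
import random
from collections import Counter

def bagging_fit(examples, train_tree, rounds=15):
    n = len(examples)
    # Each round: bootstrap sample of size n (with replacement), one tree per sample.
    return [train_tree([random.choice(examples) for _ in range(n)])
            for _ in range(rounds)]

def bagging_predict(trees, x, tree_predict):
    votes = Counter(tree_predict(t, x) for t in trees)
    return votes.most_common(1)[0][0]   # the voted prediction
```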
Adaptive Boosting: AdaBoost
Assign every example an equal weight 1/N
Do For t = 1, 2, …, T
Obtain a hypothesis (classifier) h(t) under w(t)
Calculate the error of h(t) and re-weight the examples
based on the error
Normalize w(t+1) to sum to 1
Output a weighted sum of all the hypotheses, with
each hypothesis weighted according to its accuracy
on the training set
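A sketch of these weight updates in Python, assuming labels in {-1, +1} and a `weak_learn` placeholder that trains a classifier under the example weights:

```python
import math

def adaboost(examples, labels, weak_learn, T=10):
    n = len(examples)
    w = [1.0 / n] * n                               # equal initial weights
    ensemble = []
    for _ in range(T):
        h = weak_learn(examples, labels, w)         # hypothesis under w(t)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if err == 0 or err >= 0.5:                  # no useful weak hypothesis
            break
        alpha = 0.5 * math.log((1 - err) / err)     # hypothesis weight
        w = [wi * math.exp(-alpha if h(x) == y else alpha)  # re-weight by error
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                # normalize w(t+1) to sum to 1
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * (1 if h(x) == 1 else -1) for alpha, h in ensemble)
    return 1 if score >= 0 else -1                  # weighted vote
```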
Classification and Databases
Classification is a classical problem extensively
studied by
statisticians
AI, especially machine learning researchers
Database researchers re-examined the problem in
the context of large databases
most previous studies used small data sets, and most
algorithms are memory resident
recent data mining research contributes to
Scalability
Generalization-based classification
Parallel and distributed processing
Classifying Large Dataset
Decision trees seem to be a good choice
relatively faster learning speed than other classification
methods
can be converted into simple and easy to understand
classification rules
can be used to generate SQL queries for accessing
databases
have comparable classification accuracy to other methods
Objectives
Classifying data sets with millions of examples and a
few hundred or even thousands of attributes with
reasonable speed.
Scalable Decision Tree Methods
Most algorithms assume data can fit in memory.
Data mining research contributes to the scalability
issue, especially for decision trees.
Successful examples
SLIQ (EDBT’96, Mehta et al.)
SPRINT (VLDB’96, J. Shafer et al.)
PUBLIC (VLDB’98, Rastogi & Shim)
RainForest (VLDB’98, Gehrke et al.)
RainForest
Gehrke, Ramakrishnan, and Ganti (VLDB’98)
A generic algorithm that separates the scalability
aspects from the criteria that determine the quality
of the tree.
Based on two observations:
Tree classifiers follow a greedy top-down induction
schema
When evaluating each attribute, the information about
the class label distribution is enough.
AVC-list (attribute, value, class label) data structure
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree
induction (Kamber et al’97).
Classification at primitive concept levels
E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy
classification-trees
Semantic interpretation problems.
Cube-based multi-level classification
Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
Presentation of Classification Rules
Session 5. Classification and Prediction
Introduction
Decision Tree Induction
Bayesian Classification
Neural Networks
Other Classification Methods
Prediction Methods
Bayesian Classification: Why?
Probabilistic learning: calculate explicit probabilities for
hypotheses; among the most practical approaches to certain
types of learning problems.
Incremental: Each training example can incrementally increase
or decrease the probability that a hypothesis is correct. Prior
knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities.
Standard: Even in cases where Bayesian methods prove
computationally intractable, they can provide a standard of
optimal decision making against which other methods can be
measured.
Bayesian Theorem
Given training data D, the posterior probability of a
hypothesis h, P(h|D), follows Bayes theorem:

P(h|D) = P(D|h) · P(h) / P(D)
MAP (maximum a posteriori) hypothesis:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) · P(h)
Practical difficulty: requires initial knowledge of
many probabilities; significant computational cost.
More on Bayesian Theorem...
p(h|D) = p(D|h) p(h) / p(D)
p(h): prior probability, before seeing the data (evidence)
D: data, observation, evidence, facts…
p(h|D): posterior probability, after seeing the evidence D (only)
Hard to calculate… but often we just need to compare:
p(h1|D) / p(h2|D) = p(D|h1) p(h1) / (p(D|h2) p(h2))
A true equation underlying all causal modeling
diagnosis: medical, car, …
p(+|cancer)=0.98 so p(- |cancer)=0.02
p(+ |no_cancer)=0.03 so p(- |no_cancer)=0.97
If I test positive, how likely is it that I have cancer?
p(cancer|+) / p(no_cancer|+) = p(+|c)p(c) / (p(+|no_c)p(no_c))
= 0.98 p(c) / 0.03 p(no_c)
If p(cancer) = 0.05, then the ratio = 0.98×0.05 / (0.03×0.95) ≈ 1.72,
so p(cancer|+) = 1.72 / (1 + 1.72) ≈ 63%
If p(cancer) = 0.001, then the ratio = 0.98×0.001 / (0.03×0.999) ≈ 0.033,
so p(cancer|+) = 0.033 / 1.033 ≈ 3.2%
Prior probability matters a lot.
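The same posterior-odds arithmetic in a few lines of Python, with the sensitivity 0.98 and false-positive rate 0.03 from above:

```python
def posterior_cancer(prior, sens=0.98, false_pos=0.03):
    """p(cancer|+) from the posterior odds p(cancer|+)/p(no_cancer|+)."""
    odds = (sens * prior) / (false_pos * (1 - prior))
    return odds / (1 + odds)

print(posterior_cancer(0.05))    # ~0.63
print(posterior_cancer(0.001))   # ~0.032
```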
A true equation underlying all causal modeling
Fortune teller: are they real?
p(long_life | born_in_May) / p(short_life | born_in_May)
= p(May | long) p(long) / (p(May | short) p(short))
= p(long) / p(short), since birth month is independent of lifespan
If fortune tellers want to maximize predictive
accuracy, they just predict the most likely events
given your age (and looks, …)
More on Bayesian Theorem...
D is a set of “observations”: D = d1, d2, …
P(h | d1, d2, …) = P(d1, d2, … | h) · P(h) / P(d1, d2, …)

di not being in D means it was not observed (unknown), not that it is false
D is continuously updated, and so is p(h|D)
To calculate p(h|D), try p(D|h). Why is p(D|h) easier?
p(h|D) is the “diagnostic model”; p(D|h) is the “causal model”,
which is easier to think about and to obtain
Naïve Bayes Classifier (I)
A simplifying assumption: attributes are conditionally
independent given the class:

P(Cj | V) ∝ P(Cj) · Πi=1..n P(vi | Cj)
Greatly reduces the computation cost: only class-conditional
counts are needed.
Example: p(play|Outlook=s, T=m, H=h, W=t) =
p(play) p(O=s|play) p(T=m|play) p(H=h|play) p(W=t|play)
= 9/14 * 2/9 * … = ...
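A sketch of this computation in Python; it assumes `data` holds rows of attribute values ending with the class label, as in the 14-row weather table from the information-gain sketch earlier:

```python
from collections import Counter

def nb_scores(data, query):
    """Unnormalized naive Bayes scores p(C) * prod_i p(v_i | C) per class."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for c, cc in class_counts.items():
        score = cc / len(data)                 # p(C), e.g. 9/14 for class P
        for i, v in enumerate(query):
            match = sum(1 for row in data if row[-1] == c and row[i] == v)
            score *= match / cc                # p(v_i | C), e.g. 2/9
        scores[c] = score
    return scores

# e.g. nb_scores(data, ("sunny", "mild", "high", "true"))
# gives N a higher score than P on the weather data.
```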
Project
Go directly to:
www.csd.uwo.ca/faculty/ling/cs411a/411proj.html
Training Dataset

An example from Quinlan’s ID3 (same data as before):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Naive Bayesian Classifier (II)

Given a training set, we can compute the probabilities.
Extremely efficient; the training data need not be kept in memory.

Outlook      P    N        Humidity   P    N
sunny        2/9  3/5      high       3/9  4/5
overcast     4/9  0        normal     6/9  1/5
rain         3/9  2/5

Temperature  P    N        Windy      P    N
hot          2/9  2/5      true       3/9  3/5
mild         4/9  2/5      false      6/9  2/5
cool         3/9  1/5
Bayesian Belief Networks (I)

(Figure: a Bayesian belief network with nodes FamilyHistory, Smoker,
LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and
Smoker are the parents of LungCancer.)

The conditional probability table for the variable LungCancer:

      (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC    0.8       0.5        0.7        0.1
~LC   0.2       0.5        0.3        0.9
Bayesian Belief Networks (II)
A Bayesian belief network allows subsets of the
variables to be conditionally independent
A graphical model of causal relationships
Several cases of learning Bayesian belief networks:
Given both network structure and all the variables: easy.
Given network structure but only some variables.
When the network structure is not known in advance.
Session 5. Classification and Prediction
Introduction
Decision Tree Induction
Bayesian Classification
Neural Networks
Other Classification Methods
Prediction Methods
Neural Networks
Advantages
prediction accuracy is generally high
robust, works when training examples contain errors
output may be discrete, real-valued, or a vector of
several discrete or real-valued attributes
fast evaluation of the learned target function.
Criticism
long training time
difficult to understand the learned function (weights).
not easy to incorporate domain knowledge
ftp://ftp.sas.com/pub/neural/FAQ.html
A Neuron

(Figure: inputs x0, x1, …, xn with weights w0, w1, …, wn feed a
weighted sum, offset by a bias −μk, which passes through an activation
function f to produce the output y.)

The n-dimensional input vector x is mapped into variable y by means of
the scalar product and a nonlinear function mapping:

y = f( Σi wi xi − μk )
Multi-Layer Perceptron

Output node p (sigmoid activation):

S_p = σ( Σm v_m^p h_m ),   where σ(x) = 1 / (1 + e^(−x))

Hidden node m (tanh activation):

h_m = f( Σl w_l^m x_l + r_m ),   where f(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Input nodes take the input vector x; w are the input-to-hidden weights
and v the hidden-to-output weights.
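A forward pass matching these formulas, sketched with numpy; the layer sizes and weights here are illustrative random values, not trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 2
W = rng.normal(size=(n_hidden, n_in))   # input-to-hidden weights w_l^m
r = rng.normal(size=n_hidden)           # hidden biases r_m
V = rng.normal(size=(n_out, n_hidden))  # hidden-to-output weights v_m^p

def forward(x):
    h = np.tanh(W @ x + r)              # hidden nodes: f = tanh
    return 1 / (1 + np.exp(-(V @ h)))   # output nodes: sigmoid

print(forward(np.array([0.1, 0.5, -0.3, 0.9])))
```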
Models and Architectures
Learning paradigms
classification
clustering
reinforcement
Network topologies
feed-forward
limited recurrent
fully recurrent
Learning algorithm
Cross-validation may minimize the risk of overfitting.
Learning Paradigms (I)

(1) Classification: adjust weights using
    Error = Desired − Actual
(2) Reinforcement: adjust weights using a reinforcement signal

(Figure: inputs feed the network, which produces the actual output.)
Learning Paradigms (II) --- Clustering

(0) Inputs are presented.
(1) Outputs compete to be the winner.
(2) Adjust the weights of the winner toward the input pattern.
Learning Algorithms
Back propagation for classification
Kohonen feature maps for clustering
Recurrent back propagation for classification
Radial basis function for classification
Adaptive resonance theory
Probabilistic neural networks
Major Steps
Constructing a network
input data representation
selection of number of layers, number of nodes in each
layer
Training the network using training data
Pruning the network
Interpreting the results
Constructing the Network (I) --- Input Data Representation

Categorical (e.g., “doctor”): hashed or looked up to a code
Discrete numeric (e.g., 3): normalized (e.g., 0.3) or coded
  – 1 of N: 100; thermometer: 111; binary: 011
Continuous numeric: thresholded or discretized, then treated as above
Constructing the Network (II)
The number of input nodes corresponds to the
dimensionality of the input tuples
Thermometer coding example:
– age 20-80: 6 intervals
– [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111
Number of hidden nodes: adjusted during training
Number of output nodes: number of classes
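A sketch of this thermometer coding in Python, for age in [20, 80) with six ten-year intervals:

```python
def thermometer(age, low=20, high=80, bits=6):
    width = (high - low) // bits
    idx = min(max((age - low) // width, 0), bits - 1)  # interval index 0..5
    return "0" * (bits - 1 - idx) + "1" * (idx + 1)

print(thermometer(25))  # 000001  for [20, 30)
print(thermometer(35))  # 000011  for [30, 40)
print(thermometer(79))  # 111111  for [70, 80)
```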
Session 5. Classification and Prediction
Introduction
Decision Tree Induction
Bayesian Classification
Neural Networks
Other Classification Methods
Prediction Methods
Other Classification Methods
Genetic algorithm
Instance-based method
k-nearest neighbor classifier
case-based reasoning
Fuzzy logic
Genetic Algorithm (I)
GA: based on an analogy to biological evolution.
Encoding the problem/solution by string(s) of “genes”
A diverse population (pool) of competing hypotheses is
maintained.
At each iteration (generation), each string is evaluated for
fitness. The most fit members are selected to produce new
offspring that replace the least fit ones.
Hypotheses are encoded by strings that are combined
by crossover operations, and subject to random
mutation, to produce offspring.
Learning is viewed as a special case of optimization.
Genetic Algorithm (II)

IF (level = doctor) AND (GPA = 3.6) THEN result = approval

Encoded as a bit string (level = 001, GPA = 111, result = 10): 00111110

Single-point crossover after the third bit:
00111110, 10001101  →  10011110, 00101101
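The crossover and a random mutation operator, sketched in Python; the printed pair reproduces the offspring above:

```python
import random

def crossover(p1, p2, point):
    """Single-point crossover of two bit strings at the given position."""
    return p2[:point] + p1[point:], p1[:point] + p2[point:]

def mutate(s, rate=0.01):
    """Flip each bit independently with the given probability."""
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in s)

print(crossover("00111110", "10001101", 3))  # ('10011110', '00101101')
```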
Instance-Based Methods
Instance-based learning: Store training examples
and delay the processing (“lazy evaluation”) until a
new instance must be classified.
Typical approaches:
k-nearest neighbor approach:
– Instances represented as points in a Euclidean space.
Locally weighted regression:
– Constructs a local approximation.
Case-based reasoning:
– Uses symbolic representations and knowledge-based inference.
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean
distance.
The target function could be discrete- or real- valued.
For discrete-valued, the k-NN returns the most common
value among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN
for a typical set of training examples.
(Figure: a query point xq among + and − training examples.)
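A sketch of k-NN classification in Python, with the 1/d² distance weighting from the next slide as an option:

```python
import math
from collections import defaultdict

def knn_classify(train, x_q, k=3, weighted=False):
    """train: list of (point, label); x_q: query point (tuple of floats)."""
    dist = lambda a, b: math.dist(a, b)            # Euclidean distance
    nearest = sorted(train, key=lambda p: dist(p[0], x_q))[:k]
    votes = defaultdict(float)
    for point, label in nearest:
        d = dist(point, x_q)
        votes[label] += 1 / d**2 if weighted and d > 0 else 1
    return max(votes, key=votes.get)               # most common / heaviest label

print(knn_classify([((0, 0), "-"), ((1, 0), "-"), ((3, 3), "+"), ((4, 3), "+")],
                   (3, 2), k=3))                   # "+"
```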
Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions:
calculate the mean value of the k nearest neighbors.
Distance-weighted nearest neighbor algorithm.
Weight the contribution of each of the k neighbors
according to their distance to the query point xq.
– giving greater weight to closer neighbors:

w = 1 / d(xq, xi)²
Similarly, we can distance-weight the instances for real-valued target functions.
Robust to noisy data by averaging k-nearest neighbors.
Curse of dimensionality: the distance between neighbors can be
dominated by irrelevant attributes. To overcome it, stretch the
axes or eliminate the least relevant attributes.
Locally Weighted Regression
Construct an explicit approximation to f over a local
region surrounding query instance xq.
Locally weighted linear regression:
The target function f is approximated near xq using the
linear function:

f̂(x) = w0 + w1·a1(x) + … + wn·an(x)

Minimize the squared error over the k nearest neighbors of xq,
with a distance-decreasing weight (kernel) K:

E(xq) = (1/2) Σ_{x ∈ kNN(xq)} ( f(x) − f̂(x) )² · K( d(xq, x) )

The gradient descent training rule:

Δwj = η Σ_{x ∈ kNN(xq)} K( d(xq, x) ) ( f(x) − f̂(x) ) aj(x)
In most cases, the target function is approximated
by a constant, linear, or quadratic function.
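A sketch of locally weighted linear regression via weighted least squares, a closed-form alternative to the gradient rule above (numpy assumed; the Gaussian kernel and sample data are illustrative):

```python
import numpy as np

def lwr_predict(X, y, x_q, k=5):
    d = np.linalg.norm(X - x_q, axis=1)
    idx = np.argsort(d)[:k]                      # k nearest neighbors of x_q
    K = np.exp(-d[idx] ** 2)                     # distance-decreasing kernel
    A = np.hstack([np.ones((k, 1)), X[idx]])     # design matrix with intercept w0
    sw = np.sqrt(K)                              # weighted least squares via sqrt(K)
    w, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
    return w[0] + x_q @ w[1:]                    # f̂(x_q)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1])
print(lwr_predict(X, y, np.array([2.5])))        # ~2.5 on near-linear data
```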
Case-Based Reasoning
Similarity (to k-nearest neighbors and locally weighted
regression): lazy evaluation + analyzing similar instances.
Difference: Instances are not “points in a Euclidean space”.
Example: Water faucet problem in CADET (Sycara et al’92).
Methodology:
Instances represented by rich symbolic descriptions (e.g.,
function graphs).
Multiple retrieved cases may be combined.
Tight coupling between case retrieval, knowledge-based
reasoning, and problem solving.
Research issues
Indexing based on syntactic similarity measures and, when
that fails, backtracking and adapting to additional cases.
Remarks on Lazy vs. Eager Learning
Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation.
Key differences:
Lazy methods may consider the query instance xq when
deciding how to generalize beyond the training data D.
Eager methods cannot, since they have already chosen a
global approximation before seeing the query.
Efficiency: lazy methods take less time to train but more time to predict.
Accuracy:
Lazy method effectively uses a richer hypothesis space
since it uses many local linear functions to form its implicit
global approximation to the target function.
Eager: must commit to a single hypothesis that covers the
entire instance space.
Session 5. Classification and Prediction
Introduction
Decision Tree Induction
Bayesian Classification
Neural Networks
Other Classification Methods
Prediction Methods
Predictive Modeling in Databases
Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
One can only predict value ranges or category distributions.
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction.
Determine the major factors which influence the prediction.
Data relevance analysis: uncertainty measurement, entropy
analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis.
Regression Analysis and Log-Linear Models
in Prediction
Linear regression: Y = α + β X
Two parameters, α and β, specify the line and are to be
estimated by using the data at hand,
applying the least squares criterion to the known values of
Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2.
Many nonlinear functions can be transformed into the
above.
Log-linear models:
The multi-way table of joint probabilities is
approximated by a product of lower-order tables.
Probability: p(a, b, c, d) = αab βac χad δbcd
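The least-squares estimates of α and β for Y = α + β X, sketched in Python from the normal equations:

```python
def linear_regression(xs, ys):
    """Return (alpha, beta) minimizing the squared error of Y = alpha + beta*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx                  # line passes through the means
    return alpha, beta

print(linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # ~ (0.15, 1.94)
```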
Prediction: Numerical Data
Prediction: Categorical Data
Conclusions
Classification is an extensively studied problem (mainly in
statistics, machine learning & neural networks)
Classification is probably one of the most widely used data
mining techniques with a lot of applications.
Scalability is still an important issue for database
applications.
Combining classification with database techniques should
be a promising research topic.
Research Direction: Classification of non-relational data,
e.g., text, spatial, multimedia, etc.