Midterm Review/Coverage
- Data mining concepts: what it is, the process, techniques, the problems it can solve (functionality), evaluation
- DW and OLAP: concept and functionality, cube and cuboid, aggregation, OLAP operators (drill down, …), dimension and hierarchy
- Data Preprocessing: concept, missing data, noisy data, data normalization, data reduction, sampling, discretization
- Data mining primitives: concept and application

Midterm Review/Coverage
- Decision Tree Methods: concept, cross-validation, tree construction method (gain and gain ratio), continuous variables, missing values, bias, pruning method (concept), rules
If you need help…
- My office hour (this week only): Wed 3-5 pm
- TA office hours:
  – Huajie Zhang: MC 21, Tuesday 2-4 pm
  – Wenxia Jiang: ???, Tuesday 2-4 pm
Data Mining and Data Warehousing
- Introduction
- Data warehousing and OLAP
- Data preprocessing for mining and warehousing
- Concept description: characterization and discrimination
- Classification and prediction
- Association analysis
- Clustering analysis
- Mining complex data and advanced mining techniques
- Trends and research issues
Data Mining and Data Warehousing: Session 5
Classification and Prediction
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Mining Classification and Prediction
- Classification:
  – independent variables (description, features) => dependent variables (target, class label)
  – training set
- Prediction
- Typical applications:
  – credit approval
  – target marketing
  – medical diagnosis
  – treatment effectiveness analysis
  – …
Classification Process (I)
The training data below are fed to classification algorithms to produce a classifier (model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (II)
The classifier is evaluated on the testing data below and then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured? (a sketch applying the learned rule follows below)

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
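To make the two phases concrete, here is a minimal Python sketch (not from the original slides; names are illustrative) that applies the learned rule to the testing data and to the unseen tuple:

```python
def predict_tenured(rank: str, years: int) -> str:
    """Classifier (model) learned in Process (I): professor OR years > 6 -> tenured."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

test_data = [            # (NAME, RANK, YEARS, actual TENURED)
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(r, y) == actual for _, r, y, actual in test_data)
print(f"test accuracy: {correct}/{len(test_data)}")          # 3/4: Merlisa is misclassified
print("Jeff, Professor, 4 ->", predict_tenured("Professor", 4))  # 'yes'
```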
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data are classified based on the training set
- Unsupervised learning (clustering)
  – We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data
  – No training data, or the “training data” are not accompanied by class labels
Evaluating Classification Methods
- Predictive accuracy (and more)
- Speed and scalability
  – time to learn
  – speed of the classifier
- Robustness
  – noise
  – missing values
- Explainability: e.g., decision trees vs. neural networks
- Goodness of rules
  – decision tree size
  – the compactness of classification rules
Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  – use two independent data sets, e.g., training set (2/3) and test set (1/3)
  – used for data sets with a large number of samples
- Cross-validation (see the sketch below)
  – divide the data set into k subsamples
  – use k-1 subsamples as training data and one subsample as test data; do this k times: k-fold cross-validation
  – for data sets of moderate size
- Bootstrapping (leave-one-out: k = sample size)
  – for small data sets
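A rough illustration of k-fold cross-validation as described above; the `train_fn` and `test_fn` callables are an assumed interface (any learner that can be fit on a list of examples and report a misclassification), not something given in the slides:

```python
import random

def k_fold_error_rate(examples, train_fn, test_fn, k=10, seed=0):
    """Estimate the error rate by k-fold cross-validation: split the data into k
    subsamples, train on k-1 of them, and test on the held-out one, k times."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]              # k roughly equal subsamples
    errors = 0
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                         # e.g. build a decision tree
        errors += sum(test_fn(model, x) for x in test)  # test_fn returns 1 per misclassified case
    return errors / len(data)
```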
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Training Dataset
An example from Quinlan’s ID3:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree

Outlook?
├─ sunny    → Humidity?  high → N,  normal → P
├─ overcast → P
└─ rain     → Windy?     true → N,  false → P
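The same tree can be written directly as code. A small sketch, treating windy as a boolean (this representation is assumed, not from the slides):

```python
def classify(outlook, humidity, windy):
    """Classify a weather example with the sample decision tree above."""
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"
    if outlook == "overcast":
        return "P"
    return "N" if windy else "P"   # outlook == "rain"

print(classify("sunny", "high", False))   # 'N'
print(classify("rain", "high", False))    # 'P'
```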
Decision-Tree Classification Methods
- The basic top-down decision tree generation approach usually consists of two phases:
  – Tree construction
    - At the start, all the training examples are at the root.
    - Partition the examples recursively based on selected attributes.
  – Tree pruning
    - Aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, statistical fluctuations, …).
Primary Issues in Tree Construction
- Split criterion: goodness function
  – used to select the attribute to be split at a tree node during the tree generation phase
  – different algorithms may use different goodness functions: information gain, gini index
- Branching scheme:
  – determining the tree branch to which a sample belongs
  – binary versus k-ary splitting
- When to stop further splitting of a node, e.g. an impurity measure
- Labeling rule: a node is labeled as the class to which most samples at the node belong
Information Gain (ID3/C4.5)
- Assume that there are two classes, P and N.
- Let the set of examples S contain p elements of class P and n elements of class N.
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

  $$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

- Assume that using attribute A as the root of the tree will partition S into sets {S1, S2, …, Sv}.
- If Si contains pi examples of P and ni examples of N, the expected information needed to classify objects in all subtrees Si is

  $$E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
Information Gain -- Example
- The attribute A is selected such that the information gain
  gain(A) = I(p, n) - E(A)
  is maximal, that is, E(A) is minimal, since I(p, n) is the same for all attributes at a node.
- In the given sample data, attribute outlook is chosen to split at the root (the sketch below reproduces these numbers):
  gain(outlook) = 0.246
  gain(temperature) = 0.029
  gain(humidity) = 0.151
  gain(windy) = 0.048
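A small Python sketch of the gain computation, assuming the 14 weather examples are available as a list of dicts; the helper names `info` and `gain` are illustrative, not from the slides:

```python
from math import log2
from collections import defaultdict

def info(p, n):
    """I(p, n): information needed to decide P vs. N in a set with p P-examples and n N-examples."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

def gain(examples, attr, class_attr="Class"):
    """gain(A) = I(p, n) - E(A), for examples given as dicts like
    {"Outlook": "sunny", "Temperature": "hot", ..., "Class": "N"}."""
    p = sum(1 for e in examples if e[class_attr] == "P")
    n = len(examples) - p
    subsets = defaultdict(list)
    for e in examples:
        subsets[e[attr]].append(e)
    expected = sum(
        (len(s) / len(examples)) * info(sum(1 for e in s if e[class_attr] == "P"),
                                        sum(1 for e in s if e[class_attr] == "N"))
        for s in subsets.values())
    return info(p, n) - expected

# With the 14 weather examples above: gain(examples, "Outlook") ≈ 0.246, gain(examples, "Windy") ≈ 0.048.
```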
Improved Measures for Selecting Attributes
- Information gain naturally favors attributes with many values.
- One alternative measure: the gain ratio (Quinlan’86), which penalizes attributes with many values:

  $$SplitInfo(S, A) = -\sum_i \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}, \qquad GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$

- Problem: the denominator can be 0 or close to 0, which makes GainRatio very large (see the sketch below).
- There are many other measures. Mingers’91 provides an experimental analysis of the effectiveness of several selection measures over a variety of problems.
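A companion sketch of the gain ratio, reusing the hypothetical `gain()` helper and example format from the previous sketch, with a guard for the near-zero denominator noted above:

```python
from math import log2
from collections import Counter

def gain_ratio(examples, attr):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    counts = Counter(e[attr] for e in examples)
    split_info = -sum((c / len(examples)) * log2(c / len(examples)) for c in counts.values())
    # Guard: when SplitInfo is (near) zero the ratio blows up, as the slide warns.
    return gain(examples, attr) / split_info if split_info > 1e-9 else float("inf")
```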
Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as

  $$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

  where pj is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

  $$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$$

- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute). A sketch follows below.
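A minimal sketch of the two formulas above, with class labels passed as plain lists (helper names assumed, not from the slides):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in the label list."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Size-weighted gini of a binary split T1/T2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))
```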
Continuous Attributes in Decision-Tree Induction
- Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.

  Temperature   40  48  60   72   80   90
  play tennis   No  No  Yes  Yes  Yes  No

- Sort the examples according to the continuous attribute A, then identify adjacent examples that differ in their target classification, generate a set of candidate thresholds midway between them, and select the one with the maximum gain (see the sketch below).
- Compare the best numerical split with the discrete-variable splits.
- Extensible to splitting continuous attributes into multiple intervals.
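A sketch of the candidate-threshold step described above, applied to the temperature example from this slide (function name assumed):

```python
def candidate_thresholds(values, labels):
    """Sort by the continuous attribute and place a candidate split midway
    between adjacent examples whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                           ["No", "No", "Yes", "Yes", "Yes", "No"]))  # [54.0, 85.0]
```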
Dealing with Missing Values in C4.5
- During tree construction:
  – gain = gain computed on known cases × proportion of known cases
  – partitioning of training cases: weight according to the partition
- During tree testing:
  – weight according to the partition
  – take a weighted sum of the different predictions and choose the most likely one
  – better than assigning the most common value
Search and Bias in Decision-Tree Induction
- Inductive bias (preference bias): the induction prefers
  – smaller trees, and
  – trees that place high-information-gain attributes close to the root
  – no representation bias
- Search strategy: hill-climbing without backtracking; can get stuck in a local optimum
- Why prefer small hypotheses? Occam’s razor: prefer the simplest hypothesis
- Decision separators formed by decision trees
Avoid Overfitting in Classification
- A tree that is generated may overfit the training examples due to noise or a too-small set of training data.
- Two approaches to avoid overfitting:
  – (Stop earlier): stop growing the tree earlier.
  – (Post-prune): allow overfitting and then post-prune the tree.
- Approaches to determine the correct final tree size:
  – Separate training and testing sets, or use cross-validation.
  – Use all the data for training, but apply a statistical test to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution.
  – Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.
- Rule post-pruning (C4.5): convert the tree to rules, then prune.
Tree Pruning
- A decision tree constructed using the training data may have too many branches/leaf nodes.
  – caused by noise, overfitting
  – may result in poor accuracy for unseen samples
- Prune the tree: merge a subtree into a leaf node.
  – Use a set of data different from the training data.
  – At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it using the majority class.
- Issues:
  – obtaining the testing data
  – criteria other than accuracy (e.g., minimum description length)
Pruning Criterion
- Use a separate set of examples to evaluate the utility of post-pruning nodes from the tree
  – CART uses cost-complexity pruning
- Apply a statistical test to estimate whether expanding (or pruning) a particular node helps
  – C4.5 uses pessimistic pruning (error rate: upper limit of the binomial distribution)
- Minimum description length
  – SLIQ and SPRINT use MDL pruning
C4.5: Popular Decision Tree Learner
- Ross Quinlan, a machine learning researcher
- Improved (commercial) version: See5/C5
  – free demo version (max 200 cases): http://www.rulequest.com/
  – try it out!
- Also: Cognos Scenario demo
C4.5rules: Rules from the Tree, and Pruning
- For each leaf, form a rule by collecting all attribute-value pairs on the path from the root
- The rule set is equivalent to the tree
- Pruning:
  – for each condition in a rule, delete it if the rule is better without it, judged on a cross-validation set or by the estimated error on the same training set
  – this may greatly simplify the rule set
- Differences between the rule set and the tree:
  – format of the rules vs. the tree
  – unique decision vs. possible conflicts that need resolution
More to Think About…
- Different costs for errors
  – classification errors
  – a tree that minimizes the total loss
- Different costs in testing attributes
  – a tree that minimizes the cost of testing/using the tree
- Nodes that take combinations of attributes
  – logical combinations
  – linear combinations
Boosting Techniques (I)
- Boosting increases classification accuracy.
- It can be applied to decision trees or Bayesian classifiers.
- Learn a series of classifiers.
- Combine the classifiers by (weighted) voting.
- Boosting requires only a linear time and constant space increase.
Bagging
- Randomly sample with replacement, sample size N
- Build a decision tree from the sample
- Do this 10-20 times
- To predict: all trees predict; take the majority-voted prediction (see the sketch below)
- Improves unstable classifiers
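A bagging sketch following these steps; the `learn_tree` callable (which returns a tree that can itself be called on an example) is an assumed interface, not something given in the slides:

```python
import random
from collections import Counter

def bagging(examples, learn_tree, rounds=15, seed=0):
    """Build `rounds` trees on bootstrap samples of size N and vote their predictions."""
    rng = random.Random(seed)
    trees = [learn_tree([rng.choice(examples) for _ in range(len(examples))])  # sample with replacement
             for _ in range(rounds)]

    def predict(x):
        return Counter(tree(x) for tree in trees).most_common(1)[0][0]  # majority vote
    return predict
```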
Adaptive Boosting: AdaBoost
- Assign every example an equal weight 1/N
- For t = 1, 2, …, T:
  – obtain a hypothesis (classifier) h(t) under the weights w(t)
  – calculate the error of h(t) and re-weight the examples based on the error
  – normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (a sketch follows below)
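A simplified AdaBoost sketch following the steps above, for labels in {-1, +1}; the `weak_learner` interface (returning a hypothesis callable) is an assumption, not from the slides:

```python
import math

def adaboost(examples, labels, weak_learner, T=10):
    """weak_learner(examples, labels, weights) -> hypothesis h(x) in {-1, +1}."""
    N = len(examples)
    w = [1.0 / N] * N                                   # equal initial weights
    ensemble = []                                       # (alpha, hypothesis) pairs
    for _ in range(T):
        h = weak_learner(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)         # hypothesis weight from its error
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                    # normalize w(t+1) to sum to 1
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```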
Classification and Databases
- Classification is a classical problem extensively studied by
  – statisticians
  – AI, especially machine learning, researchers
- Database researchers re-examined the problem in the context of large databases
  – most previous studies used small data sets, and most algorithms are memory resident
- Recent data mining research contributes to
  – scalability
  – generalization-based classification
  – parallel and distributed processing
Classifying Large Datasets
- Decision trees seem to be a good choice
  – relatively faster learning speed than other classification methods
  – can be converted into simple and easy-to-understand classification rules
  – can be used to generate SQL queries for accessing databases
  – comparable classification accuracy with other methods
- Objective
  – classifying data sets with millions of examples and a few hundred, or even thousands of, attributes with reasonable speed
Scalable Decision Tree Methods
- Most algorithms assume the data can fit in memory.
- Data mining research contributes to the scalability issue, especially for decision trees.
- Successful examples:
  – SLIQ (EDBT’96 -- Mehta et al.’96)
  – SPRINT (VLDB’96 -- J. Shafer et al.’96)
  – PUBLIC (VLDB’98 -- Rastogi & Shim’98)
  – RainForest (VLDB’98 -- Gehrke et al.’98)
RainForest
- Gehrke, Ramakrishnan, and Ganti (VLDB’98)
- A generic algorithm that separates the scalability aspects from the criteria that determine the quality of the tree
- Based on two observations:
  – tree classifiers follow a greedy top-down induction schema
  – when evaluating each attribute, the information about the class label distribution is enough
- AVC-list (attribute, value, class label) data structure
Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al.’97)
- Classification at primitive concept levels
  – e.g., precise temperature, humidity, outlook, etc.
  – low-level concepts, scattered classes, bushy classification trees
  – semantic interpretation problems
- Cube-based multi-level classification
  – relevance analysis at multiple levels
  – information-gain analysis with dimension + level
Presentation of Classification Rules
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Bayesian Theorem
- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes’ theorem:

  $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

- MAP (maximum a posteriori) hypothesis:

  $$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

- Practical difficulty: requires initial knowledge of many probabilities; significant computational cost.
More on the Bayesian Theorem…
- p(h|D) = p(D|h) p(h) / p(D)
  – p(h) = p(h|_): prior probability before the data (evidence)
  – D: data, observation, evidence, facts, …
  – p(h|D): posterior probability after seeing (only) the evidence D
- Hard to calculate… but to compare two hypotheses it suffices to take the ratio:
  p(h1|D) / p(h2|D) = p(D|h1) p(h1) / (p(D|h2) p(h2))
A true equation underlying all causal modeling
- Diagnosis: medical, car, …
  p(+|cancer) = 0.98, so p(-|cancer) = 0.02
  p(+|no_cancer) = 0.03, so p(-|no_cancer) = 0.97
- If I test positive, how likely is it that I have cancer?
  p(cancer|+) / p(no_cancer|+) = p(+|cancer) p(cancer) / (p(+|no_cancer) p(no_cancer))
  = 0.98 p(cancer) / (0.03 p(no_cancer))
- If p(cancer) = 0.05, the ratio = 0.98*0.05 / (0.03*0.95) ≈ 1.72, so p(cancer|+) = 1.72/(1+1.72) ≈ 63%
- If p(cancer) = 0.001, the ratio = 0.98*0.001 / (0.03*0.999) ≈ 0.033, so p(cancer|+) = 0.033/1.033 ≈ 3.2%
- The prior probability matters a lot (a sketch of this calculation follows below)
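The same posterior-odds arithmetic as a tiny Python sketch (the function name is assumed; the test-accuracy numbers are the ones used on the slide):

```python
def posterior_cancer(prior, sensitivity=0.98, false_positive=0.03):
    """P(cancer | +) via the odds form of Bayes' theorem."""
    odds = (sensitivity * prior) / (false_positive * (1 - prior))
    return odds / (1 + odds)

print(posterior_cancer(0.05))    # ≈ 0.63
print(posterior_cancer(0.001))   # ≈ 0.032
```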
A true equation underlying all causal modeling
- Fortune tellers: are they real?
  p(long_life | born_in_may) / p(short_life | born_in_may)
  = p(born_in_may | long_life) p(long_life) / (p(born_in_may | short_life) p(short_life))
  = p(long_life) / p(short_life), since the birth month is independent of life span
- If fortune tellers want to maximize predictive accuracy, they just predict the most likely events given your age (and look, …)
More on the Bayesian Theorem…
- D is a set of “observations”: D = d1, d2, …

  $$P(h \mid d_1, d_2, \ldots) = \frac{P(d_1, d_2, \ldots \mid h)\,P(h)}{P(d_1, d_2, \ldots)}$$

- di not in D means not observed (unknown), not false
- D is continuously updated, and so is p(h|D)
- To calculate p(h|D), use p(D|h). Why is p(D|h) easier?
  – p(h|D) is the “diagnostic model”
  – p(D|h) is the “causal model”: easier to think about and to obtain
Naïve Bayes Classifier (I)
- A simplifying assumption: the attributes are conditionally independent given the class:

  $$P(C_j \mid V) \propto P(C_j)\prod_{i=1}^{n} P(v_i \mid C_j)$$

- Greatly reduces the computation cost: only count the class distribution.
- Example: p(play | Outlook=sunny, T=mild, H=high, W=true) ∝
  p(play) p(O=s|play) p(T=m|play) p(H=h|play) p(W=t|play)
  = 9/14 * 2/9 * … = ...
Project
Go directly to: www.csd.uwo.ca/faculty/ling/cs411a/411proj.html
Training Dataset
An example from Quinlan’s ID3 (repeated from earlier):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Naive Bayesian Classifier (II)
- Given a training set, we can compute the probabilities.
- Extremely efficient; the training data need not be kept in memory.

Attribute     Value     P    N
Outlook       sunny     2/9  3/5
              overcast  4/9  0
              rain      3/9  2/5
Temperature   hot       2/9  2/5
              mild      4/9  2/5
              cool      3/9  1/5
Humidity      high      3/9  4/5
              normal    6/9  1/5
Windy         true      3/9  3/5
              false     6/9  2/5
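A sketch that reproduces the earlier example, p(class | Outlook=sunny, T=mild, H=high, W=true), directly from the counts in this table; the dictionaries are simply a transcription of the table, and the variable names are illustrative:

```python
priors = {"P": 9/14, "N": 5/14}
cond = {  # P(value | class), read off the table above
    "P": {"sunny": 2/9, "mild": 4/9, "high": 3/9, "true": 3/9},
    "N": {"sunny": 3/5, "mild": 2/5, "high": 4/5, "true": 3/5},
}
evidence = ["sunny", "mild", "high", "true"]   # Outlook, Temperature, Humidity, Windy

scores = {}
for c in priors:
    score = priors[c]
    for v in evidence:
        score *= cond[c][v]          # naive Bayes: multiply the conditional probabilities
    scores[c] = score

print(scores)   # N gets the higher score, so this example is classified as N (don't play)
```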
Bayesian Belief Networks (I)
[Figure: a Bayesian belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks (II)
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
- Several cases of learning Bayesian belief networks:
  – given both the network structure and all the variables: easy
  – given the network structure but only some variables
  – when the network structure is not known in advance
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Neural Networks
- Advantages
  – prediction accuracy is generally high
  – robust: works when training examples contain errors
  – output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  – fast evaluation of the learned target function
- Criticism
  – long training time
  – difficult to understand the learned function (weights)
  – not easy to incorporate domain knowledge
- See also: ftp://ftp.sas.com/pub/neural/FAQ.html
A Neuron
[Figure: an n-dimensional input vector x = (x0, …, xn) with weight vector w = (w0, …, wn) feeds a weighted sum with bias -m_k, followed by an activation function f, producing the output y.]

- The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: y = f(Σ_i w_i x_i - m_k).
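A one-function sketch of this neuron, assuming a sigmoid activation (the slide leaves the activation f unspecified; all names here are illustrative):

```python
import math

def neuron(x, w, bias):
    """Single neuron: weighted sum of the inputs minus the bias, passed through a sigmoid."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1 / (1 + math.exp(-s))

print(neuron([1.0, 0.5], [0.2, -0.4], bias=0.1))  # output y in (0, 1)
```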
Multi-Layer Perceptron
[Figure: input vector (input nodes x_l) → hidden nodes (weights w_lm) → output nodes (weights v_m) → output vector.]

- Hidden node m computes $f\!\Big(\sum_{l=1}^{n} x_l\, w_{lm} - r_m\Big)$ from the input nodes.
- An output node applies $\sigma$ to the weighted sum $\sum_{m=1}^{h} v_m \cdot (\text{output of hidden node } m)$ of the hidden-node outputs.
- Activation functions: $\sigma(x) = \dfrac{1}{1+e^{-x}}$ and $f(x) := \dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$.
Models and Architectures
- Learning paradigms
  – classification
  – clustering
  – reinforcement
- Network topologies
  – feed-forward
  – limited recurrent
  – fully recurrent
- Learning algorithm
  – cross-validation may minimize the risk of overfitting
Learning Paradigms (I)
[Figure: a network maps inputs to an actual output.]
(1) Classification: adjust the weights using Error = Desired - Actual
(2) Reinforcement: adjust the weights using a reinforcement signal
Learning Paradigms (II): Clustering
(0) Inputs are presented
(1) Outputs compete to be the winner
(2) Adjust the weights of the winner toward the input pattern
Learning Algorithms
- Back propagation for classification
- Kohonen feature maps for clustering
- Recurrent back propagation for classification
- Radial basis functions for classification
- Adaptive resonance theory
- Probabilistic neural networks
Major Steps
- Constructing a network
  – input data representation
  – selection of the number of layers and the number of nodes in each layer
- Training the network using training data
- Pruning the network
- Interpreting the results
Constructing the Network (I): Input Data Representation
[Figure: how raw inputs become network inputs.]
- Categorical values (e.g., “doctor”): hashed or looked up to a number (e.g., 3), or coded -- 1-of-N (100), thermometer (111), binary (011)
- Discrete numeric values: normalized (e.g., 0.3)
- Continuous numeric values: thresholded or discretized
Constructing the Network (II)
- The number of input nodes corresponds to the dimensionality of the input tuples
- Thermometer coding (see the sketch below):
  – age 20-80: 6 intervals
  – [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111
- Number of hidden nodes: adjusted during training
- Number of output nodes: the number of classes
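A sketch of the thermometer coding for the age example above (the function name and the interval arithmetic are assumptions that reproduce the listed codes):

```python
def thermometer_code(age, low=20, high=80, bits=6):
    """Encode age 20-80 into 6 intervals; interval k sets the k low-order bits."""
    width = (high - low) // bits                 # 10-year intervals
    k = min(bits, max(1, (age - low) // width + 1))
    return format((1 << k) - 1, f"0{bits}b")

print(thermometer_code(25))  # '000001'
print(thermometer_code(35))  # '000011'
print(thermometer_code(79))  # '111111'
```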
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Other Classification Methods
- Genetic algorithms
- Instance-based methods
  – k-nearest neighbor classifier
  – case-based reasoning
- Fuzzy logic
Genetic Algorithm (I)
- GA: based on an analogy to biological evolution.
- Encode the problem/solution as string(s) of “genes”.
- A diverse population (pool) of competing hypotheses is maintained.
- At each iteration (generation), the strings are evaluated for how fit they are. The most fit members are selected to produce new offspring that replace the least fit ones.
- Hypotheses are encoded as strings that are combined by crossover operations, and subjected to random mutation, to produce offspring.
- Learning is viewed as a special case of optimization.
Genetic Algorithm (II)
- Rule: IF (level = doctor) AND (GPA = 3.6) THEN result = approval
- Encoding: level → 001, GPA → 111, result → 10, giving the string 00111110
- Crossover: combining 00111110 and 10001101 (single-point crossover after the third bit) produces the offspring 10011110 and 00101101
Instance-Based Methods
- Instance-based learning: store the training examples and delay the processing (“lazy evaluation”) until a new instance must be classified.
- Typical approaches:
  – k-nearest neighbor: instances represented as points in a Euclidean space
  – locally weighted regression: constructs a local approximation
  – case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function may be discrete- or real-valued.
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq (see the sketch below).
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: positive (+) and negative (-) training points around a query point xq, with the 1-NN decision regions.]
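A minimal k-NN sketch matching the description above (Euclidean distance, majority vote; the function name and the (point, label) representation are assumptions):

```python
from collections import Counter
from math import dist

def knn_classify(xq, training, k=3):
    """training: list of (point, label) pairs; return the most common label
    among the k training examples nearest (in Euclidean distance) to xq."""
    nearest = sorted(training, key=lambda ex: dist(xq, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```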
Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions: return the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors, e.g.

  $$w = \frac{1}{d(x_q, x_i)^2}$$

- Similarly, we can distance-weight the instances for real-valued target functions.
- Robust to noisy data by averaging the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
Locally Weighted Regression
- Construct an explicit approximation to f over a local region surrounding the query instance xq.
- Locally weighted linear regression: the target function f is approximated near xq using the linear function

  $$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$$

- Minimize the squared error with a distance-decreasing weight K:

  $$E(x_q) = \frac{1}{2}\sum_{x \in k\text{ nearest nbrs of } x_q} \big(f(x) - \hat{f}(x)\big)^2\, K\!\big(d(x_q, x)\big)$$

- The gradient descent training rule:

  $$\Delta w_j = \eta \sum_{x \in k\text{ nearest nbrs of } x_q} K\!\big(d(x_q, x)\big)\,\big(f(x) - \hat{f}(x)\big)\, a_j(x)$$

- In most cases, the target function is approximated by a constant, linear, or quadratic function.
Case-Based Reasoning
- Similarity (to k-nearest neighbors and locally weighted regression): lazy evaluation + analyzing similar instances.
- Difference: instances are not “points in a Euclidean space”.
- Example: the water faucet problem in CADET (Sycara et al.’92).
- Methodology:
  – instances represented by rich symbolic descriptions (e.g., function graphs)
  – multiple retrieved cases may be combined
  – tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Research issues:
  – indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
- Key differences:
  – a lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  – an eager method cannot, since it has already chosen its global approximation before seeing the query
- Efficiency: lazy methods spend less time training but more time predicting
- Accuracy:
  – lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  – eager methods must commit to a single hypothesis that covers the entire instance space
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data.
- One can only predict value ranges or category distributions.
- Method outline:
  – minimal generalization
  – attribute relevance analysis
  – generalized linear model construction
  – prediction
- Determine the major factors that influence the prediction:
  – data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis.
Regression Analysis and Log-Linear Models in Prediction
- Linear regression: Y = α + β X
  – the two parameters α and β specify the line and are estimated from the data at hand
  – use the least squares criterion on the known values Y1, Y2, …, X1, X2, … (see the sketch below)
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  – many nonlinear functions can be transformed into the above
- Log-linear models:
  – the multi-way table of joint probabilities is approximated by a product of lower-order tables
  – probability: p(a, b, c, d) = αab βac χad δbcd
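A sketch of the least-squares estimates for Y = α + βX, using the standard closed-form formulas (the function name is assumed, not from the slides):

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha and beta for y = alpha + beta * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

print(fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # roughly (0.1, 1.93)
```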
Prediction: Numerical Data
Prediction: Categorical Data
Conclusions
- Classification is an extensively studied problem (mainly in statistics, machine learning and neural networks).
- Classification is probably one of the most widely used data mining techniques, with a lot of applications.
- Scalability is still an important issue for database applications.
- Combining classification with database techniques should be a promising research topic.
- Research direction: classification of non-relational data, e.g., text, spatial, multimedia, etc.
Copyright by Jiawei Han, modified