Objectives
- Introduction
- What is Classification?
- Classification vs Prediction
- Supervised and Unsupervised Learning
- Data Preparation
- Classification Accuracy
- ID3 Algorithm
- Information Gain
- Bayesian Classification
- Predictive Modelling
Classification and Prediction
Lecture 5/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia
Introduction
- Databases are rich with hidden information that can be used for making intelligent business decisions.
- Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
- Classification predicts categorical labels. Ex: categorize bank loan applications → safe or risky.
- Prediction models continuous-valued functions. Ex: predict the expenditures of potential customers on computer equipment given their income and occupation.
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

What is Classification? – A Two-Step Process
- Model construction:
  - Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
  - Data tuples are also referred to as samples, examples, or objects.
  - All tuples used for construction are called the training set.
  - Since the class label of each training sample is provided → supervised learning. In clustering (unsupervised learning), the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance.
  - The model is represented in the following forms: classification rules (IF-THEN statements), decision trees, mathematical formulae.
What is Classification? – A Two-Step Process (2)
- The model is used for classifying future or unknown objects.
- First, the predictive accuracy of the model is estimated:
  - The known label of each test sample is compared with the classified result from the model.
  - The accuracy rate is the percentage of test set samples that are correctly classified by the model.
  - The test set is independent of the training set; otherwise over-fitting will occur (the model may have incorporated some particular anomalies of the training data that are not present in the overall sample population).
- If the accuracy of the model is considered acceptable, the model can be used to classify future objects for which the class label is not known (unknown, previously unseen data).

Classification Process (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Resulting model:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
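As a small illustration (not part of the original slides), the learned model above can be written directly as code. The sketch below is hypothetical; it simply applies the IF-THEN rule and checks it against the training tuples.

```python
# A minimal sketch: the classifier learned above, expressed as an IF-THEN rule.
def classify_tenured(rank: str, years: int) -> str:
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

training_data = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

# The rule classifies every training tuple correctly.
for _name, rank, years, label in training_data:
    assert classify_tenured(rank, years) == label
```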
Classification Process (2): Use the Model in Prediction
Testing Data → Classifier → predicted label

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?

What is Prediction?
- Prediction is similar to classification:
  - First, construct a model.
  - Second, use the model to predict future or unknown objects.
- The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification:
  - Classification refers to predicting a categorical class label.
  - Prediction refers to predicting a continuous value.
Classification vs Prediction
- Sending out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer → classification.
- Predict the number of major purchases that a customer will make during a fiscal year → prediction.

Supervised vs Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data are classified based on the training set.
- Unsupervised learning (clustering)
  - We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data.
  - There are no training data, or the "training data" are not accompanied by class labels.
Issues – Data Preparation
- Data preprocessing can be used to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
- Data Cleaning
  - Remove/reduce noise and treat missing values.
- Relevance Analysis
  - Many of the attributes in the data may be irrelevant to the classification or prediction task. Ex: data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application.
  - Other attributes may be redundant.
  - This step is known as feature selection.
- Data Transformation
  - Data can be generalized to higher-level concepts. This is useful for continuous-valued attributes: income can be generalized → low, medium, high; street → city.
  - Generalization compresses the original training data, so fewer input/output operations may be involved during learning.
  - When using neural networks (or other methods involving distance measurements), data may also be normalized.
Comparing Classification Methods
- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
  - efficiency in large databases (data that are not memory resident)
- Robustness
  - handling noise and missing values
- Interpretability
  - the level of understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - the compactness of classification rules

Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  - use two independent data sets, e.g., training set (2/3) and test set (1/3)
  - used for data sets with a large number of samples
- Cross-validation
  - divide the data set into k subsamples
  - use k-1 subsamples as training data and one subsample as test data → k-fold cross-validation
  - for data sets of moderate size
- Bootstrapping (leave-one-out)
  - for small data sets
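The k-fold procedure above is mechanical enough to sketch in a few lines. This is an illustrative sketch, not from the slides; it assumes you supply your own `train_fn` (builds a model from samples and labels) and `test_fn` (classifies one sample with a model).

```python
# A minimal sketch of k-fold cross-validation (assumed helper names).
def k_fold_indices(n_samples: int, k: int):
    """Split sample indices into k roughly equal folds."""
    fold_size, remainder = divmod(n_samples, k)
    indices = list(range(n_samples))
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

def cross_validate(samples, labels, train_fn, test_fn, k=10):
    """Use k-1 folds as training data and one fold as test data, k times."""
    folds = k_fold_indices(len(samples), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        model = train_fn([samples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        correct = sum(test_fn(model, samples[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k  # estimated accuracy rate
```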
What is a Decision Tree?
- A decision tree is a flow-chart-like tree structure.
  - An internal node denotes a test on an attribute.
  - A branch represents an outcome of the test. All tuples in a branch have the same value for the tested attribute.
  - A leaf node represents a class label or class label distribution.
- To classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample.
- Decision trees can easily be converted to classification rules.

Training Dataset
An example from Quinlan's ID3:
Outlook  | Temperature | Humidity | Windy | Class
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N
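For the code sketches later in this lecture, the training set can be transcribed directly into Python records (the variable name `weather_samples` is ours, not from the slides):

```python
# The training dataset above as Python records, used by later sketches.
weather_samples = [
    {"outlook": o, "temperature": t, "humidity": h, "windy": w, "class": c}
    for o, t, h, w, c in [
        ("sunny", "hot", "high", False, "N"),
        ("sunny", "hot", "high", True, "N"),
        ("overcast", "hot", "high", False, "P"),
        ("rain", "mild", "high", False, "P"),
        ("rain", "cool", "normal", False, "P"),
        ("rain", "cool", "normal", True, "N"),
        ("overcast", "cool", "normal", True, "P"),
        ("sunny", "mild", "high", False, "N"),
        ("sunny", "cool", "normal", False, "P"),
        ("rain", "mild", "normal", False, "P"),
        ("sunny", "mild", "normal", True, "P"),
        ("overcast", "mild", "high", True, "P"),
        ("overcast", "hot", "normal", False, "P"),
        ("rain", "mild", "high", True, "N"),
    ]
]
```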
A Sample Decision Tree
The tree induced from the training dataset above tests outlook at the root:
- outlook = sunny → test humidity: high → N; normal → P
- outlook = overcast → P
- outlook = rain → test windy: true → N; false → P

Decision Tree Classification Methods
- The basic top-down decision tree generation approach usually consists of two phases:
  - Tree construction
    - At the start, all the training examples are at the root.
    - Partition the examples recursively based on selected attributes.
  - Tree pruning
    - Aims at removing tree branches that may lead to errors when classifying test data (the training data may contain noise, outliers, …).

ID3 Algorithm
- All attributes are categorical.
- Create a node N;
- if the samples are all of the same class C, then return N as a leaf node labeled with C;
- if the attribute-list is empty, then return N as a leaf node labeled with the most common class;
- select the split-attribute with the highest information gain and label N with it;
- for each value Ai of the split-attribute, grow a branch from node N;
- let Si be the set of tuples in the branch with value Ai for the split-attribute;
  - if Si is empty, then attach a leaf labeled with the most common class;
  - else recursively run the algorithm at node Si;
- until all branches reach leaf nodes.

Choosing Split Attribute – Information Gain (ID3/C4.5) (1)
- Assume all attributes are categorical (discrete-valued). Continuous-valued attributes must be discretized.
- Information gain is used to select the test attribute at each node in the tree.
- It is also called a measure of the goodness of split.
- The attribute with the highest information gain is chosen as the test attribute for the current node.
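The pseudocode above maps onto a short recursive function. The following is a sketch under stated assumptions, not the original authors' code: samples are dicts with a "class" key (like `weather_samples` earlier), and `information_gain(samples, attr)` is the helper sketched after the information-gain slides below.

```python
# A minimal ID3 sketch (illustrative). information_gain(samples, attr) is
# defined in the sketch that follows the information-gain slides below.
from collections import Counter

def id3(samples, attributes):
    classes = [s["class"] for s in samples]
    if len(set(classes)) == 1:             # samples are all of the same class C
        return classes[0]
    if not attributes:                     # attribute-list is empty
        return Counter(classes).most_common(1)[0][0]
    # select the split-attribute with the highest information gain
    split = max(attributes, key=lambda a: information_gain(samples, a))
    remaining = [a for a in attributes if a != split]
    children = {}
    for value in sorted({s[split] for s in samples}):  # one branch per value Ai
        subset = [s for s in samples if s[split] == value]
        children[value] = id3(subset, remaining)       # subset is never empty here
    return (split, children)

# e.g., id3(weather_samples, ["outlook", "temperature", "humidity", "windy"])
# yields a tree with outlook at the root, matching the sample tree above.
```

Because the sketch only branches on values that actually occur in the samples, the "Si is empty" case of the pseudocode never arises here.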
Information Gain (ID3/C4.5) (2)
- Assume that there are two classes, P and N.
- Let the set of examples S contain p elements of class P and n elements of class N.
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Information Gain (ID3/C4.5) (3)
- Assume that using attribute A as the root in the tree will partition S into sets {S1, S2, …, Sv}.
- If Si contains pi examples of P and ni examples of N, the expected information needed to classify objects in all subtrees Si is

  E(A) = Σ_{i=1..v} ((pi + ni) / (p + n)) · I(pi, ni)

- The attribute A is selected such that the information gain

  gain(A) = I(p, n) - E(A)

  is maximal, that is, E(A) is minimal, since I(p, n) is the same for all attributes at a node.
- In the given sample data, attribute outlook is chosen to split at the root:
  gain(outlook) = 0.246, gain(temperature) = 0.029, gain(humidity) = 0.151, gain(windy) = 0.048

Information Gain (ID3/C4.5) (4)
- See Table 7.1. The class label is buys_computer, with two values: YES and NO. So m = 2; C1 corresponds to yes, C2 to no. There are 9 samples of class yes and 5 samples of class no.
- Compute the expected information needed to classify a given sample:

  I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

- Next, compute the entropy of each attribute. Let's start with the attribute age:
  - For age = "<=30": s11 = 2, s21 = 3, I(s11, s21) = 0.971
  - For age = "31..40": s12 = 4, s22 = 0, I(s12, s22) = 0
  - For age = ">40": s13 = 3, s23 = 2, I(s13, s23) = 0.971
- Using equation (7.2), the expected information needed to classify a given sample if the samples are partitioned by age is

  E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22) + (5/14) I(s13, s23) = 0.694

- Hence, the gain in information from such a partitioning is

  Gain(age) = I(s1, s2) - E(age) = 0.246

- Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
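These formulas are easy to check numerically. The sketch below (helper names are ours, not from the slides) reproduces I(9, 5), E(age), and Gain(age) from the worked example, and can also score the attributes of the `weather_samples` list transcribed earlier.

```python
# A sketch of I(p, n), E(A), and gain(A) for two-class data (illustrative names).
from math import log2

def info(p: int, n: int) -> float:
    """I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)  # 0*log2(0) = 0

def information_gain(samples, attr):
    """gain(A) = I(p, n) - E(A), for samples whose 'class' key is 'P' or 'N'."""
    p = sum(s["class"] == "P" for s in samples)
    n = len(samples) - p
    e = 0.0
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        pi = sum(s["class"] == "P" for s in subset)
        e += len(subset) / len(samples) * info(pi, len(subset) - pi)
    return info(p, n) - e

print(round(info(9, 5), 3))   # 0.940, as in the buys_computer example
e_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)
print(round(e_age, 3))        # 0.694
print(round(information_gain(weather_samples, "outlook"), 3))  # ≈ 0.247
```

The unrounded Gain(age) is 0.940 - 0.694 ≈ 0.247; the slides' 0.246 comes from rounding the intermediate values first.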
How to Use a Tree?
- Directly
  - Test the attribute values of the unknown sample against the tree.
  - A path is traced from the root to a leaf, which holds the label.
- Indirectly
  - The decision tree is converted to classification rules.
  - One rule is created for each path from the root to a leaf.
  - IF-THEN rules are easier for humans to understand.
  - Example: IF age = "<=30" AND student = "no" THEN buys_computer = "no"

Tree Pruning
- A decision tree constructed from the training data may have too many branches/leaf nodes.
  - Caused by noise and overfitting.
  - May result in poor accuracy for unseen samples.
- Prune the tree: merge a subtree into a leaf node.
  - Use a set of data different from the training data.
  - At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it using the majority class.
- Pruning criteria:
  - Pessimistic pruning: C4.5
  - MDL: SLIQ and SPRINT
  - Cost-complexity pruning: CART

Classification and Databases
- Classification is a classical problem extensively studied by
  - statisticians
  - AI, especially machine learning, researchers
- Database researchers re-examined the problem in the context of large databases.
  - Most previous studies used small data sets, and most algorithms are memory resident.
- Recent data mining research contributes to
  - scalability
  - generalization-based classification
  - parallel and distributed processing

Classifying Large Datasets
- Goal: classifying data sets with millions of examples and a few hundred, even thousands of, attributes with reasonable speed.
- Decision trees seem to be a good choice:
  - relatively faster learning speed than other classification methods
  - can be converted into simple and easy-to-understand classification rules
  - can be used to generate SQL queries for accessing databases
  - comparable classification accuracy with other methods
Scalable Decision Tree Methods
- Data mining research contributes to the scalability issue, especially for decision trees.
- Successful examples:
  - SLIQ (EDBT'96 -- Mehta et al. '96)
  - SPRINT (VLDB'96 -- J. Shafer et al. '96)
  - PUBLIC (VLDB'98 -- Rastogi & Shim '98)
  - RainForest (VLDB'98 -- Gehrke et al. '98)

Previous Efforts on Scalability
- Most algorithms assume the data can fit in memory.
- Incremental tree construction (Quinlan '86)
  - uses partial data to build a tree
  - tests the other examples, and the misclassified ones are used to rebuild the tree interactively
- Data reduction (Catlett '91)
  - reduces data size by sampling and discretization
  - still a main-memory algorithm
- Data partition and merge (Chan and Stolfo '91)
  - partitions the data and builds a tree for each partition
  - merges the multiple trees into a combined tree
  - experimental results indicated reduced classification accuracy

Presentation of Classification Rules
(figure not reproduced)
Other Classification Methods
- Bayesian Classification
- Neural Networks
- Genetic Algorithm
- Rough Set Approach
- k-Nearest Neighbor Classifier
- Case-Based Reasoning (CBR)
- Fuzzy Logic
- Support Vector Machine (SVM)

Bayesian Classification
- Bayesian classifiers are statistical classifiers.
- They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
- Bayesian classification is based on Bayes theorem.
- The naive Bayesian classifier is comparable in performance with decision tree and neural network classifiers.
- Bayesian classifiers also have high accuracy and speed when applied to large databases.

Bayes Theorem (1)
- Let X be a data sample whose class label is unknown.
- Let H be some hypothesis, such as that the data sample X belongs to a specified class C.
- We want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.
- P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
- Example: suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.

Bayes Theorem (2)
- P(H) is the prior probability, or a priori probability, of H. Ex: the probability that any given data sample is an apple, regardless of how the data sample looks.
- The posterior probability is based on more information (such as background knowledge) than the prior probability, which is independent of X.
- Bayes theorem is:

  P(H|X) = P(X|H) P(H) / P(X)

- See Example 7.4 for an example of naive Bayesian classification.

Predictive Modeling in Databases
- What if we would like to predict a continuous value, rather than a categorical label?
- Prediction of continuous values can be modeled by the statistical techniques of regression.
- Examples:
  - A model to predict the salary of college graduates with 10 years of work experience.
  - Potential sales of a new product given its price.
- Many problems can be solved by linear regression.
- Software packages for solving regression problems: SAS, SPSS, S-Plus.
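Example 7.4 itself is not reproduced in the slides, but the idea behind the naive Bayesian classifier can be sketched briefly. The sketch below is illustrative, not the book's example: it picks the class C maximizing P(X|C)·P(C), since the denominator P(X) of Bayes theorem is the same for every class, and it "naively" assumes the attributes are conditionally independent given the class.

```python
# An illustrative naive Bayes sketch for categorical attributes (assumed names).
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Estimate P(C) and P(attr=value | C) by counting over the training samples."""
    priors = Counter(s["class"] for s in samples)
    cond = defaultdict(Counter)            # (class, attr) -> value counts
    for s in samples:
        for attr, value in s.items():
            if attr != "class":
                cond[(s["class"], attr)][value] += 1
    return priors, cond, len(samples)

def nb_classify(model, x):
    """Return the class maximizing P(C) * prod over attributes of P(x_i | C)."""
    priors, cond, n = model
    best, best_score = None, -1.0
    for c, c_count in priors.items():
        score = c_count / n                            # P(C)
        for attr, value in x.items():
            score *= cond[(c, attr)][value] / c_count  # P(attr=value | C)
        if score > best_score:
            best, best_score = c, score
    return best

# e.g., with the weather data from the decision tree slides:
model = train_naive_bayes(weather_samples)
print(nb_classify(model, {"outlook": "sunny", "temperature": "cool",
                          "humidity": "high", "windy": True}))  # 'N'
```

A fuller version would add Laplace smoothing so that an attribute value unseen for some class does not zero out the whole product.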
Linear Regression
- Data are modeled using a straight line.
- The simplest form of regression.
- Bivariate linear regression models a random variable Y (called a response variable) as a linear function of another random variable X (called a predictor variable):

  Y = α + β X

- See Example 7.6 for an example of linear regression.
- Other regression models:
  - Multiple regression
  - Log-linear models

Prediction: Numerical Data
(figure not reproduced)

Prediction: Categorical Data
(figure not reproduced)
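Example 7.6 is not reproduced here, but the standard least-squares estimates for Y = α + βX are simple enough to sketch. The data below are made up for illustration, not taken from the book.

```python
# A minimal least-squares sketch for bivariate linear regression (illustrative).
def fit_line(xs, ys):
    """Return (alpha, beta) minimizing the squared error of Y = alpha + beta*X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Made-up data: years of work experience vs. salary (in $1000s).
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
alpha, beta = fit_line(years, salary)
print(alpha + beta * 10)   # predicted salary for 10 years of experience
```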
Conclusion
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks).
- Classification is probably one of the most widely used data mining techniques, with a lot of applications.
- Scalability is still an important issue for database applications.
- Combining classification with database techniques should be a promising research topic.
- Research direction: classification of non-relational data, e.g., text, spatial, multimedia, etc.
References
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
- U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.

References (2)
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
- S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, Aug. 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, Sept. 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.