4、Classification and Prediction (6hrs)
4.1 What is classification? What is prediction?
4.2 Issues regarding classification and prediction
4.3 Classification by decision tree induction
4.4 Bayesian classification
4.5 Classification by back propagation
4.6 Support Vector Machines (SVM)
4.7 Prediction
4.8 Accuracy and error measures
4.9 Model selection
4.10 Summary
Key Points: Definitions; Bayesian Classification; Decision Trees
Notes: More details on related algorithms.
Q&A:
1. What is the difference between classification and prediction?
Answer:
Classification is the process of finding a set of models (or functions) that describe
and distinguish data classes or concepts, for the purpose of being able to use the model
to predict the class of objects whose class label is unknown. In prediction, rather than
predicting class labels, the main interest (usually) is missing or unavailable data values.
(Han & Kamber)
So, although classification is actually the step of finding the models, the goal of
both methods is to predict something about unknown data objects. The difference is
that in classification that “something” is the class of objects, whereas in prediction it
is the missing data values.
2. Briefly outline the major steps of decision tree classification.
Answer:
The major steps are as follows:
• The tree starts as a single root node containing all of the training tuples.
• If the tuples are all from the same class, then the node becomes a leaf, labeled with that class.
• Else, an attribute selection method is called to determine the splitting criterion. Such a method may use a heuristic or statistical measure (e.g., information gain or Gini index) to select the "best" way to separate the tuples into individual classes. The splitting criterion consists of a splitting attribute and may also indicate either a split-point or a splitting subset, as described below.
• Next, the node is labeled with the splitting criterion, which serves as a test at the node. A branch is grown from the node to each of the outcomes of the splitting criterion, and the tuples are partitioned accordingly.
• The algorithm recurses to create a decision tree for the tuples at each partition.
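To make these steps concrete, here is a minimal sketch of the recursion for categorical attributes, assuming information gain as the attribute selection measure; the function and variable names (e.g. build_tree) are illustrative only, not from the text.

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # Expected reduction in entropy when splitting on a categorical attribute.
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    def build_tree(rows, labels, attrs):
        # Leaf: all tuples are from the same class.
        if len(set(labels)) == 1:
            return labels[0]
        # Leaf: no attributes left to split on; label with the majority class.
        if not attrs:
            return Counter(labels).most_common(1)[0][0]
        # Attribute selection: choose the "best" splitting attribute.
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        # Partition the tuples and grow one branch per outcome.
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[best], ([], []))
            parts[row[best]][0].append(row)
            parts[row[best]][1].append(y)
        rest = [a for a in attrs if a != best]
        return {"attr": best,
                "branches": {v: build_tree(r, l, rest) for v, (r, l) in parts.items()}}

    rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
            {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
    labels = ["play", "stay", "play", "play"]
    print(build_tree(rows, labels, ["outlook", "windy"]))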
3. Why is tree pruning useful in decision tree induction? What is
a drawback of using a separate set of tuples to evaluate pruning?
Answer:
The decision tree built may overfit the training data. There could be too many branches, some of which may reflect anomalies in the training data due to noise or outliers. Tree pruning addresses this issue of overfitting the data by removing the least reliable branches (using statistical measures). This generally results in a more compact and reliable decision tree that is faster and more accurate in its classification of data.
The drawback of using a separate set of tuples to evaluate pruning is that it may not be representative of the training tuples used to create the original decision tree. If the separate set of tuples is skewed, then using it to evaluate the pruned tree would not be a good indicator of the pruned tree's classification accuracy. Furthermore, using a separate set of tuples to evaluate pruning means there are fewer tuples to use for creation and testing of the tree. While this is considered a drawback in machine learning, it may not be so in data mining due to the availability of larger data sets.
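As one concrete illustration of this trade-off, the sketch below uses scikit-learn's cost-complexity pruning (an assumed choice; the question does not prescribe a pruning method) together with a held-out set of tuples to decide how far to prune, which leaves fewer tuples for growing the tree.

    # Post-pruning evaluated on a separate set of tuples. Cost-complexity pruning
    # in scikit-learn is used here only as one concrete pruning method.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    # Holding out an evaluation set leaves fewer tuples for growing the tree
    # (the drawback discussed above), and the held-out set may not be representative.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # Candidate pruning strengths: larger alpha removes more branches.
    alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

    best_alpha, best_score = 0.0, 0.0
    for alpha in alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = pruned.score(X_val, y_val)   # evaluate the pruning on the separate tuples
        if score >= best_score:
            best_alpha, best_score = alpha, score

    print(best_alpha, best_score)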
4. Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What advantage does (a) have over (b)?
Answer:
If pruning a subtree, we would remove the subtree completely with method (b). However, with method (a), if pruning a rule, we may remove any precondition of it. The latter is less restrictive.
5. It is important to calculate the worst-case computational complexity of the decision tree algorithm. Given data set D, the number of attributes n, and the number of training tuples |D|, show that the computational cost of growing a tree is at most n × |D| × log(|D|).
Answer:
The worst-case scenario occurs when we have to use as many attributes as possible before being able to classify each group of tuples. The maximum depth of the tree is log(|D|). At each level we will have to compute the attribute selection measure O(n) times (one per attribute). The total number of tuples on each level is |D| (adding over all the partitions). Thus, the computation per level of the tree is O(n × |D|). Summing over all of the levels we obtain O(n × |D| × log(|D|)).
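The same count can be written as a single summation over the levels of the tree (O(n) selection-measure evaluations per level, |D| tuples per level):

    \text{cost} \;=\; \sum_{\ell=1}^{\log|D|} O(n)\cdot|D| \;=\; O\!\left(n \times |D| \times \log|D|\right)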
6. Given a 5 GB data set with 50 attributes (each containing 100
distinct values) and 512 MB of main memory in your
laptop,
outline an efficient method that constructs decision trees in
such large data sets.
Justify your answer by a rough calculation of your main memory usage.
Answer:
We will use the RainForest algorithm for this problem.
Assume there are C class labels. The most memory required will be for the AVC-set of the root of the tree. To compute the AVC-set of the root node, we scan the database once and construct the AVC-list for each of the 50 attributes. The size of each AVC-list is 100 × C. The total size of the AVC-set for the root is then 100 × C × 50, which will easily fit into 512 MB of memory for a reasonable C. The computation of the other AVC-sets is done in a similar way, but they will be smaller because there will be fewer attributes available. To reduce the number of scans, we can compute the AVC-sets for nodes at the same level of the tree in parallel. With such small AVC-sets per node, we can probably fit the whole level in memory.
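The rough memory calculation can be spelled out as below; the number of class labels C and the counter size are assumptions chosen purely for illustration.

    # Rough memory estimate for the root's AVC-set in RainForest.
    # C and the 4-byte counter size are assumed; the data set fixes the rest.
    n_attributes = 50
    distinct_values = 100
    C = 10                      # assumed number of class labels
    bytes_per_counter = 4       # assumed size of one count

    avc_list_entries = distinct_values * C              # one AVC-list per attribute: 100 x C
    avc_set_entries = avc_list_entries * n_attributes   # AVC-set of the root: 100 x C x 50
    avc_set_bytes = avc_set_entries * bytes_per_counter

    print(avc_set_entries)               # 50000 counters
    print(avc_set_bytes / 2**20, "MB")   # about 0.19 MB, far below 512 MB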
7. Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).
Answer:
Eager classification is faster at classification than lazy classification because it
constructs a generalization model before receiving any new tuples to classify.
Weights can be assigned to attributes, which can improve classification accuracy.
Disadvantages of eager classification are that it must commit to a single hypothesis
that covers the entire instance space, which can decrease classification accuracy, and
that more time is needed for training.
Lazy classification uses a richer hypothesis space, which can improve
classification accuracy.
It requires less time for training than eager classification.
A disadvantage of lazy classification is that all training tuples need to be stored,
which leads to expensive storage costs and requires efficient indexing techniques.
Another disadvantage is that it is slower at classification because classifiers are not
built until new tuples need to
be classified.
Furthermore, attributes are all
equally weighted, which can decrease classification accuracy.
(Problems may arise
due to irrelevant attributes in the data.)
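A small timing sketch of this speed trade-off, using scikit-learn classifiers purely as stand-ins for "eager" and "lazy" (the exact numbers will vary):

    # Eager classifiers pay at training time, lazy classifiers at classification time.
    import time
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    X_train, y_train, X_new = X[:15000], y[:15000], X[15000:]

    for name, clf in [("eager (decision tree)", DecisionTreeClassifier()),
                      ("lazy (k-nearest neighbor)", KNeighborsClassifier())]:
        t0 = time.time()
        clf.fit(X_train, y_train)   # eager: generalization model built here; lazy: tuples stored
        t1 = time.time()
        clf.predict(X_new)          # eager: cheap tree traversal; lazy: work deferred to here
        t2 = time.time()
        print(f"{name}: training {t1 - t0:.3f}s, classification {t2 - t1:.3f}s")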
8. What is association-based classification? Why is association-based
classification able to achieve higher classification accuracy than a
classical decision-tree method? Explain how association-based
classification can be used for text document classification.
Answer:
Association-based classification is a method where association rules are generated and analyzed for use in classification. We first search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Using such strong associations, we classify new examples.
Association-based classification can achieve higher accuracy than a classical decision tree because it overcomes the constraint of decision trees, which consider only one attribute at a time, and because it uses very high-confidence rules that combine multiple attributes.
For text document classification, we can model each document as a transaction containing items that correspond to terms. (We can preprocess the data to do stemming and remove stop words.) We also add the document class to the transaction. We then find frequent patterns and output rules of the form term1, term2, ..., termk → classi [sup = 0.1, conf = 0.9]. When a new document arrives for classification, we can apply the rule with the highest support and confidence that matches the document, or apply a combination of rules as in CMAR.
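A toy sketch of the idea on a few invented documents is given below; the brute-force mining of size-1 and size-2 term sets and the thresholds are illustrative only (a real system would use an Apriori or FP-growth style miner and CMAR-style rule combination).

    # Toy association-based text classification: mine (term-set -> class) rules
    # above support/confidence thresholds, then classify a new document by the
    # highest-confidence matching rule.
    from itertools import combinations
    from collections import Counter

    docs = [({"cheap", "viagra", "offer"}, "spam"),
            ({"meeting", "schedule", "offer"}, "ham"),
            ({"cheap", "offer", "click"}, "spam"),
            ({"project", "meeting", "notes"}, "ham")]

    MIN_SUP, MIN_CONF = 0.25, 0.8
    n = len(docs)

    rules = []
    for k in (1, 2):   # frequent term-sets of size 1 and 2 (toy setting)
        counts, class_counts = Counter(), Counter()
        for terms, label in docs:
            for itemset in combinations(sorted(terms), k):
                counts[itemset] += 1
                class_counts[(itemset, label)] += 1
        for (itemset, label), c in class_counts.items():
            sup, conf = c / n, c / counts[itemset]
            if sup >= MIN_SUP and conf >= MIN_CONF:
                rules.append((set(itemset), label, sup, conf))

    def classify(new_doc_terms):
        # Apply the matching rule with the highest confidence (then support).
        matches = [r for r in rules if r[0] <= new_doc_terms]
        return max(matches, key=lambda r: (r[3], r[2]))[1] if matches else None

    print(classify({"cheap", "offer", "now"}))   # -> "spam"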
9. The support vector machine (SVM) is a highly accurate classification method. However, SVM classifiers suffer from slow processing when training with a large set of data tuples. Discuss how to overcome this difficulty and develop a scalable SVM algorithm for efficient SVM classification in large datasets.
Answer:
We can use the micro-clustering technique described in "Classifying large data sets using SVM with hierarchical clusters" by Yu, Yang, and Han, in Proc. 2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03), pages 306-315, Aug. 2003 [YYH03]. A Cluster-Based SVM (CB-SVM) method is described as follows:
1. Construct the microclusters using a CF-tree (Chapter 7).
2. Train an SVM on the centroids of the microclusters.
3. Decluster entries near the boundary.
4. Repeat the SVM training with the additional entries.
5. Repeat the above until convergence.
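A simplified sketch of the idea (not the exact CB-SVM algorithm): scikit-learn's Birch stands in for the CF-tree, and a single decluster-and-retrain pass replaces the repeat-until-convergence loop.

    # Simplified sketch of the CB-SVM idea only, not the published algorithm.
    import numpy as np
    from sklearn.cluster import Birch
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=20000, n_features=10, random_state=0)

    # 1. Build micro-clusters for each class (Birch plays the role of the CF-tree).
    centroids, centroid_labels = [], []
    for label in np.unique(y):
        birch = Birch(threshold=1.5, n_clusters=None).fit(X[y == label])
        centroids.append(birch.subcluster_centers_)
        centroid_labels.append(np.full(len(birch.subcluster_centers_), label))
    centroids = np.vstack(centroids)
    centroid_labels = np.concatenate(centroid_labels)

    # 2. Train an SVM on the micro-cluster centroids only.
    svm = SVC(kernel="linear").fit(centroids, centroid_labels)

    # 3. "Decluster": recover the original tuples that fall near the current boundary.
    near_boundary = np.abs(svm.decision_function(X)) < 1.0

    # 4. Repeat the SVM training with those additional entries.
    svm = SVC(kernel="linear").fit(
        np.vstack([centroids, X[near_boundary]]),
        np.concatenate([centroid_labels, y[near_boundary]]))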
10. What is boosting ? State why it may improve the accuracy of
decision tree induction.
Answer:
Boosting is a technique used to help improve classifier accuracy. We are given a set S of s tuples. For iteration t, where t = 1, 2, . . . , T, a training set St is sampled with replacement from S. Assign weights to the tuples within that training set. Create a classifier, Ct, from St. After Ct is created, update the weights of the tuples so that the tuples causing classification error will have a greater probability of being selected for the next classifier constructed, Ct+1. This will help improve the accuracy of the next classifier, Ct+1. Using this technique, each classifier should have greater accuracy than its predecessor.
The final boosting classifier combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function
of its accuracy.
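A minimal sketch of this loop is given below; the particular weight-update rule follows AdaBoost and is an assumed choice, since the description above does not fix one.

    # Minimal boosting loop: sample with replacement by tuple weights, train a
    # weak classifier, then increase the weights of misclassified tuples.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    s, T = len(X), 10
    weights = np.full(s, 1.0 / s)            # tuple weights over S
    classifiers, alphas = [], []

    rng = np.random.default_rng(0)
    for t in range(T):
        # Sample S_t with replacement from S according to the tuple weights.
        idx = rng.choice(s, size=s, replace=True, p=weights)
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        err = np.average(clf.predict(X) != y, weights=weights)
        if err >= 0.5:                       # no better than chance; try another sample
            continue
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
        # Tuples causing classification error get a greater chance of being
        # selected for the next classifier C_{t+1}.
        weights = weights * np.exp(alpha * (clf.predict(X) != y))
        weights = weights / weights.sum()
        classifiers.append(clf)
        alphas.append(alpha)

    def boosted_predict(X_new):
        # Final classifier: weighted vote, each vote weighted as a function of accuracy.
        votes = sum(a * np.where(c.predict(X_new) == 1, 1, -1)
                    for c, a in zip(classifiers, alphas))
        return (votes > 0).astype(int)

    print((boosted_predict(X) == y).mean())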