Supervised Learning: Classification
Classification and Learning
Classification is the process of assigning objects, items, or concepts to classes or categories of the same type.
This is synonymous with grouping, indexing, categorization, taxonomy, etc.
A classifier is a tool that performs this classification.
Classification is one important tool of data mining. What knowledge can we extract from a given set of data? Given data $D$, can we assert $D_1 \Rightarrow M_1, \ldots, D_i \Rightarrow M_i$? How do we learn these mappings? In what other forms could knowledge be extracted from the given data?
[Diagram: Data Mining branches into Classification and Clustering]
Classification: What type? What class? What
group? -- A labeling process
Clustering: partitioning data into similar groupings. A cluster is a grouping of 'similar' items!
Clustering is a process that partitions a set of
objects into equivalence classes.
Classification is used extensively in
■ marketing
■ healthcare outcomes
■ fraud detection
■ homeland security
■ investment analysis
■ automatic website and image classifications
Classification enables prioritization and filtering, and it supports keyword search.
A typical example.
Data: a relational database of tuples about emails passing through a port.
Each tuple = <sender, recipient, date, size>.
Classify each mail as either authentic or junk.
Data preparation before data mining:
► Raw data to be mined is typically noisy and contains many unwanted attributes, etc.
► Discretization of continuous data.
► Data normalization to the [-1, +1] or [0, 1] range (a small sketch follows this list).
► Data smoothing to reduce noise, removal of outliers, etc.
► Relevance analysis: feature selection to retain only the relevant features.
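A minimal sketch of two of these preparation steps (min-max normalization and equal-width discretization), assuming plain Python lists of numeric values; the email-size column and the bin count are illustrative assumptions, not part of the original notes.

# Min-max normalization and equal-width discretization (illustrative sketch).
def min_max_normalize(values, lo=0.0, hi=1.0):
    """Rescale a list of numbers to the [lo, hi] range."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:                       # avoid division by zero
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]

def discretize(values, n_bins=3):
    """Map each continuous value to an equal-width bin index 0 .. n_bins-1."""
    v_min, v_max = min(values), max(values)
    width = (v_max - v_min) / n_bins or 1.0
    return [min(int((v - v_min) / width), n_bins - 1) for v in values]

sizes = [120, 4500, 87, 930, 15000]          # e.g. email sizes in KB
print(min_max_normalize(sizes))              # values scaled to [0, 1]
print(discretize(sizes, n_bins=3))           # bin labels 0, 1, or 2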
Process of classification:
► Model construction
  • Each tuple belongs to a predefined class and is given a class label.
  • The set of all tuples thus offered is called the training set.
  • The resulting model is expressed as
    ** classification rules (IF-THEN statements)
    ** a decision tree
    ** a mathematical formula
► Model evaluation
  • Estimate accuracy on the test set: compare the known label of each test sample with the computed label and compute the percentage of error. Ensure the test set is disjoint from the training set (a small evaluation sketch follows this list).
► Implement the model
  • Classify unseen objects:
    ** assign a label to a new tuple
    ** predict the value of an attribute
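A minimal sketch of the evaluation step, assuming a trained classify function and labeled tuples held out from training; the function name, tuple layout, and the toy size-based rule are illustrative assumptions.

# Compare predicted labels with known labels on a held-out test set.
def evaluate(classify, test_set):
    """test_set: list of (features, true_label) pairs kept out of training."""
    errors = sum(1 for features, true_label in test_set
                 if classify(features) != true_label)
    return errors / len(test_set)

# Toy usage: a "classifier" that labels large mails as junk.
classify = lambda mail: "junk" if mail["size"] > 1000 else "authentic"
test_set = [({"size": 50}, "authentic"), ({"size": 5000}, "junk"),
            ({"size": 800}, "junk")]
print(f"error rate: {evaluate(classify, test_set):.2f}")   # 0.33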
[Diagram: Training Data → Classification Algorithm → Classifier Model]
Example rules produced by such a classifier model (a code sketch follows):
Rule 1: IF (term = "cough") && (term =~ "chest X-ray") THEN ignore;
Rule 2: IF (temp >= 103) && (bp = 180/100) THEN malaria || pneumonia;
Rule 3: IF (term = "general pain") && (term = "LBC") THEN infection;
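A minimal sketch of how such IF-THEN rules might be applied in code; the field names, thresholds, and returned labels mirror the illustrative rules above and are assumptions, not a real diagnostic system.

# Apply the illustrative IF-THEN rules to one record.
def classify_record(rec):
    terms = rec.get("terms", set())
    if "cough" in terms and "chest X-ray" in terms:
        return "ignore"
    if rec.get("temp", 0) >= 103 and rec.get("bp") == "180/100":
        return "malaria || pneumonia"
    if "general pain" in terms and "LBC" in terms:
        return "infection"
    return "no rule fired"

print(classify_record({"terms": {"general pain", "LBC"}}))   # infection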



Different techniques from statistics, information retrieval, and data mining are used for classification. These include:
■ Bayesian methods
■ Bayesian belief networks
■ Decision trees
■ Neural networks
■ Associative classifier
■ Emerging patterns
■ Support vector machines.
Combinations of these are used as well. The basic approach to constructing a classification model:
if one or several attributes or features $a_i \in A$ occur together in more than one itemset (data sample, target data) assigned the topic $T$, then output a rule
$$\text{Rule}: \; a_i \wedge a_j \wedge \ldots \wedge a_m \Rightarrow T$$
or
$$\text{Rule}: \; \prod_{t \geq 1} P(a_t \mid X) \geq \text{thres} \;\Rightarrow\; T$$
Examples: Naïve Bayes classifiers.
Best for classifying texts, documents, ...
Major drawback: the unrealistic independence assumption among individual items.
The basic issue here: do we accept a document $d$ into class $C$? If we do, what is the penalty for misclassification?
For a good mail classifier, a junk mail should be assigned the "junk" label with very high probability, because the cost of assigning that label to a good mail is very high.
The probability that a document $d_i$ belongs to topic $C_j$ is computed by Bayes' rule
$$P(C_j \mid d_i) = \frac{P(d_i \mid C_j)\, P(C_j)}{P(d_i)} \quad \ldots (1)$$
Define the prior odds on $C_j$ as
$$O(C_j) = \frac{P(C_j)}{1 - P(C_j)} \quad \ldots (2)$$
Then Bayes' equation gives us the posterior odds
$$O(C_j \mid d_i) = O(C_j)\, \frac{P(d_i \mid C_j)}{P(d_i \mid \bar{C}_j)} = O(C_j)\, L(d_i \mid C_j) \quad \ldots (3)$$
where $L(d_i \mid C_j)$ is the likelihood ratio. This is one way we could use the classifier to yield a posterior estimate for a document.
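A small worked illustration of (2) and (3), with hypothetical numbers chosen only to show the arithmetic:
$$O(C_j) = \frac{0.2}{1 - 0.2} = 0.25, \qquad O(C_j \mid d_i) = O(C_j)\, L(d_i \mid C_j) = 0.25 \times 6 = 1.5, \qquad \text{so } P(C_j \mid d_i) = \frac{1.5}{1 + 1.5} = 0.6,$$
assuming a prior $P(C_j) = 0.2$ and a likelihood ratio $L(d_i \mid C_j) = 6$.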
Another way would be to go back to (1). Here
$$P(C_j \mid d_i) = \frac{P(d_i \mid C_j)\, P(C_j)}{P(d_i)}, \quad \text{with } P(C_j) = \frac{n_d(C_j)}{|D|} \quad \ldots (4)$$
where $|D|$ is the total number of documents in the database and $n_d(C_j)$ is the number of documents in class $C_j$. The following outline is from Julia Itskevitch's work [1].

[1] Julia Itskevitch, "Automatic Hierarchical E-mail Classification Using Association Rules", M.Sc. thesis, Computing Science, Simon Fraser University, July ...
www-sal.cs.uiuc.edu/~hanj/pubs/theses.html
Multi-variate Bernoulli model
Assumption: each document is a set of terms (keywords, etc.) $t \in \mathcal{T}$. Either a specific term is present or it is absent (we are not interested in its count, or in its position in the document).
In this model,
$$P(d_i \mid C_j) = \prod_{t \in \mathcal{T}} \Big[\, \delta_{it}\, P(t \mid C_j) + (1 - \delta_{it})\,\big(1 - P(t \mid C_j)\big) \Big] \quad \ldots (5)$$
where $\delta_{it} = 1$ if $t \in d_i$, and zero otherwise.
If we use (5), the term probabilities are estimated as
$$P(t \mid C_j) = \frac{1 + n_d(C_j, t)}{2 + n_d(C_j)} \quad \ldots (6)$$
where $n_d(C_j, t)$ is the number of documents in class $C_j$ that contain term $t$.
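A minimal sketch of the multi-variate Bernoulli model with the smoothed estimate (6), assuming documents are given as sets of terms; the tiny corpus and all names are illustrative.

# Multi-variate Bernoulli Naive Bayes with the Laplace-smoothed estimate (6).
from collections import defaultdict

def train(docs):
    """docs: list of (set_of_terms, class_label). Returns counts and vocabulary."""
    n_docs = defaultdict(int)                  # n_d(C_j)
    n_docs_with_term = defaultdict(int)        # n_d(C_j, t)
    vocab = set()
    for terms, label in docs:
        n_docs[label] += 1
        vocab |= terms
        for t in terms:
            n_docs_with_term[(label, t)] += 1
    return n_docs, n_docs_with_term, vocab

def p_term(t, c, n_docs, n_docs_with_term):
    return (1 + n_docs_with_term[(c, t)]) / (2 + n_docs[c])      # eq. (6)

def p_doc_given_class(terms, c, n_docs, n_docs_with_term, vocab):
    p = 1.0
    for t in vocab:                            # eq. (5): term present or absent
        pt = p_term(t, c, n_docs, n_docs_with_term)
        p *= pt if t in terms else (1 - pt)
    return p

docs = [({"cheap", "pills"}, "junk"), ({"meeting", "agenda"}, "authentic"),
        ({"cheap", "offer"}, "junk")]
n_docs, n_dt, vocab = train(docs)
print(p_doc_given_class({"cheap"}, "junk", n_docs, n_dt, vocab))
print(p_doc_given_class({"cheap"}, "authentic", n_docs, n_dt, vocab))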
An alternative to this would be a term-count model, as outlined below.
Here, for every term present we count its frequency of occurrence; positional effects are ignored. In that case,
$$P(d_i \mid C_j) = \text{const} \cdot \prod_{t} \frac{P(t \mid C_j)^{N_{it}}}{N_{it}!} \quad \ldots (7)$$
where $N_{it}$ is the number of occurrences of term $t$ in document $d_i$.
Jason Rennie's ifile naïve Bayesian approach outlines a multinomial model [2].
[2] Reference: http://cbbrowne.com/info/mail.html#IFILE
Every new item considered is allowed to change the frequency counts dynamically. Frequent terms are kept; infrequent terms are abandoned if their count $< \log_2(\text{age}) - 1$, where age is the total time (space) elapsed since first encounter.
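A minimal sketch of that pruning rule, assuming age is measured in documents seen since the term was first encountered; the statistics and threshold interpretation are illustrative assumptions.

# Drop a term when its count falls below log2(age) - 1.
import math

def prune(term_stats, now):
    """term_stats: {term: (count, first_seen_doc_index)}. Returns kept terms."""
    kept = {}
    for term, (count, first_seen) in term_stats.items():
        age = max(now - first_seen, 1)
        if count >= math.log2(age) - 1:        # keep frequent-enough terms
            kept[term] = (count, first_seen)
    return kept

stats = {"cheap": (12, 0), "agenda": (1, 0), "zebra": (1, 98)}
print(prune(stats, now=100).keys())            # 'agenda' is dropped; 'cheap' and 'zebra' are kept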
ID3 Classifier. (Quinlan 1983)
A decision-tree approach. Suppose the problem domain is a feature space $\{a_i\}$. Should we use every feature to discriminate? Isn't there a possibility that some feature is more important (more revealing) than others and therefore should be used more heavily?
e.g. TB or ~TB case.
Training space: Three features: (Coughing, Temp,
and Chest-pain). Possible values over a vector of
features:
Coughing (yes, no)
Temp (hi, med, lo)
Chest-pain (yes, no)
Case   Description (Coughing, Temp, Chest-pain, Class)
1.     (yes, hi, yes, T)
2.     (no, hi, yes, T)
3.     (yes, lo, yes, ~T)
4.     (yes, med, yes, T)
5.     (no, lo, no, ~T)
6.     (yes, med, no, ~T)
7.     (no, hi, yes, T)
8.     (yes, lo, no, ~T)
Consider the feature "Coughing". Just on this
feature, the training set splits into two groups: a
"yes" group, and a "no" group. The decision tree
on this feature appears as:
Coughing = Yes:
1. (yes, hi, yes, T)
3. (yes, lo, yes, ~T)
4. (yes, med, yes, T)
6. (yes, med, no, ~T)
8. (yes, lo, no, ~T)
Coughing = No:
2. (no, hi, yes, T)
5. (no, lo, no, ~T)
7. (no, hi, yes, T)
Entropy of the overall system before further discrimination based on the "Coughing" feature:
$$T = -\tfrac{4}{8}\log_2\tfrac{4}{8} - \tfrac{4}{8}\log_2\tfrac{4}{8} = 1 \text{ bit}$$
Entropy of the (yes | coughing) branch, with 2 positive and 3 negative cases:
$$T_{\text{yes}|\text{coughing}} = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.9710 \text{ bit}$$
Similarly, the entropy of the (no | coughing) branch, with 2 positives and 1 negative, gives us:
$$T_{\text{no}|\text{coughing}} = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3} = 0.9183 \text{ bit}$$
Entropy of the combined branches (weighted by their probabilities):
$$T_{\text{coughing}} = \tfrac{5}{8}\, T_{\text{yes}|\text{coughing}} + \tfrac{3}{8}\, T_{\text{no}|\text{coughing}} = 0.951 \text{ bit}$$
Therefore, the information gained by testing the "Coughing" feature is $T - T_{\text{coughing}} = 0.049$ bit.
ID3 yields a strategy of testing attributes in succession to discriminate on the feature space. It resolves which combination of features one should test, and in what order, to determine class membership.
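A minimal sketch of the entropy and information-gain computation, using the eight training cases tabulated above; only the feature-index convention is an added assumption.

# Entropy and information gain for the "Coughing" split on the eight cases.
from math import log2

# (coughing, temp, chest_pain, label)
cases = [("yes", "hi", "yes", "T"),  ("no", "hi", "yes", "T"),
         ("yes", "lo", "yes", "~T"), ("yes", "med", "yes", "T"),
         ("no", "lo", "no", "~T"),   ("yes", "med", "no", "~T"),
         ("no", "hi", "yes", "T"),   ("yes", "lo", "no", "~T")]

def entropy(examples):
    n = len(examples)
    pos = sum(1 for e in examples if e[-1] == "T") / n
    return sum(-p * log2(p) for p in (pos, 1 - pos) if p > 0)

def info_gain(examples, feature_index):
    total = entropy(examples)
    remainder = 0.0
    for v in {e[feature_index] for e in examples}:
        branch = [e for e in examples if e[feature_index] == v]
        remainder += len(branch) / len(examples) * entropy(branch)
    return total - remainder

print(round(info_gain(cases, 0), 3))   # Coughing: about 0.049 bit
print(round(info_gain(cases, 1), 3))   # Temp
print(round(info_gain(cases, 2), 3))   # Chest-pain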
Learning discriminants: Generalized Perceptron.
Widrow-Hoff Algorithm (1960).
Adaline. (Adaptive Linear neuron that learns via
Widrow-Hoff Algorithm)
Given an object with components $x_i$ on a feature space indexed by $i$, a neuron is a unit that receives the object components and processes them as follows.
1. Each neuron in a NN has a set of links to receive weighted input. Each link $i$ receives its input $x_i$, weighs it by $\omega_i$, and sends it to the adder to be summed. The adder produces
$$u = \sum_{j} \omega_j x_j$$
2. The output $y = \varphi(u + b)$ is produced by the activation function; the neuron fires its output when the weighted sum exceeds the activation threshold.
The choice of $\varphi(\cdot)$ determines the neuron model.
Step function: $\varphi(v) = 1$ if $v \geq b$, and $0$ otherwise.
Sigmoid function: $\varphi(v) = \dfrac{1}{1 + \exp(-(\alpha v + \beta))}$
Gaussian function: $\varphi(v) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\dfrac{1}{2}\left(\dfrac{v - \mu}{\sigma}\right)^{2}\right)$
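A minimal sketch of these three activation choices; the threshold, slope, offset, mean, and width parameters are illustrative assumptions.

# Step, sigmoid, and Gaussian activation functions (illustrative parameters).
import math

def step(v, b=0.0):
    return 1.0 if v >= b else 0.0

def sigmoid(v, alpha=1.0, beta=0.0):
    return 1.0 / (1.0 + math.exp(-(alpha * v + beta)))

def gaussian(v, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

for v in (-2.0, 0.0, 2.0):
    print(v, step(v), round(sigmoid(v), 3), round(gaussian(v), 3))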
We first consider a single-neuron system; this can be generalized to a more complex system.
The single-neuron Perceptron is a single-neuron system that has a step-function activation unit:
$$\varphi(v) = +1 \text{ if } v \geq 0, \text{ and } -1 \text{ otherwise.}$$
This is used for binary classification. Given training vectors and two classes $C_1$ and $C_2$: if the output $\varphi(v) = 1$, assign class $C_1$ to the vector; otherwise assign class $C_2$.
To train the system is equivalent to adjusting the weights associated with its links. How do we adjust the weights?
1. k = 1
2. Get $\omega_k$. The initial weights could be randomly chosen in (0, 1).
3. While there are misclassified training examples:
$$\omega_{k+1} = \omega_k + \eta\, \delta\, x$$
where $\eta$ is the learning-rate parameter and $\delta$ is the error (the difference between the desired output and the computed output for $\omega_k \cdot x$).
Since the correction to the weights can be expressed as $\Delta\omega = \eta\, \delta\, x$, the rule is known as the delta rule.
A perceptron can only model linearly separable functions such as AND, OR, and NOT; it cannot model XOR. (A small training sketch follows.)
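A minimal perceptron training sketch under the delta rule above, learning the (linearly separable) AND function; the learning rate, epoch limit, and bias handling are illustrative assumptions.

# Perceptron / delta-rule training for logical AND with +/-1 targets.
import random

def sign(v):
    return 1 if v >= 0 else -1

# The last input component is a constant 1, so the bias is learned as a weight.
data = [((0, 0, 1), -1), ((0, 1, 1), -1), ((1, 0, 1), -1), ((1, 1, 1), 1)]

w = [random.uniform(0, 1) for _ in range(3)]   # initial weights in (0, 1)
eta = 0.1                                      # learning rate
for epoch in range(100):
    misclassified = 0
    for x, d in data:
        y = sign(sum(wi * xi for wi, xi in zip(w, x)))
        if y != d:
            misclassified += 1
            w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]   # delta rule
    if misclassified == 0:
        break

print([sign(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data])  # [-1, -1, -1, 1]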
Generalized $\delta$-rule for the semilinear feedforward net with backpropagation of error.
[Diagram: an input pattern feeds the input layer (nodes $i$), which connects to the hidden layer (nodes $j$) through weights $w_{ji}$; the hidden layer connects to the output layer (nodes $k$) through weights $w_{kj}$.]
The net input to a node in layer $j$ is
$$\text{net}_j = \sum_i w_{ji}\, o_i \quad \ldots (1)$$
The output of a node $j$ is
$$o_j = f(\text{net}_j) = \frac{1}{1 + e^{-(\text{net}_j + \theta_j)/\theta_0}} \quad \ldots (2)$$
This is the nonlinear sigmoid activation function; it tells us how the hidden-layer nodes fire, if they fire at all.
The input to the nodes of layer $k$ (here the output layer) is
$$\text{net}_k = \sum_j w_{kj}\, o_j \quad \ldots (3)$$
and the output of layer $k$ is
$$o_k = f(\text{net}_k) \quad \ldots (4)$$
In the learning phase a number of training samples are introduced sequentially.
Let $x_p = \{i_{pi}\}$ be one such input object. On seeing it, the net adjusts the weights on its links. The output pattern $\{o_{pk}\}$ might be different from the ideal pattern $\{t_{pk}\}$. The network's link-weight adjustment strategy is to
"adjust the link weights so that the net squared error
$$E = \frac{1}{2P} \sum_{k} (t_{pk} - o_{pk})^{2}$$
is minimized."
For convenience, we omit the subscript $p$ and ask for what changes in the weights
$$E = \frac{1}{2} \sum_{k} (t_k - o_k)^{2} \quad \ldots (5)$$
is minimized.
We attempt to do so by gradient descent to the minimum. That is,
$$\Delta w_{kj} = -\eta\, \frac{\partial E}{\partial w_{kj}} \quad \ldots (6)$$
Now
$$\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial \text{net}_k}\, \frac{\partial \text{net}_k}{\partial w_{kj}}$$
and
$$\frac{\partial \text{net}_k}{\partial w_{kj}} = \frac{\partial}{\partial w_{kj}} \sum_{j} w_{kj}\, o_j = o_j$$
Let us compute
$$\delta_k = -\frac{\partial E}{\partial \text{net}_k} = -\frac{\partial E}{\partial o_k}\, \frac{\partial o_k}{\partial \text{net}_k} \quad \ldots (7)$$
According to (5),
$$\frac{\partial E}{\partial o_k} = -(t_k - o_k) \quad \ldots (8)$$
and
$$\frac{\partial o_k}{\partial \text{net}_k} = f_k'(\text{net}_k) \quad \ldots (9)$$
So that eqn (6) can now be expressed as
$$\Delta w_{kj} = \eta\, (t_k - o_k)\, f_k'(\text{net}_k)\, o_j = \eta\, \delta_k\, o_j \quad \ldots (10a)$$
Similarly, the weight adjustment for the hidden-layer links is
$$\Delta w_{ji} = -\eta\, \frac{\partial E}{\partial w_{ji}} = -\eta\, \frac{\partial E}{\partial \text{net}_j}\, \frac{\partial \text{net}_j}{\partial w_{ji}} = -\eta\, \frac{\partial E}{\partial \text{net}_j}\, o_i = -\eta\, o_i\, \frac{\partial E}{\partial o_j}\, f_j'(\text{net}_j) = \eta\, o_i\, \delta_j$$
But $\partial E / \partial o_j$ cannot be computed directly. Instead, we express it in terms of the known output-layer quantities:
$$\frac{\partial E}{\partial o_j} = \sum_{k} \frac{\partial E}{\partial \text{net}_k}\, \frac{\partial \text{net}_k}{\partial o_j} = \sum_{k} \frac{\partial E}{\partial \text{net}_k}\, \frac{\partial}{\partial o_j} \sum_{m} w_{km}\, o_m = -\sum_{k} \delta_k\, w_{kj}$$
Thus, it implies
$$\delta_j = f_j'(\text{net}_j) \sum_{k} \delta_k\, w_{kj}$$
In other words, the deltas at the hidden nodes can be evaluated from the deltas at the output layer.
Note that given $o_j = f(\text{net}_j) = \dfrac{1}{1 + e^{-(\text{net}_j + \theta_j)/\theta_0}}$, we have
$$\frac{\partial o_j}{\partial \text{net}_j} = o_j\,(1 - o_j)$$
This results in the following delta rules for the output layer and the hidden layer respectively:
$$\delta_{pk} = (t_{pk} - o_{pk})\, o_{pk}\,(1 - o_{pk})$$
and
$$\delta_{pj} = o_{pj}\,(1 - o_{pj}) \sum_{k} \delta_{pk}\, w_{kj}$$
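A minimal numerical sketch of one backpropagation step using these delta rules, for a tiny two-input, two-hidden, one-output network; the weights, input, target, and learning rate are illustrative assumptions.

# One backpropagation update:
#   delta_k = (t_k - o_k) o_k (1 - o_k)            at the output layer
#   delta_j = o_j (1 - o_j) sum_k delta_k w_kj     at the hidden layer
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [0.5, -0.3]                       # input pattern
w_ji = [[0.1, 0.4], [-0.2, 0.3]]      # hidden weights, one row per hidden node j
w_kj = [0.6, -0.5]                    # output weights, one per hidden node
t = 1.0                               # target output
eta = 0.5                             # learning rate

# Forward pass
o_j = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_ji]
o_k = sigmoid(sum(w * oj for w, oj in zip(w_kj, o_j)))

# Backward pass: deltas
delta_k = (t - o_k) * o_k * (1 - o_k)
delta_j = [oj * (1 - oj) * delta_k * wkj for oj, wkj in zip(o_j, w_kj)]

# Weight updates: Delta w = eta * delta * (input feeding that link)
w_kj = [wkj + eta * delta_k * oj for wkj, oj in zip(w_kj, o_j)]
w_ji = [[w + eta * dj * xi for w, xi in zip(row, x)]
        for row, dj in zip(w_ji, delta_j)]

print(round(o_k, 4), [round(w, 4) for w in w_kj])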