Download 09sede_all_NNC_3 - NDSU Computer Science

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia, lookup

K-nearest neighbors algorithm wikipedia, lookup

Transcript
The Universality of Nearest Neighbor Sets
in
Classification and Prediction
Dr. William Perrizo, Dr. Gregory Wettstein, Dr. Amal Shehan Perera
and Tingda Lu
Computer Science
North Dakota State University
Fargo, ND 58108 USA
Example: Medical Expert System (Ask a Nurse)
Symptoms plus past diagnoses are collected into a table called CASES.
For each undiagnosed new_symptoms,CASES is searched for matches:
SELECT DIAGNOSIS
FROM CASES
WHERE CASES.SYMPTOMS = new_symptoms
If there is a predominant DIAGNOSIS,
then report it
elseIf there's no predominant DIAGNOSIS,
then Classify instead of Query
i.e.,
find the fuzzy matches (near neighbors)
SELECT DIAGNOSIS
FROM CASES
WHERE CASES.SYMPTOMS ≅ new_symptoms
else call your doctor in the morning
Almost universally, Decision Making is, consulting a database
of past expert decisions (your own or that of other experts))
Past similar decisions are collected into a table called CASES (either explicitly or in
one’s head). For each new decision CASE, search for descriptive feature matches (or
near matches). Decide based on the predominant case. Sometimes this is called
“CASE-BASED REASONING”.
SELECT CASE
FROM CASES
WHERE CASES.DESCRIPTIVE_FEATURES = [or ~=] new_descriptive_features
If there is a predominant CASE,
then report it
elseIf there's no predominant CASE,
then Classify instead of Query
i.e.,
find the fuzzy matches (near neighbors)
SELECT DIAGNOSIS
FROM CASES
WHERE CASES.DESCR_FEATURES = [or ~=] new_descr_features
else make default decision.
Near Neighbor Classification
Given a (large) TRAINING SET T(A1, ..., An, C) with
FEATURES A=(A1,...,An), C
CLASS,
C, and
Classification of an unclassified sample, (a1, ..., an) is just:
SELECT Max (Count (T.Ci))
FROM
T
WHERE T.A1=a1 AND T.A2=a2 ... AND T.An=an
GROUP BY T.C;
i.e., It is just a SELECTION, since C-Classification is assigning to (a1, ..., an) the
most frequent C-value in RA=(a1, ..., an).
But, if the EQUALITY SELECTION is empty, then we need a FUZZY QUERY to
find NEAR NEIGHBORs instead of exact matches.
That's Nearest Neighbor Classification (NNC).
Based on the definition of “Near”
Essentially all classification and prediction algorithms are nearest
neighbour vote based algorithms.
Data mining
• Data mining, has 3 general methodologies for extracting
information and knowledge from data.
– Rule Mining: Strong antecedent consequent relationships
among the subsets of the column attributes.
– Classification & Prediction: Discovering signatures for the
individual values in a specified column (class attribute) or
attribute from values of the other attributes (feature attributes).
– Clustering: using some notion of tuple similarity to group
together training table rows so that within a group (a cluster)
there is high.
Prediction and Classification
The Classification and Prediction problem is a very interesting data mining
problem.
The problem is to predict a class label based on past (assumed correct)
prediction activities.
Typically the training datasets of past predictions are extremely large (which is
good for accuracy but bad for speed of prediction).
Immediately one runs into the famous problems,
the curse of cardinality and
the curse of dimensionality.
The curse of cardinality if the number of horizontal records in a training file
is very large, standard vertical processing of horizontally record structures can
take an unacceptably long time (e.g., if there are millions or billions of horizontal
records to scan); and the gold standard method prediction/classification method,
k-Nearest Neighbor, will not yield the most accurate result unless a second scan
is made. Thus the curse of cardinality is both a time curse and an accuracy
curse.
The curse of cardinality
The curse of cardinality is a very serious problem for Near Neighbor Classifiers
(NNCs) of horizontal data (speed and accuracy).
NNCs for horizontally structured data require a scan of the entire training
set to determine the [k] nearest neighbors
Take records 1,2,...,k as the initial k Nearest Nbr set (kNN set)
Get the k+1st record. If it is closer than any one in the kNN set, replace.
Get k+2nd. If it is closer than any one in the kNN set, replace.
Get k+3rd ... ……)
With horizontally structured data and only one vertical scan, the best one can do is
determine one of many “k nearest neighbor sets” i.e., there may be many training
points that are just as near as the kth one selected).
Our solution is to use a vertical data organization so that one horizontal
scan yields the Closed KNN set (CKNN) immediately (all neighbors within a
given distance).
Cost is dependent on the number of attributes
This ameliorates the Curse of Cardinality problem considerably
(PA-KDD 2003 paper).
Curse of Cardinality - 2
With horizontally structured data, he only way to get a fair classification vote in which
all neighbors at a given distance get equal vote, is to make a second vertical scan of
the entire training set, which is expensive.
For that reason, most kNNC implementations on horizontally structured data disregard
the other neighbors at the same distance from the unclassified sample, not because
the k voters suffice, but because it is too expensive to find the other neighbors at that
same distance and to enfranchise them for voting as well.
Of course, if the training set is such that any neighbor gives a representative vote for
all neighbors at that same distance, then kNN is just as good as Closed kNN.
But that would have to mean that all neighbors at the same distance have the same
class, which means the classes are concentric rings around the unclassified sample.
If that is known, no sophisticated analysis is required to classify.
So, we solve the curse of cardinality (speed and accuracy) by using vertical data.
However, it is more common to ”solve” the curse of cardinality by resorting to
model-based classification (in which, first, a compact model is built to represent the
training set information (the training phase) then that closed form, compact model is
used over and over again to classify and predict (the prediction phase).
k-Nearest Neighbor Classification (kNNC) and
closed-k-Nearest Neighbor Classification (ckNNC)
1) Select a suitable value for k
2) Determine a suitable distance or similarity notion (definition of near)
3) Find the k nearest neighbor set [closed] of the unclassified sample.
4) Find the plurality class in the nearest neighbor set.
5) Assign the plurality class as the predicted class of the sample
T
Let T be the unclassified point or sample. Using Euclidean distance and k = 3:
Find the 3 closest neighbors. Move out from T until ≥ 3 neighbors are found.
That's
216 !(more than
3NN arbitrarily
select
3 !)
one point from this boundary line as the 3rd
nearest neighbor, whereas, C3NN includes all points on this boundary
line.
closed k Nearest Neighbor Classification
(ckNNC)
Often yields much higher classification accuracy than traditional kNNC.
At what additional cost?
Actually, it can be at a negative additional cost
i.e., It can be
faster
AND
more accurate
when using vertical data organization (because it yields the closed nearest
neighbor set with just one horizontal scan of the vertical data structures).
The NEXT SLIDE describes our Ptree vertical data organization which facilitates
this faster and more accurate ckNNC.)
Predicate tree technology: vertically project each attribute,
then vertically project each bit position of each attribute,
then compress each bit slice into a basic Ptree.
e.g., compression of R11 into P11 goes as follows:
Current practice: Structure data into
horizontal records. Process vertically (scans)
Base 2
Base 10
R(A1
Horizontally
structured
records
Scanned
vertically
2
6
3
2
3
2
7
7
R[A1] R[A2] R[A3] R[A4]
A2 A 3 A4 )
7
7
7
7
2
2
0
0
6
6
5
5
1
1
1
1
1
0
1
7
4
5
4
4
=
010
011
010
010
011
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
R11
0
0
0
0
0
0
1
1
pure1? false=0
pure1? true=1
Top-down construction of
the 1-dimensional Ptree
representation of R11,
denoted, P11, is built by
recording the truth of the
universal predicate pure 1
in a tree recursively on
halves (1/21 subsets), until
purity is achieved.
pure1? false=0 pure1? false=0
pure1? false=0
1. Whole is pure1? false  0
2. Left half pure1? false  0
3. Right half pure1? false  0
4. Left half of rt half ? false0
5. Rt half of right half? true1
010
011
010
010
011
010
111
111
111
111
110
111
010
010
000
000
110
110
101
101
001
001
001
001
001
000
001
111
100
101
100
100
R11 R12 R13 R21 R22 R23 R31 R32 R33
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
1
0
1
0
0
1
0
1
1
1
1
1
1
0
0
0
0
1
1
1
1
1
1
0
0
1
1
0
1
0
0
0
0
1
1
1
1
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
1
1
1
1
R41 R42 R43
0
0
0
1
1
1
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
1
0
0
Horizontally AND basic Ptrees
But it is pure
(pure0) so this
branch ends
P11 P12 P13
P11
0
0
0
01
1
0
0
0
0 0
0
0
01
1
P21 P22 P23 P31 P32 P33 P41 P42 P43
0
0
0
0
0
0
0
0
0
0
0 0 1 0 1 0
0 01 0 0 0 0 1 0 1 0 0 0 0
10 10
10 01
00 00
0001
0100
^ 10
^
^
^ 01
^
^
^ 10
01
01
01
01
Curse of Dimensionality
Curse of dimensionality:
As files get very wide (have many columns)
Intuition breaks down
The critical notion of “near” ceases to work in Near Neighbor Classification.
Why?
To what limit does the volume of a unit disk go, as n -->  ?
For n=1, the volume is
For n=2, the volume is π r2 or
2
(length of line from -1 to 1).
~3.1416
For n=3, the volume is 4/3 π r3 ~4.1888, etc.
Intuition might tell us
That volume is heading toward  as n goes to .
Or that it will top out asymptotically at some number > 4.1888.
The volume of the unit disk goes to 0 as n --> ,
Reaching its maximum at dimension = 5.
-1
+1
Therefore, there are almost no [Euclidean
distance] Nbrs in high Dimensions
2
-1
21=2
+1
2
The volume of the unit disk goes to 0 as n --> , with max at dim = 5.
But the volume of the unit cube goes to  as n --> .
2
22=4
This is not intuitive to most people
It has consequences in NNC.
This tells us that there's a lot of volume in a high dimensional unit cube, but,
in its inscribed unit disk, there is almost nothing (no “space”, no volume, and
23=8
therefore no points (relatively speaking). ).
E.g, there are lots of potential near neighbors in a high dimensional unit
cube. Thus, when using, e.g., L1 or Hamming distance to define
neighborhoods, there are plenty of neighbors, but there are almost no near
neighbors in a high dimensional unit disk when using Euclidean distance.
Our solution: use cubes, not disks as neighborhoods (this is typically the
distance of choice in Genomics, Bioinformatics and BioMedical Informatics).
Ptree vertical technology facilitates the construction and analysis of
Cube neighborhoods.
16
32
64
.
.
.
2n
.
.
.

All classification is Near
Neighbor Classification ?
NNC approach takes into consideration typical training data set size
characteristics, which cause:
cardinality (curse of cardinality)
dimension (curse of dimensionality)
Model-based classifiers are [often less accurate, in general, and are really]
Near Neighbor Classifiers using an alternate idea of “near”.
We conclude that the improved speed of CNNC using the Ptree vertical
approach a good choice since
It retains (improves) the speed.
It retains (improves) the accuracy of the Near Neighbor approach.
Near Neighbor Classification
Given a (large) TRAINING SET T(A1, ..., An, C) with
FEATURES A=(A1,...,An), C
CLASS,
C, and
Classification (in psuedo-SQL) of an unclassified sample, (a1, ..., an) is just:
SELECT Max (Count (T.Ci))
FROM
T
WHERE T.A1=a1 AND T.A2=a2 ... AND T.An=an
GROUP BY T.C;
i.e., It is just a SELECTION, since C-Classification is assigning to (a1, ..., an) the
most frequent [neighboring] C-value in RA=(a1, ..., an).
If the EQUALITY SELECTION is empty, then we need a FUZZY QUERY to find
NEAR NEIGHBORs instead of exact matches.
That's Nearest Neighbor Classification (NNC).
Nearest Neighbor Classification (sample-based)
and
Eager Classification (model-based)
Given a TRAINING SET, R(A1,..,An, C), with C = CLASSES and
(A1,...,An)=FEATURES
Nearest Neighbor Classification (NNC)
Selecting a set of R-tuples with similar features
Letting the corresponding class values vote.
Nearest Neighbor Classification won't work very well if
The vote is inconclusive (close to a tie) or
if similar (near) is not well defined,
Then we build a MODEL of TRAINING SET
(at, possibly, great 1-time [build phase] expense?)
When a MODEL is built - Eager classification uses the model to assign class.
Model-less methods like Nearest Neighbor - Lazy or Sample-based.
Eager Classification (model-based)
Eager Classifiers models examples
Decision trees, Probabilistic models (Bayesian Classifier, Neural
Networks, Support Vector Machines, etc.)
How do you decide when an EAGER model is good enough to use?
How do you decide if a Nearest Neighbor Classifier is working well
enough?
We have a TEST PHASE.
Typically, we set aside some training tuples as a Test Set
(Test tuples cannot be used in model building or and cannot be used
as nearest neighbors).
If the classifier passes the test (a high enough % of Test tuples are
correctly classified by the classifier) it is accepted.
EXAMPLE: Computer Ownership
The TRAINING SET for predicting who owns a computer is:
Customer (
Age
| 24
| 58
| 48
| 58
| 28
Salary
| 55,000
| 94,000
| 14,000
| 19,000
| 18,000
Job
| Programmer
| Doctor
| Laborer
| Domestic
| Builder
Owns Computer )
| yes
|
| no
|
| no
|
| no
|
| no
|
A Decision Tree (model) classifier built from TRAINING:
Is this a Near Neighbor Classifier?
Where are Near Neighborhoods involved?
Training subset at the bottom of each decision
path represents a near neighborhood of any
unclassified sample that traverses the decision
tree to that leaf.
The concept of “near” or “highly correlated” :
The unclassified sample meets the same set of
conditions or criteria as the near neighbors at the
bottom of that path of condition criteria.
We are using a different (accumulative)
“correlation” definition along each branch of the
decision tree and the subsets at the leaf of each
branch are true Near Neighbor sets for the
respective correlations or notions of nearness.
Neural Network classifiers
Neural Network classifier Training Adjusting the weights and biases
Through back-propagation
Until an acceptable performance.
Matrix of weights and biases –
Determiners of our near neighbor sets.
We continue to train by adjusting weights and biases until:
Near neighbor sets – inputs producing same class
Are sufficiently “near” to each other
To give us a level of accuracy.
Support Vector Machine (SVM)
classifiers
The very first step in Support Vector Machines (SVM) classification is
To isolate a neighborhood in which to examine the boundary
and the margins of the boundary between classes
(assuming a binary classification problem).
Thus, Support Vector Machines are Nearest Neighbor Classifiers also.
CONCLUSIONS AND FUTURE WORK
We have made the case that classification and prediction algorithms are nearest neighbor vote
classification and predictions.
The conclusion depends upon how one defines “near”.
We have shown that there are clearly “nearness” or “correlations” or “similarities that provide
these definitions.
Broadly speaking, this (NNC) is the way we always proceed in Classification.
Faced with a classification or prediction problem?
Head off the standard way of approaching Classification,
That of using a model-based classification method unless it just doesn’t work well
enough and only then using Nearest Neighbor Classification.
“It is all Nearest Neighbor Classification”
That standard NNC should be used UNLESS it takes too long.
Only then should one consider giving up accuracy (of your near neighbor set) for speed by using
a model (Decision Tree or Neural Network).
With a vertical data structure like P-trees NNC can be applied in most cases efficiently.