Data Mining: Supervised vs. Unsupervised Learning

Reference Books
• Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
• Larose, Daniel T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
• Pang-Ning Tan, Michael Steinbach, Vipin Kumar (2006). Introduction to Data Mining. Addison-Wesley.
• Alpaydın, E. (2010). Introduction to Machine Learning. Second edition. London: MIT Press.

Classification: k-Nearest Neighbors
Dr. Engin YILDIZTEPE
Supervised vs. Unsupervised Learning
• Supervised learning
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Classification: Definition
• In classification, there is a target categorical variable (e.g., income bracket) which is partitioned into predetermined classes or categories, such as high income, middle income, and low income.
Classification: Definition
• Training set: the set of tuples used for model construction. Given a collection of records:
  • Each record contains a set of attributes; one of the attributes is the class.
  • Each tuple/record is assumed to belong to a predefined class, as determined by the class attribute.
• Test set: a test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
  • The test set is independent of the training set (otherwise overfitting).

Illustrating Classification Task

Training set (used to learn the model):

Tid | Attrib1 | Attrib2 | Attrib3 | Class
 1  | Yes     | Large   | 125K    | No
 2  | No      | Medium  | 100K    | No
 3  | No      | Small   |  70K    | No
 4  | Yes     | Medium  | 120K    | No
 5  | No      | Large   |  95K    | Yes
 6  | No      | Medium  |  60K    | No
 7  | Yes     | Large   | 220K    | No
 8  | No      | Small   |  85K    | Yes
 9  | No      | Medium  |  75K    | No
10  | No      | Small   |  90K    | Yes

Test set (the learned model is applied to these records):

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   |  55K    | ?
12  | Yes     | Medium  |  80K    | ?
13  | Yes     | Large   | 110K    | ?
14  | No      | Small   |  95K    | ?
15  | No      | Large   |  67K    | ?
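The holdout procedure described above (dividing the given data set into a training set and a test set) can be sketched in Python. This is an illustrative sketch, not part of the slides; the 70/30 split ratio and the function name are assumptions:

```python
import random

def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into training and test sets."""
    shuffled = records[:]  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]  # (training set, test set)

# Toy records: (Tid, Attrib1, Attrib2, Attrib3, Class), as in the table above
records = [(1, "Yes", "Large", 125, "No"), (2, "No", "Medium", 100, "No"),
           (3, "No", "Small", 70, "No"), (4, "Yes", "Medium", 120, "No"),
           (5, "No", "Large", 95, "Yes"), (6, "No", "Medium", 60, "No"),
           (7, "Yes", "Large", 220, "No"), (8, "No", "Small", 85, "Yes"),
           (9, "No", "Medium", 75, "No"), (10, "No", "Small", 90, "Yes")]

train, test = holdout_split(records)  # 7 training records, 3 test records
```

Shuffling before splitting matters: if the records are ordered (e.g., by class), a plain slice would give the model a biased training set.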
Classification: Definition
• Model construction: find a model for the class attribute as a function of the values of the other attributes.
  • The model is represented as classification rules, decision trees, or mathematical formulae.
• Goal: previously unseen or new records should be assigned a class as accurately as possible.
  • The accuracy rate is the percentage of test-set samples that are correctly classified by the model. If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Process (1): Model Construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof |   3   | no
Mary | Assistant Prof |   7   | yes
Bill | Professor      |   2   | yes
Jim  | Associate Prof |   7   | yes
Dave | Assistant Prof |   6   | no
Anne | Associate Prof |   3   | no

A classification algorithm applied to the training data produces the classifier (model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Process (2): Using the Model in Prediction

The classifier is first evaluated on the testing data, then applied to unseen data.

Test data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof |   2   | no
Merlisa | Associate Prof |   7   | no
George  | Professor      |   5   | yes
Joseph  | Assistant Prof |   7   | yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Prediction Problems: Classification vs. Numeric Prediction
• Classification
  • predicts categorical class labels (discrete or nominal)
  • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
• Numeric prediction
  • models continuous-valued functions, i.e., predicts unknown or missing values
Classification Techniques
• Instance-Based Classifiers
• Decision Tree based Methods
• Neural Networks
• Other Classification Methods
  • Rule-based Methods
  • Memory based reasoning
  • Bayes Classification Methods
  • Support Vector Machines

Lazy Learning
• Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
• Lazy: less time in training but more time in predicting
Instance-Based Methods (Lazy Learners)
• Instance-based learning:
  • Store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified
• Typical approaches
  • k-nearest neighbor approach
    • Instances are represented as points in a Euclidean space
  • Locally weighted regression
  • Case-based reasoning

K-Nearest Neighbor Classifiers
• Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
• [Figure: given a test record, compute the distance to the training records and choose the k "nearest" records.]
K-Nearest-Neighbor Classifiers
• Definition: the k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
• [Figure: an unknown record z with (a) its 1-nearest neighbor, (b) its 2 nearest neighbors, and (c) its 3 nearest neighbors]
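The three steps above (compute distances, identify the k nearest neighbors, take a majority vote) can be sketched as a small Python function. The function and variable names are illustrative, not from the slides:

```python
import math
from collections import Counter

def knn_classify(query, training_records, k):
    """Classify `query` by majority vote among its k nearest training records.

    Each training record is a (point, class_label) pair; distances are Euclidean.
    """
    # 1. Compute the distance from the query to every training record
    distances = [(math.dist(query, point), label)
                 for point, label in training_records]
    # 2. Identify the k nearest neighbors
    neighbors = sorted(distances)[:k]
    # 3. Majority vote among the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage (points and labels are illustrative):
train = [((1, 1), "+"), ((2, 2), "+"), ((8, 8), "-"), ((9, 9), "-")]
label = knn_classify((1.5, 1.5), train, k=3)  # two "+" vs. one "-" → "+"
```

Note that no model is built here: all the work happens at classification time, which is exactly what makes k-NN a lazy learner.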
Nearest Neighbor Classification
• Compute the distance between two points:
  • Euclidean distance:
    d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
• Determine the class from the nearest-neighbor list
  • take the majority vote of class labels among the k nearest neighbors

Nearest Neighbor Classification…
• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…
• Scaling issues
  • Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes (normalization)
  • Example:
    • the height of a person may vary from 1.5 m to 1.8 m
    • the weight of a person may vary from 40 kg to 100 kg
    • the income of a person may vary from $10K to $1M

Nearest Neighbor Classification
• k-NN classifiers are lazy learners
  • They do not build models explicitly
  • They are not eager learners such as decision tree induction and rule-based systems
• Classifying unknown records is relatively expensive.
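Min-max normalization is one common way to put attributes such as height, weight, and income on a comparable scale before computing distances. A sketch (the ranges come from the example above; the function name is an assumption, and the sample values are made up):

```python
def min_max_normalize(value, lo, hi):
    """Rescale `value` from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

# Without scaling, income ($10K-$1M) dominates the Euclidean distance;
# after scaling, each attribute contributes on the same [0, 1] scale.
height = min_max_normalize(1.65, 1.5, 1.8)               # midpoint → ≈ 0.5
weight = min_max_normalize(70, 40, 100)                  # midpoint → 0.5
income = min_max_normalize(505_000, 10_000, 1_000_000)   # midpoint → 0.5
```

After this transformation a 10 cm height difference and a $100K income difference no longer differ by five orders of magnitude in the distance computation.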
Example

X1 | X2 | class
 5 |  4 | +
 4 |  7 | +
 1 |  6 | +
 2 |  7 | +
 2 |  4 | +
 2 |  2 | +
 1 |  6 | +
 4 |  1 | +
 6 |  1 | +
 4 |  1 | +
10 | 10 | -
 5 |  8 | -
10 |  5 | -
 8 |  4 | -
 8 |  6 | -
 5 |  8 | -
 4 |  5 | -
 7 |  4 | -
 6 |  9 | -
 7 |  7 | -
 5 |  8 | -
10 |  6 | -
10 |  6 | -
10 |  4 | -
 8 |  3 | ?
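Assuming the table above is meant as a k-NN exercise, a sketch that classifies the unlabeled record (8, 3) by Euclidean distance (the choice of k = 3 is an assumption, not given in the slides):

```python
import math
from collections import Counter

# The 24 labeled records from the table: ((X1, X2), class)
data = [((5, 4), "+"), ((4, 7), "+"), ((1, 6), "+"), ((2, 7), "+"),
        ((2, 4), "+"), ((2, 2), "+"), ((1, 6), "+"), ((4, 1), "+"),
        ((6, 1), "+"), ((4, 1), "+"),
        ((10, 10), "-"), ((5, 8), "-"), ((10, 5), "-"), ((8, 4), "-"),
        ((8, 6), "-"), ((5, 8), "-"), ((4, 5), "-"), ((7, 4), "-"),
        ((6, 9), "-"), ((7, 7), "-"), ((5, 8), "-"), ((10, 6), "-"),
        ((10, 6), "-"), ((10, 4), "-")]

query = (8, 3)  # the record whose class is "?"

# Distance to every labeled record, then keep the k = 3 nearest
neighbors = sorted((math.dist(query, p), label) for p, label in data)[:3]
# The three nearest points are (8, 4), (7, 4), and (10, 4), all labeled "-"
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
```

All three nearest neighbors carry the "-" label, so the majority vote assigns "-" to (8, 3).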