SATOMGI Data Mining and Matching
Lecture 5: Introduction to classification and prediction

Lecture outline
• Classification and prediction
• Classification – A two-step process
• Supervised versus un-supervised learning
• Evaluating classification methods
• Decision tree induction
• Overfitting
• Enhancements to basic decision tree induction
• Classification in large databases
• Other classification techniques
• Measuring classifier accuracy
• Evaluating the accuracy of a classifier
• What is prediction

Classification and prediction
• Classification
• Predicts categorical class labels (discrete or nominal)
• Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
• Prediction
• Models continuous-valued functions (predicts unknown or missing values)
• Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
...
Classification – A two-step process
• Model construction: Describing a set of predetermined
classes
• Each tuple/sample/record is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulas
• Model usage: For classifying future or unknown objects
• Need to estimate the accuracy of the model
• The known label of a test sample is compared with the classified results
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• The test set must be independent of the training set, otherwise the accuracy estimate will be overly optimistic (overfitting goes undetected)
• If the accuracy is acceptable, use the model to classify data objects whose
class labels are not known
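The two-step process can be sketched in a few lines of code. The following minimal example is an illustration added to this transcript; scikit-learn, the synthetic data set, and the choice of a decision tree as the classification algorithm are all assumptions, not part of the lecture.

```python
# Minimal sketch of the two-step process (illustrative only, not from the lecture).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labelled data standing in for a real data set
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: model construction on the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage -- first estimate accuracy on the independent test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Estimated accuracy: {accuracy:.2%}")
# If the accuracy is acceptable, model.predict() is applied to records with unknown labels
```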
Example: Model construction
Training Data → Classification Algorithm → Classifier (Model)

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)

Example: Using the model in classification
Testing Data → Classifier (Model) → Accuracy: 75%

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)
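As a concrete check, the learned rule can be applied to the testing data above to reproduce the 75% accuracy figure. This short calculation was written for this transcript; the tuple representation is an assumption, but the records and the rule come from the slides.

```python
# Apply the learned rule to the test records from the slide and measure accuracy.
test_records = [
    # (name, rank, years, actual tenured label)
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(classify(rank, years) == actual
              for _, rank, years, actual in test_records)
print(f"Accuracy: {correct / len(test_records):.0%}")  # 3 of 4 correct -> 75%
```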
Supervised versus un-supervised learning
• Supervised learning (classification and prediction)
• Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Un-supervised learning (clustering and association rule mining)
• The class label of training data is unknown
• Given a set of measurements, observations, etc., with the aim of establishing the existence of associations or clusters in the data

Evaluating classification methods
• Accuracy (or other quality measures)
• Classifier accuracy: Predicting class labels
• Predictor accuracy: Guessing values of predicted attributes
• Speed and complexity
• Time to construct the model (training time)
• Time to use the model (classification/prediction time)
• Scalability: Efficiency in handling disk-based databases
• Robustness: Handling noise and outliers
• Interpretability: Understanding and insight provided by the model
• Other measures (for example, goodness of rules)
• Such as decision tree size or compactness of classification rules
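The supervised/un-supervised distinction can be made concrete with a short sketch (not part of the original slides; scikit-learn and the synthetic data are assumptions): a classifier is fitted to the data together with its class labels, whereas a clustering algorithm receives only the observations.

```python
# Supervised vs. un-supervised learning in two calls (illustrative sketch only).
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=150, centers=3, random_state=0)  # observations + labels

# Supervised: the training data is accompanied by class labels y
classifier = DecisionTreeClassifier().fit(X, y)

# Un-supervised: only the observations X are given; clusters are discovered
clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(classifier.predict(X[:5]), clustering.labels_[:5])
```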
Decision tree induction: Example training data set
What rule would you learn to identify who buys a computer?

Age     Income  Student  Credit rating  Buys computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)
Algorithm for decision tree induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer manner
• At start, all training examples are at the root
• Take the best immediate (local) decision while building the tree
• Data is partitioned recursively based on selected attributes
• Attributes are categorical (if continuous-valued, they need to be discretised
in advance)
• Attributes are selected on the basis of a heuristic or statistical measure
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning (majority voting is
employed for classifying the leaf)
• There are no samples left
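A compact sketch of this greedy, top-down procedure is given below. It was written for this transcript (the slides contain no code), it uses information gain as the attribute selection measure (the slides leave the measure open), and it is run on the "buys computer" training data from the previous slide.

```python
# Greedy top-down decision tree induction (ID3-style sketch, illustrative only).
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * log2(c / len(labels)) for c in counts.values())

def build_tree(records, attributes):
    labels = [r["Buys computer"] for r in records]
    if len(set(labels)) == 1:                       # all samples in one class
        return labels[0]
    if not attributes:                              # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def info_gain(attr):                            # heuristic measure for attribute selection
        groups = Counter(r[attr] for r in records)
        remainder = sum(
            count / len(records)
            * entropy([r["Buys computer"] for r in records if r[attr] == value])
            for value, count in groups.items())
        return entropy(labels) - remainder

    best = max(attributes, key=info_gain)           # best immediate (local) decision
    tree = {best: {}}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]   # recursive partitioning
        tree[best][value] = build_tree(subset, [a for a in attributes if a != best])
    return tree

# Training data from the "buys computer" slide
data = [
    {"Age": "<=30",   "Income": "high",   "Student": "no",  "Credit": "fair",      "Buys computer": "no"},
    {"Age": "<=30",   "Income": "high",   "Student": "no",  "Credit": "excellent", "Buys computer": "no"},
    {"Age": "31..40", "Income": "high",   "Student": "no",  "Credit": "fair",      "Buys computer": "yes"},
    {"Age": ">40",    "Income": "medium", "Student": "no",  "Credit": "fair",      "Buys computer": "yes"},
    {"Age": ">40",    "Income": "low",    "Student": "yes", "Credit": "fair",      "Buys computer": "yes"},
    {"Age": ">40",    "Income": "low",    "Student": "yes", "Credit": "excellent", "Buys computer": "no"},
    {"Age": "31..40", "Income": "low",    "Student": "yes", "Credit": "excellent", "Buys computer": "yes"},
    {"Age": "<=30",   "Income": "medium", "Student": "no",  "Credit": "fair",      "Buys computer": "no"},
    {"Age": "<=30",   "Income": "low",    "Student": "yes", "Credit": "fair",      "Buys computer": "yes"},
    {"Age": ">40",    "Income": "medium", "Student": "yes", "Credit": "fair",      "Buys computer": "yes"},
    {"Age": "<=30",   "Income": "medium", "Student": "yes", "Credit": "excellent", "Buys computer": "yes"},
    {"Age": "31..40", "Income": "medium", "Student": "no",  "Credit": "excellent", "Buys computer": "yes"},
    {"Age": "31..40", "Income": "high",   "Student": "yes", "Credit": "fair",      "Buys computer": "yes"},
    {"Age": ">40",    "Income": "medium", "Student": "no",  "Credit": "excellent", "Buys computer": "no"},
]
print(build_tree(data, ["Age", "Income", "Student", "Credit"]))
```

On this data the sketch reproduces the tree shown on the next slide: age at the root, a student test for the <=30 branch, and a credit-rating test for the >40 branch.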
Example output: A decision tree for “Buys computer”

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.)
Overfitting and tree pruning
• Overfitting: An induced tree may overfit the training data
• Too many branches, some may reflect anomalies due to noise or outliers
• Such a tree classifies training records very well, but has poor accuracy on unseen records
• Two approaches to avoid overfitting
• Prepruning: Halt tree construction early: do not split a node if this would
result in a goodness measure falling below a threshold (difficult to choose
an appropriate threshold)
• Postpruning: Remove branches from a “fully grown” tree: get a sequence of
progressively pruned trees (use a set of data different from the training data
to decide which is the “best pruned tree”)
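As a sketch of postpruning (not from the slides), scikit-learn's cost-complexity pruning can generate a sequence of progressively pruned trees, with a separate validation set deciding which pruned tree is best; the synthetic data and the use of scikit-learn are assumptions made for this illustration.

```python
# Postpruning sketch: grow a full tree, then pick the best pruned tree on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# One tree per pruning level; keep the one that scores best on the validation data
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_val, y_val))
print("Leaves in best pruned tree:", best.get_n_leaves())
```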
Enhancements to basic decision tree induction
• Allow for continuous-valued attributes
• Dynamically define new discrete-valued attributes that partition the
continuous attribute value into a discrete set of intervals
• Handle missing attribute values
• Assign the most common value of the attribute
• Assign probability to each of the possible values
• Attribute construction
• Create new attributes based on existing ones that are sparsely represented
• This reduces fragmentation, repetition, and replication
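A minimal sketch of the first missing-value strategy (assigning the most common value of the attribute); pandas, the column name, and the toy values are assumptions made for this illustration.

```python
# Fill missing attribute values with the attribute's most common value (its mode).
import pandas as pd

df = pd.DataFrame({"credit_rating": ["fair", "excellent", None, "fair", None]})
most_common = df["credit_rating"].mode()[0]                  # 'fair' in this toy example
df["credit_rating"] = df["credit_rating"].fillna(most_common)
print(df)
```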
Classification in large databases
• Classification: a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
• Relatively fast learning speed (compared to other classification methods)
• Convertible to simple and easy to understand classification rules
• Can use SQL queries for accessing databases
• Comparable classification accuracy with other methods
Other classification methods
• Bayesian classification (a statistical classifier based on probabilities, assigns new cases to the most likely class)
• Rule based classification (rules can be generated from a decision tree, or directly from data)
• For example: if age = “31..40” then “buys computer” = “yes”
• Support vector machines (transform data into a high-dimensional space to make classes linearly separable)
• Artificial neural networks (use simple neurons and learn weights of connections between neurons)
• k-nearest-neighbour classifier (classify a new record according to the class of its k nearest neighbours)

Measuring classifier accuracy
• Classification of new tuples/records results in a confusion matrix with positive and negative examples

                      Predicted class
                      C1               C2
True class  C1        True positive    False negative
            C2        False positive   True negative

• Accuracy of a classifier M, acc(M), is the percentage of test set tuples that are correctly classified by the model M
• accuracy = (true_pos + true_neg) / (all_pos + all_neg)
• Error rate (misclassification rate) of M = 1 – acc(M)

Classifier accuracy measures

classes               buy_computer = yes   buy_computer = no   total    recognition (%)
buy_computer = yes    6954                 46                  7000     99.34
buy_computer = no     412                  2588                3000     86.27
total                 7366                 2634                10000    95.42

• Accuracy in this example is: (6954 + 2588) / 10000 = 95.42%
• Ideally, non-diagonal entries should be zero or close to it
• Alternative accuracy measures:
• sensitivity = true_pos / all_pos (true positive recognition rate, also called recall)
• specificity = true_neg / all_neg (true negative recognition rate)
• precision = true_pos / (true_pos + false_pos) (also called positive predictive value)
• This model can also be used for cost-benefit analysis
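The measures above can be checked directly against the counts in the table; the short calculation below was written for this transcript, treating buy_computer = yes as the positive class.

```python
# Accuracy measures computed from the confusion matrix on the slide
# (positive class: buy_computer = yes).
true_pos, false_neg = 6954, 46    # actual yes: 7000
false_pos, true_neg = 412, 2588   # actual no:  3000
all_pos, all_neg = true_pos + false_neg, false_pos + true_neg

accuracy    = (true_pos + true_neg) / (all_pos + all_neg)   # 0.9542
error_rate  = 1 - accuracy                                  # 0.0458
sensitivity = true_pos / all_pos                            # 0.9934 (recall)
specificity = true_neg / all_neg                            # 0.8627
precision   = true_pos / (true_pos + false_pos)             # ~0.9441
print(accuracy, error_rate, sensitivity, specificity, precision)
```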
Evaluating the accuracy of a classifier
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (for example, 2/3) for model construction
• Test set (for example, 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
• At the i-th iteration, use Di as the test set and the others as the training set
• Leave-one-out: k folds where k = number of tuples, for small-sized data
• Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
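A brief sketch of k-fold cross-validation follows; it is an illustration added to this transcript, and scikit-learn and the synthetic data set are assumptions.

```python
# 10-fold (stratified) cross-validation: accuracy = average over the folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
print("Accuracy per fold:", scores.round(3))
print(f"Estimated accuracy: {scores.mean():.3f}")
```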
What is prediction?
• (Numerical) prediction is similar to classification
• Construct a model
• Use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification
• Classification predicts categorical class labels
• Prediction models continuous-valued functions
• Major method for prediction: regression
• Model the relationship between one or more independent or predictor variables and a dependent or response variable
• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods: generalised linear models, Poisson regression, log-linear models, regression trees
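For example, a simple linear regression fitted by least squares can serve as a predictor of a continuous value such as salary; the "years of experience" data below is invented purely for illustration and is not from the lecture.

```python
# Linear regression sketch: fit y = a*x + b by least squares and predict a new value.
import numpy as np

years  = np.array([1, 3, 5, 7, 10], dtype=float)         # predictor (independent) variable
salary = np.array([45, 55, 68, 80, 98], dtype=float)     # response (dependent) variable, in $1000s

a, b = np.polyfit(years, salary, deg=1)                   # slope and intercept
print(f"Predicted salary for 6 years: {a * 6 + b:.1f}k")  # prediction for an unseen input
```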
Lecture summary
• Classification and prediction are supervised learning techniques (training data with known class labels is needed)
• Many different techniques are available
• Various measures exist to assess the accuracy of a classifier
• For data mining, classification techniques have to be scalable with the size of the data as well as its dimensionality (number of attributes)
• Prediction can be used for continuous values, classification for categorical values
• Predict salary values based on past salary and other attributes
• Classify whether a patient should get a cancer test or not