ML410C
Projects in health informatics –
Project and information management
Data Mining
Last time…
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science
and mathematics
• Some (basic) data mining tasks
The Knowledge Discovery Process
Knowledge Discovery in Databases (KDD) is the nontrivial
process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data.
U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, “From Data Mining to
Knowledge Discovery in Databases”, AI Magazine 17(3): 37-54 (1996)
CRISP-DM: CRoss Industry Standard
Process for Data Mining
Shearer C., “The CRISP-DM model: the new blueprint for data mining”,
Journal of Data Warehousing 5 (2000) 13-22 (see also www.crisp-dm.org)
Today
DATE                   TIME         ROOM    TOPIC
MONDAY 2013-09-09      10:00-11:45  502     Introduction to data mining
WEDNESDAY 2013-09-11   09:00-10:45  501     Decision trees, rules and forests
FRIDAY 2013-09-13      10:00-11:45  Sal C   Evaluating predictive models and tools for data mining
Today
• What is classification
• Overview of classification methods
• Decision trees
• Forests
Predictive data mining
Our task
• Input: data representing objects that
have been assigned labels
• Goal: accurately predict labels for new
(previously unseen) objects
An example: email classification
Each row is an example (observation); the columns are features (attributes), and the last column is the class label.

Ex.   All caps   No. excl. marks   Missing date   No. digits in From:   Image fraction   Spam
e1    yes        0                 no             3                     0                yes
e2    yes        3                 no             0                     0.2              yes
e3    no         0                 no             0                     1                no
e4    no         4                 yes            4                     0.5              yes
e5    yes        0                 yes            2                     0                no
e6    no         0                 no             0                     0                no
Decision tree
[Figure: a decision tree over the e-mail features, with leaves labeled Spam = yes or Spam = no]
Rules
[Figure: classification rules derived from the data, each concluding Spam = yes or Spam = no]
Forests
[Figure: a forest of several decision trees built from the same data]
Classification
• What is the class of the following e-mail?
  – All caps: yes
  – No. excl. marks: 0
  – Missing date: yes
  – No. digits in From: 4
  – Image fraction: 0.3
Classification
• What is classification?
• Issues regarding classification and prediction
• Classification by decision tree induction
• Classification by Naïve Bayes classifier
• Classification by Nearest Neighbor
• Classification by Bayesian Belief networks
Classification
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute
– uses the model for classifying new data
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Why Classification? A motivating application
• Credit approval
– A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
– The history of past customers is used to train the classifier
– The classifier provides rules, which identify potentially
reliable future customers
Why Classification? A motivating application
• Credit approval
– Classification rule:
• If age = “31...40” and income = high
• then credit_rating = excellent
– Future customers
• Paul: age = 35, income = high → excellent credit rating
• John: age = 20, income = medium → fair credit rating
Classification—A Two-Step Process
• Model construction: describing a set of predetermined
classes
– Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision
trees, or mathematical formulas
Classification—A Two-Step Process
• Model usage: for classifying future or unknown objects
– Estimate accuracy of the model
• The known label of test samples is compared with the
classified result from the model
• Accuracy rate is the percentage of test set samples that
are correctly classified by the model
• Test set is independent of training set, otherwise overfitting will occur
Classification Process (1): Model Construction
A classification algorithm is applied to the training data to construct a classifier (model).

Training Data:
NAME   LDL      Glu   H/ATTACK
Mike   normal   3     no
Mary   high     8     yes
Bill   normal   7     yes
Jim    high     5     yes
Dave   normal   4     no

Classifier (Model):
IF LDL = 'high' OR Glu > 6 mmol/L
THEN Heart attack = 'yes'
Classification Process (2): Use the Model in Prediction
The classifier is first evaluated on the testing data (Accuracy = ?) and then used to classify unseen data.

Testing Data:
NAME      LDL      Glu   H/ATTACK
Tom       normal   5.5   no
Mellisa   low      5     no
George    high     7.5   yes
Joseph    high     5     yes

Unseen Data: (Jeff, high, 7.5) → Heart attack?
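A minimal sketch (my own code, not from the slides) of this second step: apply the rule learned on the previous slide to the testing data, compute the accuracy rate, and then classify the unseen record.

```python
# Testing data from the slide: (name, LDL, Glu, actual label)
testing_data = [
    ("Tom",     "normal", 5.5, "no"),
    ("Mellisa", "low",    5.0, "no"),
    ("George",  "high",   7.5, "yes"),
    ("Joseph",  "high",   5.0, "yes"),
]

def predict(ldl, glu):
    """Classifier from the previous slide: IF LDL = 'high' OR Glu > 6 THEN 'yes'."""
    return "yes" if ldl == "high" or glu > 6 else "no"

correct = sum(predict(ldl, glu) == label for _, ldl, glu, label in testing_data)
print(f"Accuracy = {correct / len(testing_data):.0%}")   # all 4 test samples are classified correctly

# Applying the model to the unseen record (Jeff, high, 7.5):
print("Jeff:", predict("high", 7.5))                     # -> 'yes'
```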
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Issues regarding classification and prediction:
Evaluating Classification Methods
• Predictive accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules (quality)
– decision tree size
– compactness of classification rules
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected attributes
– Tree pruning
• Identify and remove branches that reflect noise or outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision tree
Training Dataset
Example

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Output: A Decision Tree for "buys_computer"

age?
  <=30: student?
    no:  buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair:      buys_computer = yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– Samples are partitioned recursively based on selected attributes
– Test (split) attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
– There are no samples left
Algorithm for Decision Tree Induction (pseudocode)
Algorithm GenDecTree(Sample S, Attlist A)
1. Create a node N
2. If all samples in S are of the same class C, then label N with C; terminate
3. If A is empty, then label N with the most common class C in S (majority voting); terminate
4. Select a ∈ A with the highest information gain; label N with a
5. For each value v of a:
   a. Grow a branch from N with condition a = v
   b. Let Sv be the subset of samples in S with a = v
   c. If Sv is empty, then attach a leaf labeled with the most common class in S
   d. Else attach the node generated by GenDecTree(Sv, A − {a})
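As a concrete illustration, here is a minimal Python sketch of the pseudocode above (my own code, not from the slides), using the information-gain measure defined on the next slides. It assumes each sample is a dict of categorical attribute values plus a "class" entry; all names are illustrative.

```python
import math
from collections import Counter

def entropy(samples):
    """Info(D) = -sum_i p_i * log2(p_i), over the class labels found in samples."""
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(samples, attr):
    """Gain(A) = Info(D) - Info_A(D) for attribute attr."""
    total = len(samples)
    info_a = 0.0
    for v in set(s[attr] for s in samples):
        subset = [s for s in samples if s[attr] == v]
        info_a += (len(subset) / total) * entropy(subset)
    return entropy(samples) - info_a

def gen_dec_tree(samples, attributes):
    """Returns a class label (leaf) or a tuple (split attribute, {value: subtree})."""
    classes = [s["class"] for s in samples]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:            # step 2: all samples in one class
        return classes[0]
    if not attributes:                    # step 3: no attributes left -> majority vote
        return majority
    a = max(attributes, key=lambda att: info_gain(samples, att))   # step 4
    branches = {}
    for v in set(s[a] for s in samples):  # step 5: one branch per observed value of a
        s_v = [s for s in samples if s[a] == v]
        # (step 5c, an empty S_v, cannot occur here because we only branch on observed values)
        branches[v] = gen_dec_tree(s_v, [x for x in attributes if x != a])
    return (a, branches)
```

Run on the buys_computer table with attributes ['age', 'income', 'student', 'credit_rating'], this sketch should reproduce the tree shown above, with age at the root.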
Attribute Selection Measure: Information Gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|, where C_i,D denotes the set of tuples in D that belong to class C_i
• Expected information (entropy) needed to classify a tuple in D:
    Info(D) = − Σ_{i=1..m} p_i · log₂(p_i)
  where m is the number of classes
Attribute Selection Measure: Information Gain
• Information needed (after using attribute A to split D into v partitions) to classify D:
    Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) · Info(D_j)
• Information gained by branching on attribute A:
    Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
• Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
• Info(D) = I(9,5) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.940
• Splitting the training data (the buys_computer table above) on age:

  age     p_i   n_i   I(p_i, n_i)
  <=30    2     3     0.971
  31…40   4     0     0
  >40     3     2     0.971

• Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
• Gain(age) = Info(D) − Info_age(D) = 0.246
• Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
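A quick way to check these numbers (a sketch of my own, not part of the slides) is to recompute them directly from the class counts; I(p, n) here denotes the entropy of a partition with p "yes" and n "no" tuples.

```python
import math

def I(p, n):
    """Entropy of a partition with p 'yes' and n 'no' tuples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

info_D = I(9, 5)
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(f"Info(D)      = {info_D:.3f}")          # 0.940
print(f"Info_age(D)  = {info_age:.3f}")        # 0.694
print(f"Gain(age)    = {info_D - info_age:.3f}")
# -> 0.247 (the slide's 0.246 comes from subtracting the already-rounded values 0.940 - 0.694)
```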
Splitting the samples using age

age <= 30:
income   student   credit_rating   buys_computer
high     no        fair            no
high     no        excellent       no
medium   no        fair            no
low      yes       fair            yes
medium   yes       excellent       yes

age 31…40 (this node is labeled yes):
income   student   credit_rating   buys_computer
high     no        fair            yes
low      yes       excellent       yes
medium   no        excellent       yes
high     yes       fair            yes

age > 40:
income   student   credit_rating   buys_computer
medium   no        fair            yes
low      yes       fair            yes
low      yes       excellent       no
medium   yes       fair            yes
medium   no        excellent       no
Gini index
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 − Σ_{j=1..n} p_j²
  where p_j is the relative frequency of class j in D
• If a data set D is split on attribute A into two subsets D1 and D2, the gini index gini_A(D) is defined as
    gini_A(D) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)
Gini index
• Reduction in impurity:
    Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node
Gini index (CART, IBM IntelligentMiner)
Example:
• D has 9 tuples with buys_computer = "yes" and 5 with "no"
    gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose that attribute "income" partitions D into 10 records (D1: {low, medium}) and 4 records (D2: {high}).
Gini index
• Then:
    gini_{income ∈ {low,medium}}(D) = (10/14)·gini(D1) + (4/14)·gini(D2) = 0.443
  and, similarly, gini_{income ∈ {medium,high}}(D) = 0.450
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
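To make the arithmetic above reproducible, here is a small sketch (my own code, not from the slides) that computes the gini index of D and of the binary split on income, using the class counts from the buys_computer table.

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2, given the class counts of a partition."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Class counts (yes, no) in the buys_computer data:
#   D1 = income in {low, medium}: 7 yes, 3 no     D2 = income in {high}: 2 yes, 2 no
gini_D = gini([9, 5])                                     # 0.459
gini_split = 10/14 * gini([7, 3]) + 4/14 * gini([2, 2])   # 0.443
print(round(gini_D, 3), round(gini_split, 3))
```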
Comparing Attribute Selection Measures
• The two measures, in general, return good results but
– Information gain:
• biased towards multivalued attributes
– Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests (splits) that result in equal-sized partitions and purity in both partitions
Overfitting due to noise
Decision boundary is distorted by noise point
Overfitting due to insufficient samples
Why?
Overfitting due to insufficient samples
Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training records
that are irrelevant to the classification task
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node if this would result
in the goodness measure falling below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a sequence of
progressively pruned trees
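As an aside (not on the slides), postpruning of this kind is available in common libraries. A sketch assuming scikit-learn, whose cost-complexity pruning yields a sequence of progressively pruned trees indexed by ccp_alpha; the dataset is just a convenient built-in example.

```python
# Sketch: postpruning via cost-complexity pruning in scikit-learn (assumed available).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path yields a sequence of ccp_alpha values; each corresponds
# to one tree in the sequence of progressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```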
Occam’s Razor
• Given two models of similar generalization errors, one should prefer the
simpler model over the more complex model
• Therefore, one should include model complexity when evaluating a model
“entia non sunt multiplicanda praeter necessitatem”,
which translates to:
“entities should not be multiplied beyond necessity”.
Pros and Cons of decision trees
• Pros
  + Reasonable training time
  + Fast application
  + Easy to interpret
  + Easy to implement
  + Can handle a large number of features
• Cons
  – Cannot handle complicated relationships between features
  – Simple decision boundaries
  – Problems with lots of missing data
Some well-known decision tree learning implementations
• CART: Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth
• ID3: Quinlan JR (1986) Induction of decision trees. Machine Learning 1:81–106
• C4.5: Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann
• J48: implementation of C4.5 in WEKA
Handling missing values
• Remove attributes with missing values
• Remove examples with missing values
• Assume most frequent value
• Assume most frequent value given a class
• Learn the distribution of a given attribute
• Find correlation between attributes
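As an illustration of the "assume most frequent value (given a class)" strategies above, a small sketch of my own assuming pandas is available; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data with one missing LDL value.
df = pd.DataFrame({
    "LDL":   ["normal", "high", None, "high", "normal"],
    "Glu":   [3, 8, 7, 5, 4],
    "class": ["no", "yes", "yes", "yes", "no"],
})

# Assume the most frequent value overall:
df["LDL_global"] = df["LDL"].fillna(df["LDL"].mode()[0])

# Assume the most frequent value given the class:
df["LDL_per_class"] = df.groupby("class")["LDL"].transform(
    lambda s: s.fillna(s.mode()[0])
)
print(df)
```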
Handling missing values
Example: e4 has a missing value for A1, so it is sent down both branches with fractional weights proportional to how often each A1 value occurs among the examples with known values (2 of the 3 known values are "yes").

Ex.   A1    …   Class
e1    yes       +
e2    no        +
e3    yes       −
e4    ?         −

Split on A1:
  A1 = yes: e1 (w = 1), e3 (w = 1), e4 (w = 2/3)
  A1 = no:  e2 (w = 1), e4 (w = 1/3)
k-nearest neighbor classifiers
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor and (c) 3-nearest neighbor neighborhoods around a record x]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
k-nearest neighbor classification
• Given a data record x find its k closest points
– Closeness: ?
• Determine the class of x based on the classes
in the neighbor list
– Majority vote
– Weigh the vote according to distance
• e.g., weight factor w = 1/d²
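A minimal sketch of the two voting schemes above (my own plain-Python code; Euclidean distance is assumed as the closeness measure, and all names are illustrative).

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(x, training, k=3, weighted=False):
    """training is a list of (point, label) pairs; returns the predicted label for x."""
    neighbors = sorted(training, key=lambda pl: euclidean(x, pl[0]))[:k]
    votes = Counter()
    for point, label in neighbors:
        d = euclidean(x, point)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0   # w = 1/d^2
    return votes.most_common(1)[0][0]

# Usage: majority vote vs. distance-weighted vote
data = [((1, 1), "no"), ((1, 2), "no"), ((5, 5), "yes"), ((6, 5), "yes"), ((5, 6), "yes")]
print(knn_classify((2, 2), data, k=3))                  # -> 'no'
print(knn_classify((2, 2), data, k=3, weighted=True))   # -> 'no'
```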
Characteristics of nearest-neighbor
classifiers
• No model building (lazy learners)
– Lazy learners: computational time in classification
– Eager learners: computational time in model
building
• Decision trees try to find global models, while k-NN takes local information into account
• K-NN classifiers depend a lot on the choice of
proximity measure
Condorcet’s jury theorem (Condorcet, 1785)
• If each member of a jury is more likely to be right than wrong,
• then the majority of the jury, too, is more likely to be right than wrong,
• and the probability that the right outcome is supported by a majority of the jury is a (swiftly) increasing function of the size of the jury,
• converging to 1 as the size of the jury tends to infinity.
Condorcet’s jury theorem
[Figure: probability of error of the majority decision (y-axis, 0 to 0.3) versus the number of jury members (x-axis, 0 to 20); the error decreases towards 0 as the jury grows]
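To see the theorem numerically, a short sketch of my own that computes the probability that the majority of an n-member jury is wrong, assuming each member errs independently with probability 0.3.

```python
from math import comb

def majority_error(n, p_err=0.3):
    """P(majority of n independent jurors is wrong), each wrong with probability p_err."""
    # The majority is wrong if more than n/2 jurors err (odd n avoids ties).
    return sum(comb(n, k) * p_err**k * (1 - p_err)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 11, 21):
    print(n, round(majority_error(n), 4))
# The error shrinks with jury size: 0.3, 0.216, 0.1631, 0.0782, 0.0264
```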
Random forests
Random forests (Breiman 2001) are generated by
combining two techniques:
• bagging (Breiman 1996)
• the random subspace method (Ho 1998)
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996
T. K. Ho. The random subspace method for constructing decision forests.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998
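For reference, a hedged sketch of how a random forest is typically fit in practice (assuming scikit-learn is acceptable here): n_estimators controls the number of bagged trees and max_features the size of the random attribute subspace considered at each split.

```python
# Sketch only: random forest = bagging + random attribute subsets at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each grown on a bootstrap replicate (bagging)
    max_features="sqrt",  # random subspace: attributes considered at each split
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```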
Bagging
[Table: an example set E (e1–e6, with restaurant-style attributes Other, Bar, Fri/Sat, Hungry, Guests, Wait) shown next to a bootstrap replicate E' of E; in E' some examples occur more than once while others do not occur at all]
A bootstrap replicate E' of a set of examples E is created by
randomly selecting n = |E| examples from E with replacement.
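A one-line sketch of how such a bootstrap replicate can be drawn (my own code, assuming numpy is available).

```python
import numpy as np

rng = np.random.default_rng(0)
E = ["e1", "e2", "e3", "e4", "e5", "e6"]
# Bootstrap replicate: n = |E| draws from E with replacement.
E_prime = rng.choice(E, size=len(E), replace=True)
print(E_prime)   # some examples appear several times, others not at all
```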
Forests
[Figure: a forest of decision trees, each grown from its own bootstrap replicate of the data]
Today
• What is classification
• Overview of classification methods
• Decision trees
• Forests
Next time
DATE                   TIME         ROOM    TOPIC
MONDAY 2013-09-09      10:00-11:45  502     Introduction to data mining
WEDNESDAY 2013-09-11   09:00-10:45  501     Decision trees, rules and forests
FRIDAY 2013-09-13      10:00-11:45  Sal C   Evaluating predictive models and tools for data mining