CSE591 Data Mining
3. Classification Methods
Patterns and Models
Regression, NBC
k-Nearest Neighbors
Decision Trees and Rules
Large size data
9/03, Data Mining – Classification, G Dong
Models and Patterns
• A model is a global description of data, or an
abstract representation of a real-world process
– Estimating parameters of a model
– Data-driven model building
– Examples: Regression, Graphical model (BN), HMM
• A pattern is about some local aspects of data
– Patterns in data matrices
• Predicates (age < 40) ^ (income < 10)
– Patterns for strings (ASCII characters, DNA alphabet)
– Pattern discovery: rules
Performance Measures
• Generality
– How many instances are covered
• Applicability
– Is it useful? ("All husbands are male" is general and accurate, but not useful.)
• Accuracy
– Is it always correct? If not, how often?
• Comprehensibility
– Is it easy to understand? (a subjective measure)
Forms of Knowledge
• Concepts
– Probabilistic, logical (proposition/predicate), functional
• Rules
• Taxonomies and Hierarchies
– Dendrograms, decision trees
• Clusters
• Structures and Weights/Probabilities
– ANN, BN
Induction from Data
• Inferring knowledge from data - generalization
• Supervised vs. unsupervised learning
– Some graphical illustrations of learning tasks
(regression, classification, clustering)
– Any other types of learning?
• Compare: The task of deduction
– Infer information/fact that is a logical consequence of
facts in a database
• Who is John’s grandpa? (deduced from e.g. Mary is John’s
mother, Joe is Mary’s father)
– Deductive databases: extending the RDBMS
The Classification Problem
• From a set of labeled training data, build a system (a
classifier) for predicting the class of future data
instances (tuples).
• A related problem is to build a system from training
data to predict the value of an attribute (feature) of
future data instances.
What is a bad classifier?
• Some simplest classifiers
– Table-Lookup
• What if x cannot be found in the training data?
• We give up!?
– Or, we can …
• A simple classifier Cs can be built as a reference
– If x can be found in the table (training data), return its
class; otherwise, what should it return?
• A bad classifier is one that does worse than Cs.
• Do we need to learn a classifier for data of one class?
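The reference classifier Cs can be sketched in a few lines. The majority-class fallback for unseen instances below is one reasonable choice, not the slide's prescription (it leaves the question open), and the function name is this sketch's own:

```python
from collections import Counter

def build_cs(training):
    """Reference classifier Cs: exact table lookup over the training
    data, with a majority-class fallback for unseen instances."""
    table = {x: y for x, y in training}
    majority = Counter(y for _, y in training).most_common(1)[0][0]
    def classify(x):
        # Found in the table: return its stored class; otherwise fall back.
        return table.get(x, majority)
    return classify
```

Any learned classifier that predicts less accurately than this lookup-plus-fallback baseline is, by the slide's definition, a bad classifier.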
Many Techniques
• Decision trees
• Linear regression
• Neural networks
• k-nearest neighbour
• Naïve Bayesian classifiers
• Support Vector Machines
• and many more ...
Regression for Numeric Prediction
• Linear regression is a statistical technique applicable when the
class and all the attributes are numeric.
• y = α + βx, where α and β are regression
coefficients
• We need to use instances <xi, yi> to find α and β
– by minimizing SSE (least squares)
– SSE = Σi (yi - yi')² = Σi (yi - α - βxi)²
• Extensions
– Multiple regression
– Piecewise linear regression
– Polynomial regression
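For simple linear regression the least-squares estimates have a closed form: β is the ratio of the covariance of x and y to the variance of x, and α = ȳ - β x̄. A minimal sketch (function name is this sketch's own):

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha and beta for y = alpha + beta*x,
    minimizing SSE = sum((y_i - alpha - beta*x_i)**2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # beta = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta
```

For data lying exactly on y = 1 + 2x, e.g. xs = [0, 1, 2, 3] and ys = [1, 3, 5, 7], the fit recovers α = 1 and β = 2.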
Nearest Neighbor
• Also called instance-based learning
• Algorithm
– Given a new instance x,
– find its nearest neighbor <x’,y’>
– Return y’ as the class of x
• Distance measures
– Normalization?!
• Some interesting questions
– What’s its time complexity?
– Does it learn?
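The algorithm above can be sketched directly. This sketch assumes numeric attributes already on comparable scales (otherwise normalize first, as the slide hints); Euclidean distance is one common choice of measure:

```python
import math

def nearest_neighbor(train, x):
    """1-NN: return the class y' of the training instance x' closest
    to x under Euclidean distance. train: list of (vector, class)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    best = min(train, key=lambda pair: dist(pair[0], x))
    return best[1]
```

Note there is no training phase at all ("does it learn?"): each query scans the whole training set, so classification is linear in the number of stored instances.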
Nearest Neighbor (2)
• Dealing with noise – k-nearest neighbor
– Use more than 1 neighbor
– How many neighbors?
– Weighted nearest neighbors
• How to speed up?
– Huge storage
– Use representatives (a problem of instance selection)
• Sampling
• Grid
• Clustering
Naïve Bayes Classification
• This is a direct application of Bayes’ rule
• P(C|x) = P(x|C)P(C)/P(x), where x is a vector (x1, x2, …, xn)
• With the true probabilities, that's the best classifier you can build
– You don't even need to select features; it takes care of that
automatically
• But, there are problems
– Only a limited number of instances are available
– How do we estimate P(x|C)?
NBC (2)
• Assume conditional independence between the xi's
• Then P(C|x) ≈ P(x1|C) P(x2|C) … P(xn|C) P(C)
• How good is it in reality?
• Let's build one NBC for a very simple data set
– Estimate the priors and conditional probabilities with
the training data
– P(C=1) = ? P(C=2) =? P(x1=1|C=1)? P(x1=2|C=1)? …
– What is the class for x = (1,2,1)?
P(1|x) ≈ P(x1=1|1) P(x2=2|1) P(x3=1|1) P(1),
P(2|x) ≈ P(x1=1|2) P(x2=2|2) P(x3=1|2) P(2)
– What is the class for (1,2,2)?
Example of NBC
Training data, read row by row (A1, A2, A3 are attributes, C the class):

A1  A2  A3  C
1   2   1   1
0   0   1   1
2   1   2   2
1   2   1   2
0   1   2   1
2   2   2   2
1   0   1   1

Counts per class, tallied from the data above:

       C=1  C=2
total    4    3
A1=0     2    0
A1=1     2    1
A1=2     0    2
A2=0     2    0
A2=1     1    1
A2=2     1    2
A3=1     3    1
A3=2     1    2
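A minimal NBC over this data set, estimating the prior and the conditional probabilities as relative frequencies (the slide's training data is hard-coded below, read row-wise; the function name is this sketch's own):

```python
from collections import Counter

# Training data from the slide: (A1, A2, A3) -> C
data = [((1, 2, 1), 1), ((0, 0, 1), 1), ((2, 1, 2), 2),
        ((1, 2, 1), 2), ((0, 1, 2), 1), ((2, 2, 2), 2),
        ((1, 0, 1), 1)]

def nbc_classify(data, x):
    """Pick the class maximizing P(C) * prod_i P(x_i|C), with every
    probability estimated as a relative frequency from the data."""
    n = len(data)
    classes = Counter(c for _, c in data)
    best, best_score = None, -1.0
    for c, nc in classes.items():
        score = nc / n                     # prior P(C=c)
        for i, v in enumerate(x):
            # conditional P(x_i = v | C = c)
            score *= sum(1 for xs, cs in data
                         if cs == c and xs[i] == v) / nc
        if score > best_score:
            best, best_score = c, score
    return best
```

Answering the slide's questions: x = (1,2,1) scores (2/4)(1/4)(3/4)(4/7) for class 1 against (1/3)(2/3)(1/3)(3/7) for class 2, so class 1 wins; for (1,2,2) the comparison flips and class 2 wins.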
Golf Data
Outlook  Temp  Humidity  Windy  Class
Sunny    Hot   High      No     Yes
Sunny    Hot   High      Yes    Yes
O'cast   Hot   High      No     No
Rain     Mild  Normal    No     No
Rain     Cool  Normal    No     No
Rain     Cool  Normal    Yes    Yes
O'cast   Cool  Normal    Yes    No
Sunny    Mild  High      No     Yes
Sunny    Cool  Normal    No     No
Rain     Mild  Normal    No     No
Sunny    Mild  Normal    Yes    No
O'cast   Mild  High      Yes    No
O'cast   Hot   Normal    No     No
Rain     Mild  High      Yes    Yes
Decision Trees
• A decision tree
Outlook?
  sunny:    Humidity?
              high:   NO
              normal: YES
  overcast: YES
  rain:     Wind?
              strong: NO
              weak:   YES
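One way to encode a tree like this and classify an instance with it. The nested tuple/dict representation (leaf = class string, internal node = attribute plus one branch per value) is a choice made for this sketch, not a prescribed format:

```python
# The example tree: leaves are class strings; internal nodes are
# (attribute, {value: subtree}) pairs.
golf_tree = ("Outlook", {
    "sunny": ("Humidity", {"high": "NO", "normal": "YES"}),
    "overcast": "YES",
    "rain": ("Wind", {"strong": "NO", "weak": "YES"}),
})

def predict(tree, instance):
    """Walk from the root, following the branch that matches the
    instance's value for each test attribute, until a leaf."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree
```

For example, a sunny, high-humidity day reaches the NO leaf, while any overcast day is classified YES without testing further attributes.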
How to `grow’ a tree?
• Randomly → Random Forests (Breiman, 2001)
• What are the criteria to build a tree?
– Accurate
– Compact
• A straightforward way to grow is
– Pick an attribute
– Split data according to its values
– Recursively do the first two steps until
• No data left
• No feature left
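The straightforward growing procedure above, as a recursive sketch. Attribute choice here is deliberately naive (just the first remaining attribute) to mirror the slide's "pick an attribute"; a real grower would plug in a selection criterion:

```python
from collections import Counter

def grow(rows, attributes):
    """Grow a tree from rows of (features_dict, label): pick an
    attribute, split on its values, recurse; stop when the node is
    pure, no data is left, or no feature is left."""
    if not rows:
        return None                                  # no data left
    labels = [y for _, y in rows]
    if len(set(labels)) == 1:
        return labels[0]                             # pure leaf
    if not attributes:
        # no feature left: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    a = attributes[0]                                # naive choice
    return (a, {v: grow([(x, y) for x, y in rows if x[a] == v],
                        attributes[1:])
                for v in set(x[a] for x, _ in rows)})
```

On two rows that differ only in attribute A, this yields a one-test tree with one leaf per value of A.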
Discussion
• There are many possible trees
– let’s try it on the golf data
• How to find the most compact one
– that is consistent with the data?
• Why the most compact?
– Occam’s razor principle
• Issue of efficiency w.r.t. optimality
– One attribute at a time or …
Grow a good tree efficiently
• The heuristic – to find commonality in feature
values associated with class values
– To build a compact tree generalized from the data
• It means we look for features and splits that can
lead to pure leaf nodes.
• Is it a good heuristic?
– What do you think?
– How to judge it?
– Is it really efficient?
– How to implement it?
Let’s grow one
• Measuring the purity of a data set – Entropy
• Information gain (see the brief review)
• Choose the feature with max gain
Example split on the Golf data: Outlook (7,7) → Sunny (5), O'cast (4), Rain (5)
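Entropy and information gain can be computed in a few lines; a minimal sketch (function names are this sketch's own):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p*log2(p) over class proportions; 0 = pure, 1 = 50/50."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def info_gain(rows, a):
    """Gain(a) = H(parent) - weighted average H of the children
    produced by splitting on attribute a. rows: (features_dict, label)."""
    labels = [y for _, y in rows]
    remainder = 0.0
    for v in set(x[a] for x, _ in rows):
        sub = [y for x, y in rows if x[a] == v]
        remainder += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - remainder
```

Growing then means: at each node, compute info_gain for every remaining attribute and split on the one with the maximum gain.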
Different numbers of values
• Different attributes can have varied numbers of
values
• Some treatments
– Removing useless attributes before learning
– Binarization
– Discretization
• Gain-ratio is another practical solution
– Gain = root-Info – InfoAttribute(i)
– Split-Info = -Σi ((|Ti|/|T|) log2 (|Ti|/|T|))
– Gain-ratio = Gain / Split-Info
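The Split-Info and Gain-ratio formulas above, as a sketch. Split-Info depends only on the sizes |Ti| of the subsets a split produces, so it grows for many-valued attributes and damps their otherwise inflated gain:

```python
import math

def split_info(sizes):
    """Split-Info = -sum (|Ti|/|T|) * log2(|Ti|/|T|) over the subset
    sizes produced by a split; larger for many-valued attributes."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """Gain-ratio = Gain / Split-Info for a candidate split."""
    return gain / split_info(sizes)
```

For instance, an even two-way split has Split-Info 1 bit, while an even four-way split has Split-Info 2 bits, so the same raw gain yields half the gain-ratio.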
Another kind of problem
• A difficult problem. Why is it difficult?
• Similar ones are the Parity and Majority problems.
XOR problem
x1  x2  class
0   0   0
0   1   1
1   0   1
1   1   0
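Why XOR is difficult for a one-attribute-at-a-time grower can be checked numerically: splitting on either attribute alone leaves 50/50 children, so the information gain of every single-attribute split is zero and the greedy heuristic has nothing to prefer. A self-contained sketch:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def gain(rows, i):
    """Information gain of splitting on attribute index i."""
    labels = [y for _, y in rows]
    rem = 0.0
    for v in {x[i] for x, _ in rows}:
        sub = [y for x, y in rows if x[i] == v]
        rem += len(sub) / len(rows) * entropy(sub)
    return entropy(labels) - rem

# XOR truth table from the slide: class = x1 XOR x2
xor_rows = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

Both gain(xor_rows, 0) and gain(xor_rows, 1) come out to 0: each child keeps the parent's 1-bit entropy.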
Tree Pruning
• Overfitting: Model fits training data too well, but
won’t work well for unseen data.
• An effective approach to avoid overfitting and for
a more compact tree (easy to understand)
• Two general ways to prune
– Pre-pruning: stop splitting early
• e.g., stop when there is no significant difference in classification
accuracy before and after the division
– Post-pruning to trim back
Rules from Decision Trees
• Two types of rules
– Order sensitive (more compact, less efficient)
– Order insensitive
• The most straightforward way is …
• Class-based method
– Group rules according to classes
– Select most general rules (or remove redundant ones)
• Data-based method
– Select one rule at a time (keep the most general one)
– Work on the remaining data until all data is covered
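The most straightforward way alluded to above is one rule per root-to-leaf path. A sketch, with the example tree encoded as nested tuples and dicts (a representation chosen for this sketch: leaf = class string, internal node = attribute plus one branch per value):

```python
def tree_to_rules(tree, conditions=()):
    """Collect one rule (conditions -> class) per root-to-leaf path.
    The resulting rules are order-insensitive."""
    if isinstance(tree, str):
        return [(conditions, tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree,
                               conditions + ((attribute, value),))
    return rules

golf_tree = ("Outlook", {
    "sunny": ("Humidity", {"high": "NO", "normal": "YES"}),
    "overcast": "YES",
    "rain": ("Wind", {"strong": "NO", "weak": "YES"}),
})
```

The example tree yields five rules, e.g. (Outlook=overcast) → YES and (Outlook=sunny, Humidity=high) → NO; class-based or data-based post-processing would then generalize or prune this set.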
Variants of Decision Trees and Rules
• Tree stumps
• Holte’s 1R rules (1992)
– For each attribute A
• Sort according to its values v
• Find the most frequent class value c for each v
– Breaking tie with coin flipping
• Output the most accurate rule, of the form: if A = v then c
– An example (the Golf data)
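The 1R procedure above can be sketched as follows. Ties here are broken deterministically by Counter rather than by the coin flip the slide mentions, and the accuracy measure is training-set accuracy:

```python
from collections import Counter, defaultdict

def one_r(rows):
    """Holte's 1R sketch: for each attribute, predict the most frequent
    class per value; return the attribute (and its value->class rule)
    that classifies the most training rows correctly.
    rows: list of (features_dict, label)."""
    best = None
    for a in rows[0][0]:
        by_value = defaultdict(Counter)
        for x, y in rows:
            by_value[x[a]][y] += 1
        # correctly classified rows if we predict per-value majorities
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        if best is None or correct > best[0]:
            best = (correct, a, {v: c.most_common(1)[0][0]
                                 for v, c in by_value.items()})
    return best[1], best[2]
```

On data where one attribute determines the class and another is noise, 1R selects the determining attribute.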
Handling Large Size Data
• When data simply cannot fit in memory …
– Is it a big problem?
• Three representative approaches
– Smart data structures to avoid unnecessary recalculation
• Hash trees
• SPRINT
– Sufficient statistics
• AVC-set (Attribute-Value, Class label) to summarize the class
distribution for each attribute
• Example: RainForest
– Parallel processing
• Make data parallelizable
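The AVC-set idea can be illustrated in a few lines: one sequential scan of the data produces, per attribute, counts indexed by (attribute value, class label), and those counts are all a split-selection measure such as information gain needs. A dict-based sketch (the real RainForest structures are more elaborate):

```python
from collections import defaultdict

def avc_sets(rows):
    """Build one AVC-set per attribute: counts keyed by
    (attribute value, class label), gathered in a single pass so the
    full data never has to stay in memory at a tree node."""
    avc = defaultdict(lambda: defaultdict(int))
    for x, y in rows:                      # one sequential scan
        for a, v in x.items():
            avc[a][(v, y)] += 1
    return {a: dict(counts) for a, counts in avc.items()}
```

Each AVC-set is tiny (one counter per attribute-value/class pair) regardless of how many rows the data has.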
Ensemble Methods
• A group of classifiers
– Hybrid (Stacking)
– Single type
• Strong vs. weak learners
• A good ensemble
– Accuracy
– Diversity
• Some major approaches to forming ensembles
– Bagging
– Boosting
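Bagging is the simpler of the two to sketch: train each member on a bootstrap sample (drawn with replacement) and predict by majority vote, with diversity coming from the resampling. A minimal sketch, where `learn` can be any function mapping a training set to a classifier (the trivial majority-class learner below is just a stand-in for the demo):

```python
import random
from collections import Counter

def bagging(rows, learn, n_models=11, seed=0):
    """Bagging sketch: fit n_models classifiers on bootstrap samples
    of rows, then classify by majority vote of the members."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap sample: same size as rows, drawn with replacement
        sample = [rng.choice(rows) for _ in rows]
        models.append(learn(sample))
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

def majority_learner(rows):
    """Stand-in base learner: always predict the sample's majority class."""
    m = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: m
```

A weak but unstable base learner (e.g. an unpruned tree stump) benefits most; a stable learner yields near-identical members and little diversity, which is why this learner choice matters more than the vote itself.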
Bibliography
• I.H. Witten and E. Frank. Data Mining – Practical Machine
Learning Tools and Techniques with Java Implementations.
2000. Morgan Kaufmann.
• M. Kantardzic. Data Mining – Concepts, Models,
Methods, and Algorithms. 2003. IEEE.
• J. Han and M. Kamber. Data Mining – Concepts and
Techniques. 2001. Morgan Kaufmann.
• D. Hand, H. Mannila, P. Smyth. Principles of Data Mining.
2001. MIT Press.
• T. G. Dietterich. Ensemble Methods in Machine Learning. In
J. Kittler and F. Roli (eds.), 1st Intl Workshop on Multiple
Classifier Systems, pp. 1-15, Springer-Verlag, 2000.