Classification

Example: assigning a letter grade based on marks
  90 <= marks            A
  80 <= marks < 90       B
  70 <= marks < 80       C
  60 <= marks < 70       D
  marks < 60             F
Classification
•Predicts categorical class labels (discrete or nominal)
•Constructs a model from a training set whose tuples carry class labels in a classifying attribute, and uses the model to classify new data
Defn: Given a database D = {t1, t2, …, tn} of tuples and a set of classes C = {C1, C2, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class Cj.
Classification
Three basic methods used to solve classification
problems
•Specifying boundaries
•Using probability distributions p(ti/Cj)
•Using posterior probabilities p(Cj/ti)
Typical applications
•Credit approval - classify an applicant as a good or poor credit risk
•Target marketing - develop a profile of a good customer
•Medical diagnosis - develop a profile of stroke victims
•Fraud detection - determine whether a credit card purchase is fraudulent

Classification is a two-step process.
A classifier is built from a data set - the learning step.
The training data set contains tuples having attributes, one of which is a class label attribute.
Example: Training Data Set

Patient Id | Sore throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis (class label)
1  | Yes | Yes | Yes | Yes | Yes | Strep throat
2  | No  | No  | No  | Yes | Yes | Allergy
3  | Yes | Yes | No  | Yes | No  | Cold
4  | Yes | No  | No  | No  | No  | Strep throat
5  | No  | Yes | No  | Yes | No  | Cold
6  | No  | No  | No  | Yes | No  | Allergy
7  | No  | No  | Yes | No  | No  | Strep throat
8  | Yes | No  | No  | Yes | Yes | Allergy
9  | No  | Yes | No  | Yes | Yes | Cold
10 | Yes | Yes | No  | Yes | Yes | Cold
Since the class label is provided for every training tuple, this is known as supervised learning (classification).
Model construction:
•Describes a set of predetermined classes
•Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
•The set of tuples used for model construction is the training set
•The model is represented as classification rules, decision trees, or mathematical formulae

Model usage:
•For classifying future or unknown objects
•Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model
•Accuracy rate is the percentage of test set samples that are correctly classified by the model
•The test set is independent of the training set, otherwise over-fitting will occur
•If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Algorithms

The training data (the patient data set above) is fed to a classification algorithm, which produces the classifier (model) - here a decision tree:

Swollen Glands?
  Yes -> Diagnosis = Strep throat
  No  -> Fever?
           Yes -> Diagnosis = Cold
           No  -> Diagnosis = Allergy
Preparing data for classification
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
  – Ignore missing data, or assume a value for the missing data; this means that the missing value is taken to be a specific value all of its own
• Relevance analysis (feature selection)
  – Remove irrelevant or redundant attributes
  – Redundant attributes may be detected by correlation analysis
  – Improves classification efficiency and scalability
• Data transformation
  – Generalize and/or normalize data
  – Data reduction
Choosing Classification Algorithms
• Algorithm categorization
• Distance based
• Statistical
• Decision Tree Based
• Neural network
• Rule based
• Classification categorization
• Specifying boundaries-divides input space into regions
• Probabilistic- determine probability for each class and
assign tuple to the class with highest probability
Measuring Performance
• Performance of a classification algorithm is evaluated by the accuracy of the classification
• Computational costs - space and time requirements
• Scalability - efficient even for large databases
• Robustness - ability to make correct classifications in the presence of noisy data
• Overfitting problem - the classification fits the training data exactly but may not be applicable to a broader population of data
• Interpretability - insight provided by the classifier
Statistical-based algorithms
Straight-line regression analysis involves a response variable y and a single predictor variable x, and models y as a linear function of x:
  y = w0 + w1 x
where w0 (the y-intercept) and w1 (the slope) are regression coefficients.
These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line.
Let D be the training data set containing n data points (x1,y1), (x2,y2), …, (xn,yn). The regression coefficients can be estimated as
  w1 = ∑ (xi - x̄)(yi - ȳ) / ∑ (xi - x̄)²        w0 = ȳ - w1 x̄
where x̄ and ȳ are the means of the x and y values.
Yrs experience | Salary (in k)
 3 | 30
 8 | 57
 9 | 64
13 | 72
 3 | 36
 6 | 43
11 | 59
21 | 90
 1 | 20
16 | 83

x̄ = 9.1   ȳ = 55.4

w1 = [ (3-9.1)(30-55.4) + … ] / [ (3-9.1)² + (8-9.1)² + … ] = 3.5
w0 = 55.4 - (3.5)(9.1) = 23.6

y = 23.6 + 3.5x. Using this equation we can predict salary given experience (see the sketch below).
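As a rough sketch (not part of the slides), the same least-squares estimates can be reproduced in a few lines of Python; the variable names and rounding are my own.

```python
# Least-squares fit for the salary example above (data values taken from the slide).
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary in k

x_bar = sum(x) / len(x)   # 9.1
y_bar = sum(y) / len(y)   # 55.4

w1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
w0 = y_bar - w1 * x_bar

print(round(w1, 2), round(w0, 2))  # about 3.54 and 23.16; the slide rounds w1 to 3.5 first, giving 23.6
print(w0 + w1 * 10)                # predicted salary for 10 years of experience
```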
Multiple Linear regression
It is an extension of Straight line regression analysis
so as to involve more than one predictor variable
It allows response variable y to be modeled as a
linear function of n predictor variables or attributes
describing a tuple x as (x1,x2,…xn)
  y = w0 + w1 x1 + w2 x2 + w3 x3 + … + wn xn
The method of least squares can be extended to solve for w0, w1, etc.; the equations are much more complex and can be solved using statistical software packages.
The linear model is affected by the presence of noise or outliers (extreme, exceptional values).

Nonlinear regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into a linear regression model. For example,
    y = w0 + w1 x + w2 x² + w3 x³
  is convertible to linear form with the new variables x2 = x², x3 = x³:
    y = w0 + w1 x + w2 x2 + w3 x3
• Some models are intractably nonlinear (e.g., sums of exponential terms)
  – it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae
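As a hedged illustration of the polynomial-to-linear transformation (the data and the use of NumPy's lstsq are my own choices, not from the slides), the cubic model can be fit as an ordinary multiple linear regression once x2 = x² and x3 = x³ are introduced as new variables:

```python
import numpy as np

# Fit y = w0 + w1*x + w2*x2 + w3*x3, where x2 = x^2 and x3 = x^3 are new variables,
# by solving an ordinary (multiple) linear least-squares problem. Data are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 9.2, 20.5, 40.1, 70.3])

X = np.column_stack([np.ones_like(x), x, x**2, x**3])  # columns: 1, x, x2, x3
w, *_ = np.linalg.lstsq(X, y, rcond=None)              # least-squares coefficients
print(w)                                               # w0, w1, w2, w3
```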
Logistic regression
It uses a logistic curve. The logistic curve gives a value between 0 and 1, so it can be interpreted as the probability of class membership.
The formula for a univariate logistic curve is
    p = e^(c0 + c1 x1) / (1 + e^(c0 + c1 x1))
which can equivalently be written as
    log(p / (1 - p)) = c0 + c1 x1
Here p is the probability of being in the class.
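A minimal sketch of the univariate logistic curve follows; the coefficients c0 and c1 are hypothetical (in practice they are estimated from the training data, e.g. by maximum likelihood).

```python
import math

def logistic_probability(x1, c0=-4.0, c1=0.8):
    # p = e^(c0 + c1*x1) / (1 + e^(c0 + c1*x1)), the probability of class membership
    z = c0 + c1 * x1
    return math.exp(z) / (1.0 + math.exp(z))

# Classify by thresholding the class-membership probability at 0.5.
p = logistic_probability(6.0)
label = "in class" if p >= 0.5 else "not in class"
print(p, label)
```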
Bayesian Classification:
It is based on Bayes’ Theorem of conditional
probability.
It is a statistical classifier: performs probabilistic
prediction, i.e., predicts class membership
probabilities
A simple Bayesian classifier, the naïve Bayesian classifier, assumes that different attribute values are independent, which simplifies the computation.
It has comparable performance with decision tree
and selected neural network classifiers
Let X be a data tuple ("evidence"), described by the values of its n attributes.
Let H be a hypothesis that X belongs to class C.
Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X, i.e. the probability that X belongs to class C given the attribute description of X.
E.g., given that X is aged 31..40 with medium income, X will buy a computer.
P(H) is the prior probability of H, the initial probability.
E.g., X will buy a computer, regardless of age, income, …
P(H|X) is the posterior probability of H, the probability of H when the attributes of X are known.
P(X) is the prior probability of X, the probability that the sample data is in the observed range - the evidence.
E.g., the probability that a person is aged 31..40 with medium income.
P(X|H) is the posterior probability of X - the likelihood.
E.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income.
Bayes' theorem relates all these probabilities:
    P(H|X) = P(X|H) P(H) / P(X)
    Posterior = Likelihood × Prior / Evidence
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X).
This can be derived from Bayes' theorem:
    P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
If the class prior probabilities are not known, it can be assumed that all classes are equally likely:
    P(C1) = P(C2) = … = P(Cm)
and the problem reduces to maximizing P(X|Ci).
If the data set has many attributes, it is computationally expensive to compute P(X|Ci). To reduce computation, the assumption of class conditional independence is made: the attributes are assumed conditionally independent given the class, so
    P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
P(xk|Ci) is the number of tuples of class Ci in the training set D having the value xk, divided by the number of tuples of class Ci in D.
Example:
Classes:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Training data:

age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31…40  | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31…40  | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31…40  | medium | no      | excellent     | yes
31…40  | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667
= 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”)
= 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
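The same hand calculation can be checked with a short sketch (the dictionary layout and names are my own; the probabilities are the ones computed above from the table):

```python
priors = {"yes": 9/14, "no": 5/14}
cond = {   # P(attribute value | class), taken from the calculation above
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    p = priors[c]
    for attr in x:
        p *= cond[c][attr]       # class-conditional independence assumption
    scores[c] = p                # proportional to P(Ci|X); P(X) is ignored

print(scores)                         # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))    # 'yes'
```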
Zero-probability problem
Naïve Bayesian prediction requires each conditional
prob. to be non-zero. Otherwise, the predicted prob.
will be zero irrespective of all other probabilities
Ex. Suppose a dataset with 1000 tuples, income=low
(0), income= medium (990), and income = high (10),
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their
“uncorrected” counterparts and the problem of zero
probability is solved
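A small sketch of the Laplacian correction for the income example (counts as given above; the helper code is my own):

```python
counts = {"low": 0, "medium": 990, "high": 10}   # raw counts out of 1000 tuples
k = len(counts)                                  # number of distinct values (3)

total = sum(counts.values())
corrected = {v: (n + 1) / (total + k) for v, n in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003
```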
Advantages
• Easy to implement
• Only one scan of training data is required
• Good results obtained in most of the cases
• Can easily handle missing values
Disadvantages
• Assumption: class conditional independence,
therefore loss of accuracy
• Practically, dependencies exist among variables. E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the naïve Bayesian classifier; Bayesian belief networks are designed to model such dependencies.
Distance-based algorithms
Each tuple is assigned to the class to which it is most similar.
Each class is represented as a tuple; the representative for each class is the centre or centroid.
Each tuple ti is assigned to class Cj such that sim(ti,Cj) > sim(ti,Cl) for all Cl such that Cl ≠ Cj.
Each tuple must be compared to the centre for a class, and there is a fixed number of classes, so the complexity depends on the number of classes.
K Nearest Neighbors is a distance-based algorithm and a lazy learning algorithm: it simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
Distance-based algorithms
Similarity or distance measures may be used to
identify the alikeness of different items in the
database
The similarity between two tuples ti and tj, sim(ti, tj), in a database D is a mapping from D×D to the range [0,1].
Characteristics of a good similarity measure
1. sim(ti, ti)=1 for all ti
2. sim(ti, tj)=0 if ti and tj are not alike at all
3. sim(ti,tj) < sim(ti, tk) if ti is more like tk than it is like tj
Dice:    sim(ti,tj) = 2 ∑ tik tjk / ( ∑ tik² + ∑ tjk² )
Jaccard: sim(ti,tj) = ∑ tik tjk / ( ∑ tik² + ∑ tjk² - ∑ tik tjk )
Cosine:  sim(ti,tj) = ∑ tik tjk / sqrt( ∑ tik² ∑ tjk² )
Overlap: sim(ti,tj) = ∑ tik tjk / min( ∑ tik², ∑ tjk² )

Distance or dissimilarity measures are often used instead of similarity measures:

Euclidean: dis(ti,tj) = sqrt( ∑ (tih - tjh)² )
Manhattan: dis(ti,tj) = ∑ | tih - tjh |
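A sketch of these measures for two equal-length numeric tuples; the helper functions are my own and assume the tuples are given as Python lists.

```python
import math

def dot(ti, tj): return sum(a * b for a, b in zip(ti, tj))
def sq(t):       return sum(a * a for a in t)

def dice(ti, tj):     return 2 * dot(ti, tj) / (sq(ti) + sq(tj))
def jaccard(ti, tj):  return dot(ti, tj) / (sq(ti) + sq(tj) - dot(ti, tj))
def cosine(ti, tj):   return dot(ti, tj) / math.sqrt(sq(ti) * sq(tj))
def overlap(ti, tj):  return dot(ti, tj) / min(sq(ti), sq(tj))

def euclidean(ti, tj): return math.sqrt(sum((a - b) ** 2 for a, b in zip(ti, tj)))
def manhattan(ti, tj): return sum(abs(a - b) for a, b in zip(ti, tj))

ti, tj = [1, 1, 0], [1, 0, 1]
print(dice(ti, tj), cosine(ti, tj), euclidean(ti, tj))
```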
The k-Nearest neighbor algorithm
• The k closest neighbors in the training set to the given tuple are determined
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• The new item is then placed in the class that contains the most items from this set of k closest items
• The value of k can be determined experimentally: starting with k=1, a test set is used to estimate the error rate of the classifier, and the k value that gives the minimum error rate is selected
• k-NN for real-valued prediction returns the mean value of the k nearest neighbors of the given unknown tuple (see the sketch below)
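A minimal k-NN classifier sketch over numeric tuples (the training data, the value of k and the names are hypothetical):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    # train: list of (attribute_vector, class_label) pairs
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]          # majority class among the k nearest

train = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.1], "B"), ([4.8, 5.3], "B")]
print(knn_classify(train, [1.1, 1.0], k=3))    # expected: "A"
```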
The k-Nearest neighbor algorithm
• Distance-weighted nearest neighbor algorithm
gives greater weight to closer neighbors
• Robust to noisy data by averaging k-nearest
neighbors
• The complexity is O(d), where d is the size of the training set; it can be reduced to O(log d) by storing the training set in search trees, and to O(1) by using parallelism
Decision Tree based algorithms
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Swollen Glands?
  No  -> Fever?
           No  -> Diagnosis = Allergy
           Yes -> Diagnosis = Cold
  Yes -> Diagnosis = Strep throat
Example training data: the buys_computer data set shown earlier (age, income, student, credit_rating, buys_computer).
The decision tree induced from this data:

age?
  <=30   -> student?
              no  -> non-customer
              yes -> customer
  31…40  -> customer
  >40    -> credit_rating?
              excellent -> customer
              fair      -> non-customer
Algorithm Generate_decision_tree(D, attribute_list)
  Create a node N
  If the tuples in D are all in the same class C, then return N as a leaf node labeled with class C
  If attribute_list is empty, then return N as a leaf node labeled with the majority class in D
  Apply the attribute selection method to D to get the best splitting attribute, and label node N accordingly
  Split the node depending on the attribute domain
  For each split:
    Let Dj be the set of data tuples in D satisfying outcome j
    If Dj is empty, then attach a leaf labeled with the majority class in D
    Else make a recursive call to Generate_decision_tree for the new node
Decision tree induction:
Construct a DT using training data.
For each ti ∈ D, apply the DT to determine its class.
Advantages:
1. Easy to use and efficient.
2. Rules can be generated that are easy to interpret and understand.
3. They scale well for large databases because the tree size is independent of the database size.
Input: D    // training data
Output: T   // decision tree

DTBuild algorithm:
  T = ∅
  Determine the best splitting criterion
  T = Create root node and label it with the splitting attribute
  T = Add an arc to the root node for each split predicate and label it
  For each arc do
    D = database created by applying the splitting predicate to D
    If the stopping point is reached for this path, then
      T' = Create a leaf node and label it with the appropriate class
    else
      T' = DTBuild(D)
    T' = Add T' to arc
Disadvantages:
1. They do not easily handle continuous data; the attribute domains must be divided into categories to be handled.
Issues faced by DT algorithms
•Choosing splitting attributes - the best splitting criterion is the one for which all tuples in each partition belong to the same class (pure). Some attributes are better than others. The choice of attribute should minimize the expected number of tests needed to classify a given tuple and guarantee a simple tree structure.
•Ordering of splitting attributes - the order in which the attributes are chosen is important. The attributes are ranked by some attribute selection measure, and the attribute with the best score is chosen as the splitting attribute.
•Splits - the number of splits depends on the domain of the attribute.
•Tree structure - a balanced tree with the fewest levels is desirable.
•Stopping criteria - the creation of the tree stops when the training data is perfectly classified. Stopping earlier can prevent overfitting and the generation of large trees.
•Training data - the training data set can give rise to the overfitting problem.
•Pruning - once a tree is constructed, parts of it may be too specific to the training data to work properly with more general data; pruning modifies the tree to address this.

ID3 (Iterative Dichotomiser)
The ID3 technique of building a decision tree is based on information theory and attempts to minimize the expected number of comparisons.
As in the game of Twenty Questions, the idea is to ask the questions that provide the most information.
Entropy is used to measure the amount of uncertainty or surprise in a set of data.
When all data in a set belong to the same class there is no uncertainty - the entropy is zero.
The objective of decision tree classification is to iteratively partition the given data set into subsets where all elements in each final subset belong to the same class (a pure partition).
Defn: Given a data set D and probabilities p1, p2, …, pm with ∑ pi = 1, where pi is the probability that an arbitrary tuple in D belongs to class Ci, the entropy, or expected information needed to classify a tuple in D, is defined as
    H(D) = ∑ pi log(1/pi) = - ∑ pi log(pi)
If selection of an attribute A does not result in pure partitions, then the additional information required in order to arrive at an exact classification is measured as
    Info_A(D) = ∑ |Dj|/|D| × Info(Dj)
The term |Dj|/|D| acts as the weight of the jth partition.
The information gained by branching on attribute A is
    Gain(A) = Info(D) - Info_A(D)
This is the difference between the original information requirement, based on just the proportion of classes, and the new requirement obtained after partitioning on A.
The smaller the expected information, the greater the purity of the partitions.
Select the attribute with the highest information gain.
Consider again the buys_computer training data set shown earlier.
H(D) = Info(D) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940 bits
If the tuples are classified according to attribute age, the expected information required for further classification after partitioning on age is
Info_age(D) = 5/14 × ( -2/5 log2(2/5) - 3/5 log2(3/5) )
            + 4/14 × ( -4/4 log2(4/4) - 0/4 log2(0/4) )
            + 5/14 × ( -3/5 log2(3/5) - 2/5 log2(2/5) )
            = 0.694 bits
Hence the gain in information is
Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246
Similarly the gains for the other attributes can be calculated:
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
Hence attribute age is chosen as the splitting attribute.
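The entropy and gain computation above can be reproduced with a small sketch (the function names are mine; the (age, class) pairs are read off the buys_computer table):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows):
    # rows: list of (attribute_value, class_label) pairs for the attribute under test
    labels = [cls for _, cls in rows]
    total = entropy(labels)                      # Info(D)
    by_value = {}
    for value, cls in rows:
        by_value.setdefault(value, []).append(cls)
    remainder = sum(len(part) / len(rows) * entropy(part) for part in by_value.values())
    return total - remainder                     # Gain(A)

age_rows = [("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
            (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
            ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
            ("31..40", "yes"), (">40", "no")]
print(round(info_gain(age_rows), 3))   # about 0.246, matching the hand calculation
```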
C4.5 is a successor of ID3. It improves on ID3 in the following ways:
Missing data - instead of ignoring missing data, the value is predicted based on what is known about the attributes of other records.
Continuous data - discretized by dividing the data into ranges.
Pruning - with subtree replacement, a subtree is replaced by a leaf node if this replacement results in an error rate close to that of the original tree (bottom-up). With subtree raising, a subtree is replaced by its most used subtree: a subtree is raised to a higher location, depending on the increase in error rate.
Rules - C4.5 generates both the decision tree and a rule set. Some methods are used to simplify rules, such as replacing a rule by a simpler version.
Splitting - the ID3 approach favors attributes with many divisions and thus may lead to overfitting. An improvement can be made by taking into account the cardinality of each division. The GainRatio is used instead of Gain:
    GainRatio(D,S) = Gain(D,S) / H(|D1|/|D|, …, |Ds|/|D|)
C4.5 chooses the largest GainRatio among splits whose information gain is larger than average.
For the attribute income:
    H(4/14, 6/14, 4/14) = -4/14 log2(4/14) - 6/14 log2(6/14) - 4/14 log2(4/14) = 1.557
    GainRatio = Gain(income) / 1.557 = 0.029 / 1.557 = 0.019
CART (Classification And Regression Trees) is a technique that generates a binary decision tree.
Entropy is used as a measure to choose the best splitting attribute.
In ID3 one child is created for each subcategory, whereas here only two children are created.
At each step, an exhaustive search is used to decide the best split, where "best" is defined by
    Φ(s/t) = 2 PL PR ∑ | P(Ci/tL) - P(Ci/tR) |
Here L and R indicate the left and right subtrees, PL and PR are the probabilities that a tuple will be on the left or right side of the tree, and P(Ci/tL) denotes the probability that a tuple is in class Ci and in the left subtree.
A scalable DT technique is SPRINT (Scalable PaRallelizable INduction of decision Trees).
It addresses the scalability issue by adding parallelism.
It uses the gini index to find the best split.
If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 - ∑ pj²
where pj is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as
    gini_split(D) = n1/n gini(D1) + n2/n gini(D2)
The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (all the possible splitting points for each attribute need to be enumerated).
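A sketch of the gini index and the gini of a binary split (the labels and the example split below are made up):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Example: a candidate binary split of 9 "yes" / 5 "no" tuples.
left  = ["yes"] * 7 + ["no"] * 1
right = ["yes"] * 2 + ["no"] * 4
print(round(gini_split(left, right), 3))   # smaller gini_split means a purer split
```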
A Neural Network (NN) is an information processing system in the form of a graph, with many nodes as processing elements (neurons) and arcs as interconnections between them.
A NN can be viewed as a directed graph with source (input), sink (output) and internal (hidden) nodes.
The input nodes exist in the input layer, output nodes in the output layer, and hidden nodes in one or more hidden layers.
During processing, functions at each node are applied to the input data to produce the output.
[Figure: a feed-forward neural network for classifying a person's height. Input nodes f1 (Height) and f2 (Gender) feed hidden nodes f3, f4, f5; these feed hidden nodes f6, f7; which in turn feed output nodes f8 (small), f9 (medium) and f10 (tall). Each arc from node i to node j carries a weight wij (w13, w14, …, w710).]
The output of each node i in the NN is based on the definition of a function fi, called the activation function, associated with i.
There are many choices for activation functions, but they are usually threshold functions, generating output only if the input is above a threshold level.
Linear - produces a linear output value based on the input:
    fi(s) = c s
Threshold or step - the output is 1 or 0, depending on the sum of the products of the input values and their associated weights:
    fi(s) = 1 if s > T, and 0 otherwise
Sigmoid - an S-shaped curve; the commonly used logistic sigmoid gives output values between 0 and 1:
    fi(s) = 1 / (1 + e^(-cs))
Hyperbolic tangent - this function has its output centered at zero:
    fi(s) = (1 - e^(-s)) / (1 + e^(-s))
Nodes in a NN may have an extra input called the bias. This bias value of 1 is input on an arc with a weight of -θ.
The effect of the bias input is to shift the activation function along the x axis by a value of θ.
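The activation functions above can be written out directly; the constants c and T and the small node_output helper are my own illustrative choices.

```python
import math

def linear(s, c=1.0):   return c * s
def step(s, T=0.0):     return 1.0 if s > T else 0.0
def sigmoid(s, c=1.0):  return 1.0 / (1.0 + math.exp(-c * s))               # output in (0, 1)
def tanh_like(s):       return (1.0 - math.exp(-s)) / (1.0 + math.exp(-s))  # centred at zero

def node_output(inputs, weights, bias, f=sigmoid):
    # The bias shifts the activation function along the input axis.
    s = sum(x * w for x, w in zip(inputs, weights)) - bias
    return f(s)

print(node_output([0.2, 0.7], [0.5, -0.3], bias=0.1))
```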
[Figure: a single neuron. Inputs x0, x1, …, xn with weights w0, w1, …, wn feed a weighted sum ∑, a bias term -θk is added, and an activation function f produces the output y.]
The n-dimensional input vector x is mapped into the variable y by means of the scalar product, the bias and a nonlinear function mapping.
Decision tree vs. Neural network
• Inputs: a decision tree has only one input node; a NN has one input node for each attribute.
• Outputs: a decision tree has one or more leaf nodes for each class; a NN has one output node for each class.
• Adaptability: once a decision tree is built it cannot be changed; a NN can be changed to improve future performance - the structure may not change, but the weights (edge labels) may change.
• Interpretability: decision trees are easy to read and understand; NNs are difficult to explain to end users - poor interpretability.
• Data: decision trees can be used with any type of data; NNs usually work with numeric data.
Backpropagation is a neural network learning algorithm.
During learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class labels of
the input tuples
Neural Networks involve long training times
There are several parameters, such as the weights and the number of hidden nodes, which are determined empirically.
It is difficult for human beings to interpret the symbolic meaning behind the learned weights or hidden units, and hence NNs are less desirable for data mining - poor interpretability.
They have high tolerance for noisy data and greater ability to
classify patterns on which they are not trained
They are well-suited for continuous-valued inputs and outputs.
Algorithms are inherently parallel
The backpropagation algorithm performs learning on a
multilayer feed-forward neural network.
The input layer only serves to pass the attribute values to
the next layer.
The network is feed forward in that none of the weights
cycles back to an input unit or to an output unit of a previous
layer.
It is fully connected in that each unit provides input to each
unit in the next forward layer
Defining a network topology
• First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
• Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
• Use one input unit per domain value, each initialized to 0
• One output unit can be used to represent two classes; for classification of more than two classes, one output unit per class is used
• If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Backpropagation
• Iteratively process a set of training tuples &
compare the network's prediction with the actual
known target value
• For each training tuple, the weights are modified to
minimize the mean squared error between the
network's prediction and the actual target value
• Modifications are made in the “backwards”
direction: from the output layer, through each
hidden layer down to the first hidden layer, hence
“backpropagation”
Backpropagation Steps
1. Initialize the weights (to small random numbers) and the biases in the network
2. Propagate the inputs forward: for input units the output equals the input; for hidden or output units, compute the net input (the weighted sum of the inputs plus the bias) and apply the activation function to obtain the output
3. For each output unit and each hidden unit, compute the error
4. Backpropagate the error by updating the weights and biases
5. Repeat steps 2 to 4 until a terminating condition is satisfied (e.g. when the error is very small); a sketch of one such step follows
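As a hedged sketch (the network size, learning rate and weight initialisation are my own choices, not from the slides), one training step of backpropagation for a tiny 2-3-1 feed-forward network with sigmoid units could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(3, 2)), np.zeros(3)   # input layer -> hidden layer
W2, b2 = rng.normal(scale=0.5, size=(1, 3)), np.zeros(1)   # hidden layer -> output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7])   # one training tuple (attributes normalised to [0, 1])
t = np.array([1.0])        # its known target value
lr = 0.5                   # learning rate

# Step 2: propagate the inputs forward
h = sigmoid(W1 @ x + b1)   # hidden layer outputs
y = sigmoid(W2 @ h + b2)   # network prediction

# Step 3: compute the errors (squared-error derivative times the sigmoid derivative)
err_out = (t - y) * y * (1 - y)
err_hid = (W2.T @ err_out) * h * (1 - h)

# Step 4: backpropagate - update weights and biases layer by layer
W2 += lr * np.outer(err_out, h); b2 += lr * err_out
W1 += lr * np.outer(err_hid, x); b1 += lr * err_hid
```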
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
    R: IF age = youth AND student = yes THEN buys_computer = yes
The "IF" part (left-hand side) of the rule is called the rule antecedent or precondition.
The "THEN" part (right-hand side) is called the rule consequent.
Assessment of a rule: coverage and accuracy
    ncovers = number of tuples covered by R
    ncorrect = number of tuples correctly classified by R
    coverage(R) = ncovers / |D|
    accuracy(R) = ncorrect / ncovers
where D is the training data set and |D| is the number of tuples in D.
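A sketch of coverage and accuracy for a rule over a toy training set (the tuples and the rule_condition helper are hypothetical):

```python
def rule_condition(t):
    # Antecedent of R: age = youth AND student = yes
    return t["age"] == "youth" and t["student"] == "yes"

def rule_stats(data, condition, predicted_class, class_attr="buys_computer"):
    covered = [t for t in data if condition(t)]
    correct = [t for t in covered if t[class_attr] == predicted_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

data = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
print(rule_stats(data, rule_condition, "yes"))   # (coverage, accuracy)
```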
If a rule R is satisfied by a tuple X, then the rule R is said to be triggered by X.
If only one rule is triggered, the rule fires by returning its class prediction.
If more than one rule is triggered, conflict resolution is needed:
Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., the most attribute tests).
Class-based ordering: decreasing order of prevalence or misclassification cost per class.
Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts.
• Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees
• One rule is created for each path from the root to
a leaf
• Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
age?
  <=30   -> student?
              no  -> non-customer
              yes -> customer
  31…40  -> customer
  >40    -> credit_rating?
              excellent -> customer
              fair      -> non-customer
• Example: rule extraction from our buys_computer decision tree
    IF age = young AND student = no THEN buys_computer = no
    IF age = young AND student = yes THEN buys_computer = yes
    IF age = mid-age THEN buys_computer = yes
    IF age = old AND credit_rating = excellent THEN buys_computer = yes
    IF age = old AND credit_rating = fair THEN buys_computer = no
Rule extraction from networks:
Network pruning
•Simplify the network structure by removing weighted
links that have the least effect on the trained network
•Then perform link, unit, or activation value clustering
•The set of input and activation values are studied to
derive rules describing the relationship between the
input and hidden unit layers
Sensitivity analysis: assess the impact that a given input
variable has on a network output. The knowledge gained
from this analysis can be represented in rules
Rule Extraction from a Training Data Set
• Sequential covering algorithm: extracts rules directly from the training data
• Rules are learned sequentially; each rule for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes
• Steps:
  – Rules are learned one at a time
  – Each time a rule is learned, the tuples covered by the rule are removed
  – The process repeats on the remaining tuples until a termination condition is reached, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Combining Techniques
Given a classification problem, no single classification technique always yields the best results.
•A synthesis of approaches takes multiple techniques and blends them into a new approach.
Example: use linear regression to predict missing values, which are then used as input to a NN.
•CMC (Combination of Multiple Classifiers): multiple independent approaches are applied, each yielding its own class prediction. The results are then compared or combined in some manner.