PES Institute of Technology
Bangalore South Campus
(1 km before Electronic City, Bangalore 560100)
Department of MCA
Test-II Solution Set
Sub: Data Warehousing and Data Mining
Sem & Section: V
Name of the Faculty: Manjulaprasad
Date: 30/09/2014
Duration: 90 min
Max marks: 50
Note: 1. Answer any five full questions.
2. Draw neat diagrams wherever necessary.
1. a) What is a decision tree? Write the skeleton decision tree induction algorithm. (05 marks)
A decision tree is a widely used classification technique.
The tree has three types of nodes:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
Algorithm: A skeleton decision tree induction algorithm.
TreeGrowth(E, F)
1.  if stopping_cond(E, F) = true then
2.    leaf = createNode()
3.    leaf.label = Classify(E)
4.    return leaf
5.  else
6.    root = createNode()
7.    root.test_cond = find_best_split(E, F)
8.    let V = {v | v is a possible outcome of root.test_cond}
9.    for each v ∈ V do
10.     Ev = {e | root.test_cond(e) = v and e ∈ E}
11.     child = TreeGrowth(Ev, F)
12.     add child as a descendant of root and label the edge (root → child) as v
13.   end for
14. end if
15. return root
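For illustration (not part of the required answer), a minimal Python sketch of the same skeleton follows. The stopping condition, the majority-class Classify, and a Gini-based find_best_split are simple assumed choices; E is a list of (attribute-dictionary, label) records and F is a list of attribute names.

from collections import Counter

class Node:
    def __init__(self, label=None, test_attr=None):
        self.label = label          # class label at leaf nodes
        self.test_attr = test_attr  # attribute tested at internal nodes
        self.children = {}          # outcome value -> child node

def classify_majority(E):
    # Classify(E): majority class among the records in E
    return Counter(y for _, y in E).most_common(1)[0][0]

def stopping_cond(E, F):
    # stop when no attributes remain or all records share one class
    return not F or len({y for _, y in E}) == 1

def gini_of_split(E, attr):
    # weighted Gini index of the partition induced by attr
    groups = {}
    for x, y in E:
        groups.setdefault(x[attr], []).append(y)
    total, gini = len(E), 0.0
    for labels in groups.values():
        counts = Counter(labels)
        g = 1.0 - sum((c / len(labels)) ** 2 for c in counts.values())
        gini += (len(labels) / total) * g
    return gini

def find_best_split(E, F):
    return min(F, key=lambda a: gini_of_split(E, a))

def tree_growth(E, F):
    if stopping_cond(E, F):
        return Node(label=classify_majority(E))
    attr = find_best_split(E, F)
    root = Node(test_attr=attr)
    for v in {x[attr] for x, _ in E}:
        Ev = [(x, y) for x, y in E if x[attr] == v]
        # the used attribute is dropped here so the recursion terminates
        root.children[v] = tree_growth(Ev, [f for f in F if f != attr])
    return root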
b) Build a decision tree for the below training set using Hunt's algorithm. (05 marks)
2. a) Find the Gini index and entropy for node N1. (05 marks)
Node N1:
Class=0: count = 1
Class=1: count = 5
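For reference, a worked computation from the counts above (this step is not shown in the original sheet): with p(Class=0) = 1/6 and p(Class=1) = 5/6,
Gini(N1) = 1 − (1/6)^2 − (5/6)^2 = 1 − 26/36 = 10/36 ≈ 0.278
Entropy(N1) = −(1/6) log2(1/6) − (5/6) log2(5/6) ≈ 0.431 + 0.219 ≈ 0.650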
b) Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules,
R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples),
determine which is the best and the worst rule using FOIL's information gain or the likelihood ratio statistic. (05 marks)
FOIL's information gain: R1 = 8, R2 = 57.2, R3 = 139.6.
Therefore, R3 is the best candidate and R1 is the worst candidate according to FOIL's information gain.
Likelihood ratio statistic: R1 = 12, R2 = 80.85, R3 = 143.09.
Therefore, R3 is the best candidate and R1 is the worst candidate according to the likelihood ratio statistic.
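For reference, the computation behind these numbers, using R1 as an example: with p0 = 100 positive and n0 = 400 negative examples initially, FOIL's information gain for a rule covering p1 positives and n1 negatives is p1 × [log2(p1/(p1+n1)) − log2(p0/(p0+n0))]; for R1 this is 4 × [log2(4/5) − log2(100/500)] = 4 × (−0.322 + 2.322) = 8. The likelihood ratio statistic is 2 Σ fi log2(fi/ei); R1 covers 5 examples, so the expected counts are e+ = 5 × (100/500) = 1 and e− = 5 × (400/500) = 4, giving 2 × [4 log2(4/1) + 1 log2(1/4)] = 2 × (8 − 2) = 12.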
3. Explain any one direct and one indirect method for rule extraction, with an example, by considering any data set. (10 marks)
• The sequential covering algorithm is often used to extract rules directly from data.
• The algorithm extracts the rules one class at a time for data sets that contain more than two classes.
Sequential Covering Algorithm:
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
Steps followed to build a rule, in detail:
1. Learn-One-Rule function:
o The objective of the Learn-One-Rule function is to extract a classification rule that covers many of the positive examples and none (or very few) of the negative examples in the training set. However, finding an optimal rule is computationally expensive given the exponential size of the search space.
2. Rule-growing strategy:
o There are two common strategies for growing a classification rule: general-to-specific or specific-to-general.
o Under the general-to-specific strategy, an initial rule r: {} → y is created, where the left-hand side is an empty set and the right-hand side contains the target class. The rule has poor quality because it covers all the examples in the training set. New conjuncts are subsequently added to improve the rule's quality.
3. Rule evaluation:
o An evaluation metric is needed to determine which conjunct should be added (or removed) during the rule-growing process.
4. Rule pruning:
o The rules generated by the Learn-One-Rule function can be pruned to improve their generalization error.
Ex: If the error on the validation set decreases after pruning, we should keep the simplified rule.
5. Rationale for sequential covering:
o After a rule is extracted, the sequential covering algorithm must eliminate all the positive and negative examples covered by the rule.
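As an illustration only, a minimal Python sketch of this loop follows; it assumes a very simple general-to-specific Learn-One-Rule that greedily adds one (attribute = value) conjunct at a time and scores candidate rules by their accuracy on the remaining records.

def covers(rule, record):
    # rule: dict mapping attribute -> required value; the empty rule covers everything
    return all(record[a] == v for a, v in rule.items())

def rule_accuracy(rule, data, target_class):
    covered = [(x, y) for x, y in data if covers(rule, x)]
    if not covered:
        return 0.0
    return sum(1 for _, y in covered if y == target_class) / len(covered)

def learn_one_rule(data, attributes, target_class, max_conjuncts=3):
    # general-to-specific growth: start from the empty rule {} -> target_class
    rule = {}
    for _ in range(max_conjuncts):
        candidates = {(a, x[a]) for x, _ in data for a in attributes if a not in rule}
        if not candidates:
            break
        best = max(candidates,
                   key=lambda av: rule_accuracy({**rule, av[0]: av[1]}, data, target_class))
        new_rule = {**rule, best[0]: best[1]}
        if rule_accuracy(new_rule, data, target_class) <= rule_accuracy(rule, data, target_class):
            break   # adding the conjunct no longer improves the rule
        rule = new_rule
    return rule

def sequential_covering(data, attributes, target_class):
    rules, remaining = [], list(data)
    # grow a rule, then remove the training records it covers, and repeat
    while any(y == target_class for _, y in remaining):
        rule = learn_one_rule(remaining, attributes, target_class)
        if not rule:
            break
        rules.append((rule, target_class))
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return rules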
Indirect Method
It is a method for generating a rule set from a decision tree. In principle, every path from the root node to a leaf node of a decision tree can be expressed as a classification rule.
[Figure: decision tree with root node P; the P=No branch leads to a node testing Q (leaves − and +), and the P=Yes branch leads to a node testing R, whose R=No branch is a + leaf and whose R=Yes branch leads to another node testing Q (leaves − and +).]
Rule set:
r1: (P=No, Q=No) ==> −
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> −
r5: (P=Yes, R=Yes, Q=Yes) ==> +
4. a) Explain the K-nearest neighbor classification algorithm with an example. (05 marks)
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples in the training set nearest to E.
• Assign E to the most common class among its K nearest neighbors.
K-nearest neighbor classification algorithm:
• Let K be the number of nearest neighbors and D be the set of training examples.
• For each test example z = (x', y') do:
o Compute d(x', x), the distance between z and every example (x, y) ∈ D.
o Select Dz ⊆ D, the set of the K training examples closest to z.
o y' = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi), i.e., the majority class among the K neighbors.
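For illustration, a minimal Python sketch of this procedure, assuming numeric feature vectors and Euclidean distance:

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train, z, k=3):
    # train: list of (feature_vector, label); z: feature vector of the test example
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], z))[:k]
    # majority vote among the k nearest neighbors
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# tiny hypothetical data set
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((6.0, 6.5), 'B'), ((5.8, 6.1), 'B')]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> 'A'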
b) Write a note on Web Robot Detection. (05 marks)
o Web usage mining is the task of applying data mining techniques to extract useful
patterns from Web access logs.
o These patterns can reveal interesting characteristics of site visitors; e.g., people who
repeatedly visit a Web site and view the same product description page are more likely
to
buy the product if certain incentives such as rebates or free shipping are offered.
o In Web usage mining, it is important to distinguish accesses made by human users from
those due to Web robots.
o A Web robot (also known as a Web crawler) is a software program that automatically
locates and retrieves information from the Internet by following the hyperlinks
embedded in Web pages.
Decision Tree Classification can be used to distinguish between accesses by human
users and those by web Robots.
o To classify the Web sessions, features are constructed to describe the characteristics of each session.
o Some of the features used for the Web robot detection task are depth and breadth.
o Depth determines the maximum distance of a requested page, where distance is measured in terms of the number of hyperlinks away from the entry point of the Web site.
o For example, the home page http://www.cs.umn.edu/kumar is assumed to be at depth 0, whereas http://www.cs.umn.edu/kumar/MINDS/MINDS_papers.htm is located at depth 2.
o The breadth attribute measures the width of the corresponding Web graph.
o Here, sessions produced by Web robots belong to class 1, and sessions produced by human users belong to class 2.
The model suggests that Web robots can be distinguished from human users in the
following way:
1. Accesses by Web robots tend to be broad but shallow, whereas accesses by human
users tend to be more focused (narrow but deep).
2. Unlike human users, Web robots seldom retrieve the image pages associated with a
Web document.
3. Sessions due to Web robots tend to be long and contain a large number of
requested pages.
4. Web robots are more likely to make repeated requests for the same document, since the Web pages retrieved by human users are often cached by the browser.
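One plausible way to compute such session features (illustrative only; the transcript does not specify how depth and breadth are derived) is to treat each session as a small graph of (referrer, requested page) pairs and run a breadth-first search from the entry page:

from collections import deque, defaultdict

def session_depth_breadth(requests):
    # requests: list of (referrer, page) pairs for one session; referrer is None
    # for the entry point. Depth of a page = number of in-session hyperlink hops
    # from the entry point; breadth = largest number of distinct pages at one depth.
    children = defaultdict(set)
    entry = None
    for ref, page in requests:
        if ref is None:
            entry = page
        else:
            children[ref].add(page)
    depth_of = {entry: 0}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for nxt in children[page]:
            if nxt not in depth_of:
                depth_of[nxt] = depth_of[page] + 1
                queue.append(nxt)
    per_level = defaultdict(int)
    for d in depth_of.values():
        per_level[d] += 1
    return max(depth_of.values()), max(per_level.values())

# hypothetical session: the entry page links to two pages, one of which links deeper
session = [(None, "/"), ("/", "/a"), ("/", "/b"), ("/a", "/a/x")]
print(session_depth_breadth(session))   # -> (2, 2)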
5. a) Explain the Naïve Bayes classifier with an example. (06 marks)
Steps involved:
1. Each data sample is of the type X = (x1, x2, ..., xn), where xi is the value of X for attribute Ai.
2. Suppose there are m classes Ci, i = 1, ..., m. X is assigned to class Ci iff
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i,
i.e., the Bayesian classifier assigns X to the class Ci having the highest posterior probability conditioned on X. The class for which P(Ci|X) is maximized is called the maximum posterior hypothesis. From Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. P(X) is the same for all classes, so only P(X|Ci) P(Ci) need be maximized.
• If the class prior probabilities are not known, assume all classes to be equally likely.
• Otherwise, maximize P(X|Ci) P(Ci) with P(Ci) estimated as si/s, where si is the number of training samples in class Ci and s is the total number of training samples.
Problem: computing P(X|Ci) directly is infeasible, since it would require estimating the probability of every possible combination of attribute values for each class.
4. Naïve assumption: class-conditional attribute independence, so
P(X|Ci) = P(x1, ..., xn | Ci) = Π_k P(xk | Ci).
5. To classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to class Ci iff
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.
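For illustration, a minimal Python sketch of a Naïve Bayes classifier for categorical attributes only (relative-frequency estimates, no Laplace smoothing); this is an assumed simplification, not the full treatment:

from collections import Counter, defaultdict

def train_naive_bayes(data):
    # data: list of (attribute_dict, class_label) with categorical attributes
    priors = Counter(y for _, y in data)                 # class counts
    cond = defaultdict(Counter)                          # (class, attribute) -> value counts
    for x, y in data:
        for a, v in x.items():
            cond[(y, a)][v] += 1
    return priors, cond, len(data)

def nb_classify(priors, cond, n, x):
    best_class, best_score = None, -1.0
    for c, count_c in priors.items():
        score = count_c / n                              # P(Ci)
        for a, v in x.items():
            score *= cond[(c, a)][v] / count_c           # P(xk | Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class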
b) What are the characteristics of decision tree induction? (04 marks)
1. Decision tree induction is a non-parametric approach for building classification models.
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space.
3. Techniques developed for constructing decision trees are computationally inexpensive,
making it possible to quickly construct models even when the training set size is very
large.
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The
accuracies of the trees are also comparable to other classification techniques for many
simple data sets.
5. Decision trees provide an expressive representation for learning discrete- valued
functions.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision
trees.
6. What are the different methods used for estimating the predictive accuracy of a classification method? Explain. (10 marks)
• The accuracy of a classification method is the ability of the method to correctly determine the class of a randomly selected data instance.
• It may be expressed as the probability of correctly classifying unseen data.
• Estimating the accuracy of a supervised classification method can be difficult if only the training data is available and all of that data has been used in building the model.
• Accuracy may be measured using a number of metrics, including sensitivity, specificity, precision, and accuracy.
• The methods for estimating errors include: holdout, random sub-sampling, cross-validation, and leave-one-out.
Holdout Method:
o The holdout method requires a training set and a test set.
o The sets are mutually exclusive.
o A larger training set would produce a better classifier, while a larger test set would produce a better estimate of the accuracy. A balance must be achieved.
Random Sub-sampling Method:
o Random sub-sampling is very much like the holdout method except that it does not rely on a single test set.
o The holdout estimation is repeated several times and the accuracy estimate is obtained by computing the mean of the several trials.
o Random sub-sampling is likely to produce better error estimates than those obtained by the holdout method.
K-fold Cross-validation Method:
o In K-fold cross-validation, the available data is randomly divided into k disjoint subsets of approximately equal size.
o One of the subsets is then used as the test set and the remaining k − 1 sets are used for building the classifier.
o The test set is then used to estimate the accuracy.
o This is done repeatedly k times so that each subset is used as a test subset once.
o Cross-validation has been tested extensively and has been found to generally work well when sufficient data is available.
Leave-one-out Method:
o Leave-one-out is a simpler version of K-fold cross-validation.
o In this method, one of the training samples is taken out and the model is generated using the remaining training data.
o Once the model is built, the one remaining sample is used for testing and the result is coded as 1 or 0 depending on whether it was classified correctly or not.
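For illustration, a minimal Python sketch of K-fold cross-validation, assuming hypothetical train_fn and classify_fn callables supplied by the caller:

import random

def k_fold_cross_validation(data, k, train_fn, classify_fn):
    # data: list of (features, label); train_fn builds a model from a training list;
    # classify_fn(model, features) returns a predicted label
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]      # k disjoint subsets
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        model = train_fn(train)
        correct = sum(1 for x, y in test if classify_fn(model, x) == y)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k                  # mean accuracy over the k folds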
7. Explain the algorithm for generating the topology of a Bayesian network, with an example. (10 marks)
• Let X = (x1, x2, ..., xn) be a tuple described by variables or attributes Y1, Y2, ..., Yn respectively. Each variable is conditionally independent of its nondescendants given its parents.
• This allows the DAG to provide a complete representation of the existing joint probability distribution by
P(x1, x2, ..., xn) = Π_i P(xi | Parents(Yi)),
where P(x1, x2, ..., xn) is the probability of a particular combination of values of X, and the values P(xi | Parents(Yi)) correspond to the entries in the conditional probability table (CPT) for Yi.
• A node within the network can be selected as an 'output' node, representing a class-label attribute; there may be more than one output node.
• Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class.
Training a Bayesian belief network:
• A Bayesian network specifies a joint distribution in a structured form.
• It represents dependence/independence via a directed graph:
o Nodes = random variables
o Edges = direct dependence
• The structure of the graph encodes the conditional independence relations.
• The graph is required to be acyclic (no directed cycles).
• A Bayesian network has two components:
o the graph structure (conditional independence assumptions), and
o the numerical probabilities (for each variable given its parents).
• The probability model therefore has a simple factored form: directed edges => direct dependence; absence of an edge => conditional independence.
• Bayesian networks are also known as belief networks, graphical models, or causal networks; other formulations exist, e.g., undirected graphical models.
• Consider the following 5 binary variables:
o B = a burglary occurs at your house
o E = an earthquake occurs at your house
o A = the alarm goes off
o J = John calls to report the alarm
o M = Mary calls to report the alarm
• What is P(B | M, J), for example?
• We can use the full joint distribution to answer this question, but it requires 2^5 = 32 probabilities.
• Instead, order the variables in terms of causality (the order may be partial), e.g., {E, B} -> {A} -> {J, M}. Then
P(J, M, A, E, B) = P(J, M | A, E, B) P(A | E, B) P(E, B)
                 ≈ P(J, M | A) P(A | E, B) P(E) P(B)
                 ≈ P(J | A) P(M | A) P(A | E, B) P(E) P(B)
These conditional independence assumptions are reflected in the graph structure of the Bayesian network.
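To make the factorization concrete, a small Python sketch follows; the CPT numbers are assumed for illustration (the transcript gives none), and P(B | J, M) is obtained by summing the factored joint over the hidden variables:

from itertools import product

# illustrative CPTs (assumed values)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=true | A)

def joint(b, e, a, j, m):
    # factored joint: P(B) P(E) P(A|B,E) P(J|A) P(M|A)
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

def prob_B_given_JM():
    # P(B=true | J=true, M=true): sum over hidden E, A divided by sum over B, E, A
    num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
    den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
    return num / den

print(round(prob_B_given_JM(), 3))   # roughly 0.28 with these assumed numbers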
8. Consider the following training data (Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Evade is the class label):

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
By considering the above training data and the given test record X = (Refund=No, Marital Status=Married, Income=120K), find the class using the Naïve Bayes classifier. (10 marks)
P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                 = 1 × 0 × 1.2 × 10^-9 = 0
Since P(X|No) P(No) > P(X|Yes) P(Yes), it follows that P(No|X) > P(Yes|X), so the predicted class is No.
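For reference (this step is not shown above): the continuous Taxable Income attribute is handled with a class-conditional Gaussian. For Class=No, the seven income values (125, 100, 70, 120, 60, 220, 75, in thousands) have sample mean 110 and sample variance 2975, so P(Income=120K | No) = (1/√(2π·2975)) · exp(−(120−110)²/(2·2975)) ≈ 0.0072, which is the factor used above; the analogous Gaussian for Class=Yes (mean 90, variance 25) gives the 1.2 × 10^-9 factor.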