PES Institute of Technology, Bangalore South Campus
(1 km before Electronic City, Bangalore 560100)
Department of MCA

Test-II Solution Set
Sub: Data Warehousing and Data Mining    Sem & Section: V
Name of the Faculty: Manjulaprasad       Date: 30/09/2014
Duration: 90 min                         Max marks: 50
Note: 1. Answer any five full questions. 2. Draw neat diagrams for all necessary questions.

1. a) What is a decision tree? Write the skeleton decision tree induction algorithm. (05 marks)

A decision tree is a widely used classification technique. The tree has three types of nodes:
- A root node, which has no incoming edges and zero or more outgoing edges.
- Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
- Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.

Algorithm: a skeleton decision tree induction algorithm (a runnable sketch follows question 2 below).

TreeGrowth(E, F):
  if stopping_cond(E, F) = true then
    leaf = createNode()
    leaf.label = Classify(E)
    return leaf
  else
    root = createNode()
    root.test_cond = find_best_split(E, F)
    let V = {v | v is a possible outcome of root.test_cond}
    for each v ∈ V do
      Ev = {e | root.test_cond(e) = v and e ∈ E}
      child = TreeGrowth(Ev, F)
      add child as a descendant of root and label the edge (root → child) as v
    end for
  end if
  return root

b) Build a decision tree for the given training set using Hunt's algorithm. (05 marks)

2. a) Find the Gini index and entropy for node N1. (05 marks)

Node N1:  Class=0 count = 1, Class=1 count = 5

Gini(N1) = 1 − (1/6)² − (5/6)² = 10/36 ≈ 0.278
Entropy(N1) = −(1/6) log₂(1/6) − (5/6) log₂(5/6) ≈ 0.650

b) Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules:
R1: A → + (covers 4 positive and 1 negative example)
R2: B → + (covers 30 positive and 10 negative examples)
R3: C → + (covers 100 positive and 90 negative examples)
determine which is the best and the worst rule using FOIL's information gain or the likelihood ratio statistic. (05 marks)

FOIL's information gain: R1 = 8, R2 = 57.2, R3 = 139.6.
Therefore, R3 is the best candidate and R1 is the worst candidate according to FOIL's information gain.

Likelihood ratio statistic: R1 = 12, R2 = 80.85, R3 = 143.09.
Therefore, R3 is the best candidate and R1 is the worst candidate according to the likelihood ratio statistic.
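To complement the skeleton in Q1(a), here is a minimal runnable Python rendering of TreeGrowth. The helper functions are illustrative assumptions, not part of the original pseudocode: stopping_cond stops on pure nodes or exhausted attributes, find_best_split minimizes weighted Gini over categorical attributes, and Classify returns the majority class.

from collections import Counter

class Node:
    def __init__(self, label=None, test_attr=None):
        self.label = label          # class label (leaf nodes only)
        self.test_attr = test_attr  # attribute tested at this node
        self.children = {}          # outcome value -> child Node

def gini(records):
    # Gini impurity of a set of (attribute_dict, label) records
    n = len(records)
    counts = Counter(label for _, label in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def classify(records):
    # Classify(E) in the skeleton: the majority class of the records
    return Counter(label for _, label in records).most_common(1)[0][0]

def stopping_cond(records, attrs):
    # stop when no attributes remain or all records share one class
    return not attrs or len({label for _, label in records}) == 1

def find_best_split(records, attrs):
    # pick the attribute whose partition minimizes the weighted Gini
    def weighted_gini(attr):
        parts = {}
        for rec in records:
            parts.setdefault(rec[0][attr], []).append(rec)
        return sum(len(p) / len(records) * gini(p) for p in parts.values())
    return min(attrs, key=weighted_gini)

def tree_growth(records, attrs):
    if stopping_cond(records, attrs):
        return Node(label=classify(records))
    attr = find_best_split(records, attrs)
    root = Node(test_attr=attr)
    values = {rec[0][attr] for rec in records}      # V = possible outcomes
    for v in values:                                # for each v in V
        ev = [rec for rec in records if rec[0][attr] == v]
        root.children[v] = tree_growth(ev, [a for a in attrs if a != attr])
    return root

For example, tree_growth([({"Outlook": "Sunny"}, "No"), ({"Outlook": "Rain"}, "Yes")], ["Outlook"]) returns a root that tests Outlook with two leaf children.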
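The Q2 figures can be reproduced with a short script. This is a sketch using the standard definitions (base-2 logarithms throughout), not code from the original solution set.

from math import log2

# Q2(a): impurity of node N1 with class counts 1 and 5
counts = [1, 5]
n = sum(counts)
gini = 1 - sum((c / n) ** 2 for c in counts)              # ~0.278
entropy = -sum((c / n) * log2(c / n) for c in counts)     # ~0.650

# Q2(b): the training set has P = 100 positive, N = 400 negative examples
P, N = 100, 400

def foil_gain(p, n):
    # p1 * (log2(p1/(p1+n1)) - log2(P/(P+N)))
    return p * (log2(p / (p + n)) - log2(P / (P + N)))

def likelihood_ratio(p, n):
    # 2 * sum_i f_i * log2(f_i / e_i), where e_i is the count expected
    # under the class priors for the k = p + n examples the rule covers
    k = p + n
    ep, en = k * P / (P + N), k * N / (P + N)
    return 2 * (p * log2(p / ep) + n * log2(n / en))

for name, (p, n) in {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}.items():
    print(name, round(foil_gain(p, n), 1), round(likelihood_ratio(p, n), 2))
# R1: 8.0, 12.0   R2: 57.2, 80.85   R3: 139.6, ~143.1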
3. Explain any one direct and one indirect method for rule extraction, with an example based on any data set. (10 marks)

Direct Method
The sequential covering algorithm is often used to extract rules directly from data. The algorithm extracts the rules one class at a time for data sets that contain more than two classes.

Sequential covering algorithm:
1. Start from an empty rule.
2. Grow a rule using the Learn-One-Rule function.
3. Remove the training records covered by the rule.
4. Repeat steps (2) and (3) until the stopping criterion is met.

Steps followed to build a rule, in detail:
1. Learn-One-Rule function: the objective of the Learn-One-Rule function is to extract a classification rule that covers many of the positive examples and none (or very few) of the negative examples in the training set. However, finding an optimal rule is computationally expensive given the exponential size of the search space.
2. Rule-growing strategy: there are two common strategies for growing a classification rule, general-to-specific and specific-to-general. Under the general-to-specific strategy, an initial rule r: {} → y is created, where the left-hand side is an empty set and the right-hand side contains the target class. The rule has poor quality because it covers all the examples in the training set. New conjuncts are subsequently added to improve the rule's quality.
3. Rule evaluation: an evaluation metric is needed to determine which conjunct should be added (or removed) during the rule-growing process.
4. Rule pruning: the rules generated by the Learn-One-Rule function can be pruned to improve their generalization error. For example, if the error on a validation set decreases after pruning, we should keep the simplified rule.
5. Rationale for sequential covering: after a rule is extracted, the sequential covering algorithm must eliminate all the positive and negative examples covered by the rule.

Indirect Method
This is a method for generating a rule set from a decision tree. In principle, every path from the root node to a leaf node of a decision tree can be expressed as a classification rule. In the example tree, the root tests P; the P=No branch leads to a test on Q (Q=No → −, Q=Yes → +), and the P=Yes branch leads to a test on R (R=No → +, R=Yes → a further test on Q, with Q=No → − and Q=Yes → +). The resulting rule set is (a sketch that mechanically extracts these rules follows):

r1: (P=No, Q=No) ==> −
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> −
r5: (P=Yes, R=Yes, Q=Yes) ==> +
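To illustrate the indirect method concretely, the sketch below walks the example tree above and emits one rule per root-to-leaf path. The nested-tuple encoding of the tree is an assumed representation chosen for brevity, not from the original answer.

# Decision tree from the Q3 example:
# internal node = (attribute, {value: subtree}); leaf = class label string
tree = ("P", {
    "No":  ("Q", {"No": "-", "Yes": "+"}),
    "Yes": ("R", {"No": "+",
                  "Yes": ("Q", {"No": "-", "Yes": "+"})}),
})

def extract_rules(node, conditions=()):
    # each root-to-leaf path becomes one classification rule
    if isinstance(node, str):                  # leaf: yield the finished rule
        yield conditions, node
        return
    attr, branches = node
    for value, child in branches.items():
        yield from extract_rules(child, conditions + ((attr, value),))

for conds, label in extract_rules(tree):
    lhs = ",".join(f"{a}={v}" for a, v in conds)
    print(f"({lhs}) ==> {label}")
# prints r1..r5: (P=No,Q=No) ==> -  ...  (P=Yes,R=Yes,Q=Yes) ==> +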
4. a) Explain the K-nearest neighbor classification algorithm with an example. (05 marks)

To determine the class of a new example E:
- Calculate the distance between E and all examples in the training set.
- Select the K nearest examples to E in the training set.
- Assign E to the most common class among its K nearest neighbors.

K-nearest neighbor classification algorithm (a runnable sketch follows question 5 below):
Let k be the number of nearest neighbors and D be the set of training examples.
For each test example z = (x', y') do:
  Compute d(x', x), the distance between z and every example (x, y) ∈ D.
  Select Dz ⊆ D, the set of the k training examples closest to z.
  y' = argmax over v of Σ_{(xi, yi) ∈ Dz} I(v = yi), i.e., the majority class among the k neighbors.

b) Write a note on Web robot detection. (05 marks)

- Web usage mining is the task of applying data mining techniques to extract useful patterns from Web access logs.
- These patterns can reveal interesting characteristics of site visitors; e.g., people who repeatedly visit a Web site and view the same product description page are more likely to buy the product if certain incentives such as rebates or free shipping are offered.
- In Web usage mining, it is important to distinguish accesses made by human users from those due to Web robots.
- A Web robot (also known as a Web crawler) is a software program that automatically locates and retrieves information from the Internet by following the hyperlinks embedded in Web pages.
- Decision tree classification can be used to distinguish between accesses by human users and those by Web robots.
- To classify the Web sessions, features are constructed to describe the characteristics of each session. Two of the features used for the Web robot detection task are depth and breadth.
- Depth measures the maximum distance of a requested page, where distance is the number of hyperlinks away from the entry point of the Web site. For example, the home page http://www.cs.umn.edu/kumar is assumed to be at depth 0, whereas http://www.cs.umn.edu/kumar/MINDS/MINDS_papers.htm is located at depth 2.
- The breadth attribute measures the width of the corresponding Web graph.
- Here, sessions due to Web robots belong to class 1 and sessions due to human users belong to class 2.

The model suggests that Web robots can be distinguished from human users in the following way:
1. Accesses by Web robots tend to be broad but shallow, whereas accesses by human users tend to be more focused (narrow but deep).
2. Unlike human users, Web robots seldom retrieve the image pages associated with a Web document.
3. Sessions due to Web robots tend to be long and contain a large number of requested pages.
4. Web robots are more likely to make repeated requests for the same document, since the Web pages retrieved by human users are often cached by the browser.

5. a) Explain the naïve Bayes classifier with an example. (06 marks)

Steps involved (a minimal code sketch follows part (b) below):
1. Each data sample is of the form X = (x1, x2, …, xn), where xi is the value of X for attribute Ai.
2. Suppose there are m classes C1, …, Cm. The classifier assigns X to class Ci iff P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, i.e., to the class with the highest posterior probability conditioned on X. The class for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis.
3. By Bayes' theorem, P(Ci|X) = P(X|Ci)P(Ci)/P(X). Since P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, assume all classes are equally likely; otherwise estimate P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples. Computing P(X|Ci) directly is infeasible, since it would require estimating the joint distribution of all attribute values within each class.
4. Naïve assumption (attribute independence): P(X|Ci) = P(x1, …, xn|Ci) = Π_{k=1}^{n} P(xk|Ci).
5. To classify an unknown sample X, evaluate P(X|Ci)P(Ci) for each class Ci. Sample X is assigned to class Ci iff P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.

b) What are the characteristics of decision tree induction? (04 marks)

1. Decision tree induction is a non-parametric approach for building classification models.
2. Finding an optimal decision tree is an NP-complete problem. Many decision tree algorithms therefore employ a heuristic-based approach to guide their search in the vast hypothesis space.
3. Techniques developed for constructing decision trees are computationally inexpensive, making it possible to quickly construct models even when the training set size is very large.
4. Decision trees, especially smaller-sized trees, are relatively easy to interpret. The accuracies of the trees are also comparable to those of other classification techniques for many simple data sets.
5. Decision trees provide an expressive representation for learning discrete-valued functions.
6. Decision tree algorithms are quite robust to the presence of noise, especially when methods for avoiding overfitting are employed.
7. The presence of redundant attributes does not adversely affect the accuracy of decision trees.
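Returning to Q4(a): here is a compact Python rendering of the K-nearest-neighbor procedure, assuming numeric feature vectors, Euclidean distance, and unweighted majority voting. It is a sketch under those assumptions, not the original answer.

from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_classify(train, z, k=3):
    """train: list of (feature_tuple, label); z: feature tuple to classify."""
    # 1. compute d(z, x) for every training example (x, y)
    # 2. select Dz, the k training examples closest to z
    dz = sorted(train, key=lambda xy: dist(xy[0], z))[:k]
    # 3. y' = majority class among the k neighbors
    return Counter(y for _, y in dz).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((6.0, 6.5), "B"), ((5.8, 6.1), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))   # -> "A"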
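And for Q5(a), a minimal categorical naïve Bayes sketch following the maximize-P(X|Ci)P(Ci) recipe above. It uses raw frequency estimates with no smoothing, as an illustration under those assumptions.

from collections import Counter, defaultdict

def train_nb(data):
    """data: list of (attribute_dict, class_label). Returns class priors and
    per-class value counts for estimating P(attr = value | class)."""
    priors = Counter(label for _, label in data)
    cond = defaultdict(Counter)               # (class, attr) -> value counts
    for x, c in data:
        for attr, value in x.items():
            cond[(c, attr)][value] += 1
    return priors, cond, len(data)

def classify_nb(x, priors, cond, n):
    # assign x to the class maximizing P(Ci) * prod_k P(x_k | Ci)
    def score(c):
        p = priors[c] / n                      # P(Ci) = si / s
        for attr, value in x.items():
            p *= cond[(c, attr)][value] / priors[c]
        return p
    return max(priors, key=score)

Trained on the question 8 table below (with the continuous income attribute dropped), this yields, e.g., P(Refund=No | Class=No) = 4/7, matching the fraction used in that solution.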
6. What are the different methods used for estimating the predictive accuracy of a classification method? Explain. (10 marks)

The accuracy of a classification method is the ability of the method to correctly determine the class of a randomly selected data instance. It may be expressed as the probability of correctly classifying unseen data. Estimating the accuracy of a supervised classification method can be difficult if only the training data is available and all of that data has been used in building the model.

Accuracy may be measured using a number of metrics, including sensitivity, specificity, precision, and accuracy. The methods for estimating errors include: the holdout method, random sub-sampling, cross-validation, and leave-one-out.

Holdout method:
- The holdout method requires a training set and a test set. The sets are mutually exclusive.
- A larger training set would produce a better classifier, while a larger test set would produce a better estimate of the accuracy. A balance must be achieved.

Random sub-sampling method:
- Random sub-sampling is very much like the holdout method, except that it does not rely on a single test set.
- The holdout estimation is repeated several times, and the accuracy estimate is obtained by computing the mean of the several trials.
- Random sub-sampling is likely to produce better error estimates than the holdout method.

K-fold cross-validation method (sketched in code below):
- In k-fold cross-validation, the available data is randomly divided into k disjoint subsets of approximately equal size.
- One of the subsets is used as the test set, and the remaining k − 1 subsets are used for building the classifier; the test set is then used to estimate the accuracy.
- This is done repeatedly, k times, so that each subset is used as a test subset exactly once.
- Cross-validation has been tested extensively and has been found to generally work well when sufficient data is available.

Leave-one-out method:
- Leave-one-out is the special case of k-fold cross-validation in which k equals the number of training samples.
- In this method, one training sample is taken out and the model is generated using the remaining training data.
- Once the model is built, the one remaining sample is used for testing, and the result is coded as 1 or 0 depending on whether it was classified correctly or not.
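The k-fold procedure in Q6 is mechanical enough to sketch directly. Here train_and_score is a hypothetical stand-in for whatever classifier is being evaluated, and leave-one-out falls out as the k = n special case.

import random

def kfold_accuracy(data, k, train_and_score):
    """Estimate accuracy by k-fold cross-validation.
    data: list of labeled examples;
    train_and_score(train, test) -> accuracy on the test fold.
    Leave-one-out is simply kfold_accuracy(data, len(data), ...)."""
    data = data[:]
    random.shuffle(data)                        # random disjoint subsets
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):                          # each fold is the test set once
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k                      # mean accuracy over the k trials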
7. Explain the algorithm for generating the topology of a Bayesian network, with an example. (10 marks)

Let X = (x1, x2, …, xn) be a tuple described by variables or attributes Y1, Y2, …, Yn respectively. Each variable is conditionally independent of its nondescendants given its parents. This allows the DAG to provide a complete representation of the joint probability distribution:

P(x1, x2, …, xn) = Π_{i=1}^{n} P(xi | Parents(Yi))

where P(x1, x2, …, xn) is the probability of a particular combination of values of X, and the values P(xi | Parents(Yi)) correspond to the entries in the conditional probability table (CPT) for Yi.

A node within the network can be selected as an "output" node, representing a class label attribute; there may be more than one output node. Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class.

A Bayesian network specifies a joint distribution in a structured form, representing dependence and independence via a directed graph:
- Nodes = random variables; edges = direct dependence.
- The structure of the graph encodes the conditional independence relations, and the graph is required to be acyclic (no directed cycles).
- A Bayesian network has two components: the graph structure (the conditional independence assumptions) and the numerical probabilities (for each variable given its parents).
- The probability model therefore has a simple factored form: directed edges indicate direct dependence, and the absence of an edge indicates conditional independence.
- Bayesian networks are also known as belief networks, graphical models, or causal networks; other formulations exist, e.g., undirected graphical models.

Example. Consider the following 5 binary variables:
B = a burglary occurs at your house
E = an earthquake occurs at your house
A = the alarm goes off
J = John calls to report the alarm
M = Mary calls to report the alarm

What is P(B | M, J)? We could use the full joint distribution to answer this question, but it requires 2^5 = 32 probabilities. Instead, order the variables in terms of causality (which may be a partial order), e.g., {E, B} → {A} → {J, M}. Then:

P(J, M, A, E, B) = P(J, M | A, E, B) P(A | E, B) P(E, B)
                 ≈ P(J, M | A) P(A | E, B) P(E) P(B)
                 ≈ P(J | A) P(M | A) P(A | E, B) P(E) P(B)

These conditional independence assumptions are reflected in the graph structure of the Bayesian network. (A numerical sketch of this query appears after question 8 below.)

8. Consider the following training data:

Tid  Refund  Marital Status  Taxable Income  Evade
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single          70K             No
 4   Yes     Married         120K            No
 5   No      Divorced        95K             Yes
 6   No      Married         60K             No
 7   Yes     Divorced        220K            No
 8   No      Single          85K             Yes
 9   No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical attributes; Taxable Income is continuous; Evade is the class.)

Given the test record X = (Refund=No, Married, Income=120K), find its class using the naïve Bayes classifier. (10 marks)

P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
              = 4/7 × 4/7 × 0.0072 = 0.0024

P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
               = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X|No)P(No) > P(X|Yes)P(Yes), we have P(No|X) > P(Yes|X), and therefore Class = No.
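The question 8 arithmetic can be checked in a few lines. The Gaussian class-conditional density for Taxable Income is an assumption consistent with the quoted 0.0072 (sample mean and unbiased sample variance per class); the categorical fractions come straight from the table.

from math import exp, pi, sqrt

income = {"No": [125, 100, 70, 120, 60, 220, 75],   # Taxable Income, Evade=No
          "Yes": [95, 85, 90]}                      # Taxable Income, Evade=Yes

def gaussian(x, values):
    # sample mean and unbiased sample variance, matching the quoted 0.0072
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return exp(-(x - m) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# P(X|No) = P(Refund=No|No) * P(Married|No) * P(Income=120K|No)
p_x_no = (4 / 7) * (4 / 7) * gaussian(120, income["No"])      # ~0.0024
# P(X|Yes) = 1 * 0 * P(Income=120K|Yes) = 0
p_x_yes = (3 / 3) * (0 / 3) * gaussian(120, income["Yes"])    # 0.0

# compare P(X|Ci)P(Ci) with priors P(No) = 7/10, P(Yes) = 3/10
print(p_x_no * 7 / 10, p_x_yes * 3 / 10)    # the No class wins -> Class = No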
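Returning to question 7: once CPTs are attached to the graph, the query P(B | M, J) can be answered by enumerating the factored joint. The CPT numbers below are the standard burglary-alarm values from the Russell and Norvig textbook example, used here purely as assumed inputs since the question supplies none; the code enumerates P(J|A)P(M|A)P(A|E,B)P(E)P(B) over the hidden variables.

from itertools import product

# Assumed CPTs (Russell & Norvig alarm example, not from the original solution):
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint(b, e, a, j, m):
    # P(B) P(E) P(A|B,E) P(J|A) P(M|A), taking p or 1-p per variable value
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(B=true | J=true, M=true): sum out the hidden variables E and A
num = sum(joint(True, e, a, True, True)
          for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True)
          for b, e, a in product([True, False], repeat=3))
print(num / den)   # ~0.284 with these CPTs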