Data Mining and Knowledge Acquisition, Chapter 7: Data Mining Overview and Exam Questions (2013/2014 Summer)

Data Mining Methodology
Problem definition
Data set selection
Preprocessing transformations
Functionalities: classification/prediction, clustering, association, sequential analysis, others

Methodology (cont.)
Algorithms
For classification you can use decision trees; ID3, C4.5 and CHAID are algorithms
For clustering you can use partitioning methods (k-means, k-medoids), hierarchical methods (AGNES) and probabilistic methods (EM is an algorithm)
Presenting results: back transformations, reports
Taking action

Two basic styles of data mining
Descriptive: cross tabulations, OLAP, attribute-oriented induction, clustering, association
Predictive: classification, prediction
Questions answered by these styles
Difference between classification and prediction

Classification
Methods: decision trees, neural networks, Bayesian, k-NN or memory-based reasoning
Advantages and disadvantages
Given a problem, which data processing techniques are required

Classification (cont.)
Accuracy of the model
Measures for classification/numerical prediction
How to better estimate: holdout, cross validation, bootstrapping
How to improve: bagging, boosting
For unbalanced classes
What to do with models
Lift charts

Clustering
Distance measures: dissimilarity or similarity
For different types of variables: ordinal, binary, nominal, ratio, interval
Why the data need to be transformed
Partitioning methods: k-means, k-medoids; advantages and disadvantages
Hierarchical
Density based
Probabilistic

Association
Apriori or FP-Growth
How to measure the strength of rules: support and confidence
Other measures; critique of support and confidence
Multiple levels
Constraints
Sequential patterns

OLAP
Concept of cube
Fact table, measures
Dimensions
Schemas: star, snowflake
Concept hierarchies: set grouping (such as price, age), parent-child

Preprocessing
Missing values
Inconsistencies
Redundant data
Outliers
Data reduction: attribute elimination, attribute combination, sampling, histograms
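As a small illustration of the rule-strength measures named under Association in the overview (support, confidence and other measures such as lift), the sketch below computes all three for one rule on a toy transaction list. The transactions and the helper names `support`, `confidence` and `lift` are made up for this example and are not taken from the course material.

```python
from fractions import Fraction

# Hypothetical toy transactions; each transaction is a set of items.
transactions = [
    {"A", "C"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "B"},
    {"C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs).

    Assumes lhs occurs at least once, otherwise the division fails.
    """
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """confidence(lhs -> rhs) / support(rhs); a value of 1 suggests independence."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

print("support(A, C)    =", support({"A", "C"}, transactions))
print("confidence(A->C) =", confidence({"A"}, {"C"}, transactions))
print("lift(A->C)       =", lift({"A"}, {"C"}, transactions))
```

In this toy data the rule A -> C has a confidence of 2/3 but a lift below 1, because C is already very frequent on its own. This is the kind of situation the critique of the support-confidence framework refers to.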

Exam Questions
Introduction
Basic functionalities
Data description
Data preparation
Data warehousing, OLAP
Clustering
Classification/numerical prediction
Frequent pattern mining

Introduction
Defining data mining problems
Data mining functionalities

Define data mining problems
1. Suppose that a data warehouse for the Big-University Library consists of the following three dimensions: users, books, time, and each dimension has four levels not including the all level. There are three measures:
You are asked to perform a data mining study on that warehouse (25 points).
Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem.
What is the advantage of the data being organized as OLAP cubes compared to a relational table organization?

Define data mining problems
In the data preprocessing stage of KDD:
What are the reasons for missing values, and how do you handle them?
What are the possible data inconsistencies?
Do you apply any discretization?
Do you apply any data transformations?
Do you apply any data reduction strategies?

Define data mining problems
Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer.
Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer.
Describe the association task in detail, specifying the algorithm and the interestingness measures or constraints, if any.

Data mining on MIS
A data warehouse for the MIS department consists of the following four dimensions: student, course, instructor, semester, and each dimension has five levels including the all level. There are two measures: count and average grade. At the lowest level, the average grade is the actual grade of a student.
You are asked to perform a data mining study on that warehouse (25 points).

Data mining on MIS 2
Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem.
What is the advantage of the data being organized as OLAP cubes compared to a relational table organization?
In the data preprocessing stage of KDD:
What are the reasons for missing values, and how do you handle them?
What are the possible data inconsistencies?
Do you apply any discretization?
Do you apply any data transformations?
Do you apply any data reduction strategies?

Data mining on MIS 3
Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer.
Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer.
Describe the association task in detail, specifying the algorithm and the interestingness measures or constraints, if any.

Final 2010/2011 Spring (MIS)
3. (35 points) The aim of Knowledge Discovery from Databases (KDD) is to extract interesting, potentially useful, …, knowledge from data. The extracted knowledge can be represented in a knowledge base similar to a database. Considering the data mining functionalities and algorithms we covered in this course, describe five different knowledge types. For each type discuss the following aspects:
a) From which functionality and algorithm is it obtained?
b) How is it represented in the knowledge base? (Do not consider data structures.)
c) What are its quality characteristics?
d) How is it used in the deployment phase?

BIS 541 2011/2012 Final
1. For each of the following problems, identify the relevant data mining tasks.
a) A weather analyst is interested in calculating the likely change in temperature for the coming days.
b) A marketing analyst is looking for groups of customers so as to apply different CRM strategies for each group.
c) A medical doctor must decide whether a set of symptoms is an indication of a particular disease.
d) An educational psychologist would like to identify exceptional students in order to suggest them for special educational programs.

BIS 541 2012/2013 Final
For each of the following problems, identify the relevant data mining tasks with a brief explanation.
a) A weather analyst is interested in whether the temperature will go up or down on the coming day.
b) An insurance analyst intends to group policy holders according to the characteristics of customers and policies.
c) A medical researcher is looking for symptoms that occur together among a large set of patients.
d) An educational program director would like to determine the likely GPA of an applicant to an MA program from their ALES score, undergraduate GPA and entrance exam score.

Basic Functionalities
Decision tree: ID3, information gain
Association: Apriori
Clustering: k-means

Information gain
1. Consider a data set of two attributes A and B. A is continuous, whereas B is categorical with two values, "y" and "n", which can be considered the class of each observation. When attribute A is discretized into two equi-width intervals, no information is gained about the class attribute B, but when it is discretized into three equi-width intervals, perfect information is gained about B. Construct a simple data set obeying these characteristics.

Decision tree
2. a) Construct a data set that generates the tree shown below. In addition, the following conditions are satisfied.
Node 2: A = a1, decision Y
Node 3: A = a2
Node 4: B = b1, decision N
Node 5: B = b2, decision Y

Midterm 2006/2007 Spring (MIS)
2. Show that entropy is not a symmetric measure of association in the way the correlation coefficient is.
Construct a simple data set of two categorical attributes A and B such that i) knowing the values of A provides perfect information to predict B, but ii) knowing the values of B does not provide perfect information to predict A.

At a particular node:
when the information gain is 0
when it takes its maximum value

Associations
In a particular database, A → C and B → C are strong association rules based on the support-confidence measure. A and B are independent items. Does this imply that A, B → C is also a strong rule based on the lift measure?
A, B and C are items in a transaction database.
- If A → B and B → C are strong, is A → C a strong rule?
- If A → B and A → C are strong, is B → C a strong rule?

Data Description/Preprocessing

Midterm 2004/2005 Spring (MIS)
Consider the correlation coefficient between two numerical variables. Is its numerical value affected by the units of measurement of these variables (such as measuring temperature in °C or °F)?

Midterm 2011/2012 Fall (generate data)
5. (10 points) Consider two continuous variables X and Y. Generate data sets
a) where PCA (principal component analysis) cannot reduce the dimensionality from two to one
b) where, although the two variables are related (a functional relationship exists between them), PCA is not able to reduce the dimensionality from two to one

Midterm 2010/2011 Spring (MIS)
3. (25 points) Consider a data set of two continuous variables X and Y. X is right skewed and Y is left skewed. Both represent measures of the same quantity (sales categories, exam grades, …).
a) Draw typical distributions of X and Y separately.
b) Draw box plots of X and Y separately.
c) Draw quantile plots (q-plots) of X and Y separately.
d) Draw the q-q plot of X and Y.

MIS 541 2012/2013 Final
1. (20 points) Consider a data set of two continuous variables X and Y. X and Y have the same mean and both are symmetric (no skewness); X has a higher variance than Y.
Both represent measures of the same quantity (sales categories, exam grades, …).
a) Draw typical distributions of X and Y on the same graph.
b) Draw box plots of X and Y separately.

Final 2011/2012 Fall (data description)
1. (20 points) Give two examples of outliers:
a) where the outliers are useful and essential patterns to be mined
b) where the outliers are useless, stemming from error or noise

Final 2011/2012 Fall (preprocessing)
2. (20 points) Considering the classification methods we covered in class, describe two distinct reasons why continuous input variables have to be normalized for classification problems (each reason 10 points).

Midterm 2008/2009 Spring
4. (20 points) Principal components analysis is used for dimensionality reduction and may then be followed by cluster analysis, say for segmentation purposes. Consider a problem with two continuous variables. Using scatter plots:
a) Generate a data set where PCA reduces the dimensionality from two to one.
b) Generate a data set where, although there is a relation between the two variables, PCA is not able to reduce the dimensionality to one.
c) Generate a data set where there are natural clusters and PCA can reduce the dimensionality.
d) Generate a data set where there are natural clusters but PCA is not the appropriate method for reducing the dimensionality.

Midterm 2012/2013 Fall (MIS)
1. (20 points) Consider a data set of two continuous variables X and Y. X and Y have the same mean and both are symmetric (no skewness); X has a higher variance than Y. Both represent measures of the same quantity (sales categories, exam grades, …).
a) Draw typical distributions of X and Y on the same graph.
b) Draw box plots of X and Y separately.
c) Draw quantile plots (q-plots) of X and Y separately.
d) Draw the q-q plot of X and Y.

Data Warehousing/OLAP
Design of OLAP cubes
Measures

Midterm 2005/2006 Spring (MIS)
A large hypermarket has many branches throughout the country. The quantity purchased Q_i and the price P_i of each item i are stored in a warehouse.
The top management is interested in finding the cheapest of the most-sold items, min_p(max_q item i). Is it possible to accomplish this in a distributive manner? In other words, is min_p(max_q item i) a distributive measure?

Final 2007/2008 Spring (MIS)
1. (25 points) Suppose an aggregation is to be designed to obtain weekly dollar values from daily values in the two different ways described below. Can they be computed in a distributive manner? (The database has day ID and dollar value fields. Records are randomly selected and assigned to different processing units.)
a) Taking the daily average
b) Taking the last day's value of the week

Data warehouse for library
A data warehouse is constructed for the library of a university to be used as a multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user, books and time (time_ID, year, quarter, month, week, academic year, semester, day). "Week" is considered not to be less than "month". Each academic semester starts and ends at the beginning and end of a week respectively. Hence, week < semester.
Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. Describe at least two meaningful measures in the fact table.
Each dimension can be looked at its ALL level as well. What is the total number of cuboids for the library cube?
Describe three meaningful OLAP queries and write an SQL expression for one of them.

OLAP Big University
2. (Han page 100, 2.4) Suppose that the data warehouse for Big-University consists of the following dimensions: student, course, instructor, semester, and two measures, count and average_grade. At the lowest conceptual level (for a given student, instructor, course and semester) the average_grade measure stores the actual grade of the student. At higher conceptual levels, average_grade stores the average grade for the given combination.
(For example, when student is MIS, semester is 2005 all terms, course is MIS 541 and instructor is Ahmet Ak, average_grade is the average of the students' grades in that course by that instructor over all semesters in 2005.)

cont.
a) Draw a snowflake schema diagram for that warehouse. What are the concept hierarchies for the dimensions?
b) What is the total number of cuboids?

MIS 542 Final S06 1 (OLAP)
1. The MIS department wants to revise its academic strategies for the following ten years. Relevant questions are: What portion of the courses are required or elective? What is the full-time/part-time distribution of instructors? What is the course load of instructors? What percentage of technical or managerial courses are taught by part-time instructors? How have all these things changed over the years?

MIS 542 Final S06 1 (cont.)
You can add similar strategic questions of your own. Do not consider the student aspects of the problem for the time being. Design an OLAP schema to be used as a strategic tool. You are free to decide the dimensions and the fact table. Describe the concept hierarchies, virtual dimensions and calculated members. Finally, show OLAP operations to answer three such strategic questions.

Midterm 2006/2007 Spring
1. A data warehouse is constructed for the web site of an e-commerce company to be used for customer segmentation. Each visitor's click stream data is recorded. Each session has an ID. Suppose this warehouse consists of the following dimensions: visitor, time, product. There is a concept hierarchy for products, which is reflected in the design of the web site so that products can be browsed in a hierarchical manner. When a product is viewed it can be purchased. Only registered customers can use the system, so each visitor has an ID. When registering, a form is filled out so that sociodemographic information is taken from the customer. Suppose income (a numerical variable), birthday, gender, profession and marital status are asked.

cont.
a) Describe concept hierarchies for the three dimensions.
Construct meaningful attributes for each of the dimension tables above. (What transformations are required before constructing these attributes?) Describe at least two meaningful measures in the fact table.
b) Each dimension can be looked at its ALL level as well. Describe three meaningful OLAP queries and write an SQL expression for one of them.
c) Define a clustering problem: Which variables are important? Is there a missing value problem? What data transformations are needed? Which algorithm would you suggest?

Midterm 2007/2008 Spring
1. (20 points) Consider a shipment company responsible for shipping items from one location to another on predetermined due dates. Design a star schema OLAP cube for this problem to be used by managers for decision making purposes. The dimensions are time, item to be shipped, person responsible for shipping the item, and location. For each of these dimensions, determine three levels in the concept hierarchy. Design the fact table with appropriate measures and keys (include two measures and at least one calculated member in the fact table). Show one drill-down and one roll-up operation. Show the SQL query for one of the cuboids.

Midterm 2008/2009 Spring
1. (25 points) In an organization, a data warehouse is to be designed for evaluating the performance of employees. To evaluate the performance of an employee, a survey questionnaire consisting of a set of questions on a 5-point Likert scale is answered by other employees in the same company at specified times. That is, the performance of employees is rated by other employees. Each employee has a set of characteristics including department, education, … Each survey is conducted on a particular date and applied to some of the employees. Questions are aimed at evaluating broad categories of performance such as motivation, cooperation ability, … Typically, a question in a survey, aiming to measure a specific attitude of an employee, is evaluated by another employee (rated from 1 to 5). Data is available at the question level.

cont.
Cube design: a star schema
Fact table: Design the fact table; it should contain one calculated member. What are the measures and keys?
Dimension tables: Employee and Time are the two essential dimensions; include Survey and Question dimensions as well. For each dimension show a concept hierarchy.
State three questions that can be answered by that OLAP cube. Show drill-down and roll-up operations related to these questions.

MIS 541 2012/2013 Final
2. (20 points) Suppose that a data warehouse for a hospital consists of the dimensions time, doctor and patient, and the two measures count and charge, where charge is the fee a doctor charges a patient for a visit. Design a warehouse with a star schema:
a) Fact table: Design the fact table.
b) Dimension tables: For each dimension show a reasonable concept hierarchy.
c) State two questions that can be answered by that OLAP cube.
d) Show drill-down and roll-up operations related to one of these questions.

BIS 541 2011/2012 Final
2. Develop a data warehouse for an insurance company using a fact constellation schema. The company holds the insurance premiums paid by its customers for different types of policies, as well as the payments made to its customers in case of accidents. There are two fact tables, for premiums and payments respectively. The dimensions are customer, time, policy and accident; some are shared by the two fact tables.
a) Design the fact tables: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Show one roll-up and one drill-down operation.

BIS 541 2012/2013 Final
Develop a data warehouse for a weather bureau that has a large number of probes located all over a large region, using a star schema. These probes collect basic weather data such as temperature, air pressure, humidity, … every hour. All the data is sent to a central station to be processed.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by querying the warehouse.
d) Show one roll-up and one drill-down operation for one of these questions.

BIS 541 2011/2012 Final
2. Develop a data warehouse for holding the academic performance of a university's faculty members. The dimensions are time (here the academic year is important, but the day of publication is too detailed), faculty member and paper. For an article published by a faculty member in a particular paper, the number of citations received and the impact factor of that paper are important. Papers can be journal articles or conference proceedings; journals can be in SCI or SSCI, and each such journal or conference has a prestige factor, a continuous variable.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Describe in words five different types of queries that can be answered by the OLAP cube.
d) Show two roll-up and two drill-down operations.

Clustering

Clustering preferences
Consider a popular song competition. There are N competitors A1, A2, …, AN. The number of voters is very large: a substantial fraction of the population of the country. Each voter is able to rank the competitors from best to worst, e.g. for voter 1 (A4 > A2 > A3 > A1), meaning that there are four competitors and A4 is the best for voter 1, A1 being the worst. Suppose preference data is available for a sample of n voters at the beginning of the competition.
Develop a distance measure between the preferences of two voters i and j.
Suppose you have the k-means algorithm available in a package. Describe how you can use the k-means algorithm to cluster voters according to their preferences.
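One possible way to approach the preference question, offered only as a sketch and not as the expected answer, is to encode each voter's ranking as a vector of positions (one coordinate per competitor) and to use the Euclidean distance between these vectors. Because k-means itself works with Euclidean distances and means, the rank vectors can then be passed to it unchanged. The choice of Euclidean distance on positions is an assumption (a Kendall tau distance would be another option, but it does not fit k-means as directly), and the helpers `rank_vector`, `rank_distance` and the tiny `kmeans` stand-in for a packaged implementation are hypothetical.

```python
import math

def rank_vector(ranking, competitors):
    """Position (1 = best) of each competitor in one voter's ranking."""
    position = {name: i + 1 for i, name in enumerate(ranking)}
    return [position[c] for c in competitors]

def rank_distance(r1, r2):
    """Euclidean distance between two rank vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

def kmeans(points, centers, iterations=20):
    """Plain Lloyd iterations; `centers` holds the initial centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centers)), key=lambda j: rank_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its centroid
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

competitors = ["A1", "A2", "A3", "A4"]
votes = [
    ["A4", "A2", "A3", "A1"],  # voter 1 from the example in the question
    ["A4", "A3", "A2", "A1"],  # the remaining voters are invented
    ["A1", "A2", "A3", "A4"],
    ["A1", "A3", "A2", "A4"],
]
points = [rank_vector(v, competitors) for v in votes]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(points)
print(labels)
```

With these four invented voters, the two who put A4 first end up in one cluster and the two who put A1 first in the other. The cluster centroids are average rank vectors, which can be read as a consensus preference for each group.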

Clustering
Construct simple data sets showing the inadequacies of k-means clustering (20 points). This algorithm is not suitable even for spherical clusters of different sizes. What are the advantages and disadvantages of using k-means?

Clustering
1. Consider a delivery center location decision problem in a city, where a set of related products is to be delivered to markets located in the city. Design an algorithm for this location selection problem by extending an algorithm we covered in class. State clearly the algorithm and its extensions for this particular problem.

MIS 542 Final S06 (clustering)
3. a) Describe how to modify the k-means algorithm so as to handle categorical variables (binary, ordinal, nominal).
b) What is a disadvantage of the agglomerative hierarchical clustering method in the case of large data? Suggest a way of eliminating this disadvantage while keeping the advantages of agglomerative methods.

MIS 542 Midterm S08 (clustering)
Generate a data set of two continuous variables X and Y. Consider clustering based on density:
When clustered with one variable (either X or Y) there is one cluster.
When clustered with both variables there are two clusters.

Final 2007/2008 Spring (MIS)
2. Considering the advantages and disadvantages of partitioning methods such as k-means and of density-based methods of clustering, generate a two-dimensional data set
a) (5 points) successfully clustered by k-means and DBSCAN
b) (5 points) successfully clustered by k-means but not by DBSCAN
c) (5 points) successfully clustered by DBSCAN but not by k-means
d) (10 points) Suggest a clustering procedure combining the two methods.

Midterm 2008/2009 Spring (MIS)
5. (20 points) In a clustering problem, either the z transformation or the logistic transformation (y = 1/(1+exp(-z))) is applied to the original variables. Discuss the effects of these transformations on the quality and nature of clusters for a problem with two continuous variables. Suppose the k-means algorithm is then used for clustering.
In particular, what is the consequence of these transformations (logistic and z) for the similarity/dissimilarity between objects and for the nature of the clusters formed?

Final 2010/2011 Spring (MIS)
1. (35 points) Consider a time series problem: a continuous variable observed at regularly spaced time steps, such as the daily dollar/TL exchange rate (for each day a $/TL value is available) or the monthly inflation rate.
a) Suppose data for some time periods (days) is not available. Suggest a method for handling the missing value problem in time series data.

cont.
b) The continuous variable is to be discretized into piecewise linear segments characterized by slope and duration. Slope can take, say, five distinct values: very high, high, horizontal, low, very low. Duration can take, say, three values: short, medium and long. Plot a time series. Plot the piecewise linear discrete form on the same graph. Propose a method for obtaining such piecewise linear segments.
c) The following are examples of rules extracted from the piecewise linear segments: a long period of boom (very high slope) is followed by a short period of down movement; a medium period of down movement is followed by a long period of horizontal behavior. Suggest a method for extracting such rules from piecewise linear segments.

Midterm 2011/2012 Fall
In Questions 3-5, artificial data sets are generated for given situations.
3. (10 points) Consider a data set of two continuous variables X and Y. There are two clusters (k = 2). Considering the advantages and disadvantages of the partitioning methods k-means and k-medoids, generate a two-dimensional data set that
a) (5 points) produces almost the same clusters with k-medoids and k-means
b) (5 points) produces different clusters with k-medoids and k-means

Final 2011/2012 Fall
3.a (10 points) Generate data sets for two clustering problems with two continuous variables.
Two natural clusters for the notion of density-based clustering, but the quality of these clusters is low for a partitioning approach based on dissimilarity, such as k-means.
3.b (10 points) Consider the advantages and disadvantages of partitioning and hierarchical agglomerative clustering approaches. Design a method for combining the two approaches to improve clustering quality. (In the end there are hierarchies of clusters.)

Midterm 2011/2012 Fall
6. (25 points) A retail company asks for its customers to be segmented. The following variables are available for each customer: age, income, gender, number of children, occupation, house owner, has a car or not. There are 6 categories of goods sold by the company and the total purchases from each category are available for each customer; in addition, the average inter-purchase time is also included in the database.

Midterm 2011/2012 Fall (cont.)
a) What are the types and scales of these variables?
b) If your tool has only the k-means algorithm, which of these variables are more suitable for the segmentation problem?
c) What data transformations are to be applied?
d) How do you reduce the number of variables used in the analysis?
e) If you want to include categorical variables in your clustering, how would you treat them?

Midterm 2010/2011 Spring (MIS)
5. (25 points) Consider a data set representing the interactions among a set of people. The degree of interaction is a positive real number; high values can be interpreted as the two members being closely related (they have close interactions such as heavy telephone calls or mail traffic between them). In other words, rather than the coordinates of variables being included directly, the similarity/dissimilarity matrix is given. This is a symmetric matrix. Develop an algorithm for clustering similar objects into the same clusters. Assume that the number of clusters (k) is given.

Midterm 2010/2011 Spring (MIS)
4.
(25 points) A strategy for clustering high-dimensional data of continuous variables is: first apply principal components to reduce the dimensionality of the data set, then apply clustering on the reduced form of the data. Discuss the drawback(s) of this approach.

Midterm 2012/2013 Fall (MIS)
4. (20 points) Consider a data set of two continuous variables X and Y. Consider two data points P and Q. Suppose Q is the origin (0,0). P is one unit away from Q.
a) Draw the locus of points for P based on the Euclidean, Manhattan and Chebyshev distance notions.
b) Suppose the relative importance of X with respect to Y is controlled by weighting X in the distance formulas. What is the meaning of this weight being greater than one and less than one respectively?
c) Draw the locus of points for P for the three distance notions in part a) when X is weighted greater than one and less than one respectively.

BIS 541 2011/2012 Final
3. Consider a customer segmentation problem to be solved with the k-means algorithm. The following variables are available in the data set: gender, member card information, total spending in TL and education level.
a) What are the scales of these variables?
b) How would you transform the data before applying clustering?
c) How do you find the similarity/dissimilarity between two customers?

BIS 541 2011/2012 Final
1. Generate two different data sets of two continuous input variables X1 and X2 for a clustering problem
a) that would give almost the same clustering results when solved by k-means and k-medoids
b) that would give different sets of clusters when solved by k-means and k-medoids

Comparing clustering methods
Clustering methods: partitioning, hierarchical, density based, model based (probabilistic, EM)
Compare clustering methods by:
Output
Interpretation
Sensitivity to outliers
Speed of computation

Classification
Decision trees
Neural networks
k-NN
Bayesian classification
Measuring and improving accuracy

MIS 542 Final S06 2
2.
Given the training data set with missing values:

A (size)  B (color)  C (shape)  Class
small     yellow     round      A
big       yellow     round      A
big       yellow     red        A
small     red        round      A
small     black      round      B
big       black      cube       B
big       yellow     cube       B
big       black      round      B
small     yellow     cube       B

MIS 542 Final S06 2 (cont.)
a) Apply the C4.5 algorithm to construct a decision tree.
b) Given the new inputs X: size = small, color = missing, shape = round, and Y: size = big, color = yellow, shape = missing, what is the prediction of the tree for X and Y?
c) How do you classify the new data points given in part b) using Bayesian classification?
d) Analyse the possibility of pruning the tree. You can make a normal approximation to the binomial distribution even though the number of observations is low. The z value for the upper confidence limit of c = 25% is 0.69.

MIS 542 Final S06 (neural networks)
4. Consider a classification problem with two classes, C1 and C2. There are two numerical input variables X1 and X2, taking values between 0 and infinity. All observations above the curve X2 = 1/X1 (a hyperbola) are of class C1. All other observations are of class C2. Describe how multilayer perceptrons can separate such a boundary using as few hidden nodes as possible.

MIS 542 Midterm S08 2 (classification)
Consider a classification problem with two continuous variables X and Y and a categorical output with two distinct values C1 and C2. Generate data sets such that
A) decision trees are appropriate for classification
B) decision trees are not appropriate for classification but a perceptron can classify the data successfully
C) even a single perceptron is not enough to classify the data
D) How do you incorporate a perceptron into decision trees so that the cases in B and C can be classified by a hybrid approach of decision trees and perceptrons?

Final 2010/2011 Spring
2. (30 points) Consider a prediction problem, e.g. predicting weight using height (a continuous variable) as input, solved by neural networks.
Methods such as back propagation try to minimize the prediction error, but it is claimed that the magnitude of the error depends on the weight: a prediction error of 0.5 for a baby with a short height should not be treated the same as for an adult with a height of 2.00 meters.
a) Make a scatter plot of such a hypothetical data set for a two-variable problem.
b) Plot the prediction error on another graph.
c) Do you need to modify the back propagation algorithm so as to handle such a situation? If so, explain your modification.

Final 2011/2012 Fall (supervised learning)
4. Illustrate the overfitting of neural networks for the following cases by generating data sets.
a) (10 points) For a binary classification problem with two continuous inputs.
b) (10 points) For a numerical prediction problem (output being continuous) with one continuous input variable.

Midterm 2011/2012 Fall (generate data)
4. (10 points) Consider a classification problem solved by a decision tree. Consider a categorical input variable A having two distinct values. The output variable B has two distinct classes as well. At a particular node of the tree there are N data objects. Generate partitionings of the data by input variable A for the following:
a) A does not provide any information: the information gain is zero.
b) A provides perfect information: the information gain is as large as possible.

MIS 541 2012/2013 Final
5. (20 points) Consider a classification problem solved by k-NN. Suppose all inputs in your data set are continuous variables. Why do you need to apply data transformations? What data transformation is applied? Suppose the variables are to be weighted after the transformations. Devise a method for determining optimal weights for the variables as well as the optimal k value, considering that k-NN is a supervised learning method.

MIS 541 2012/2013 Final
1. (20 points) Consider a decision tree with only two branches per node, in which the attribute selection measure is entropy.
Bearing in mind that each candidate input attribute may have more than two distinct values, how do you modify the ID3 algorithm to handle such a constraint on the number of branches of the tree?

MIS 541 2012/2013 Final
2. (20 pts) Illustrate with plots of two continuous inputs and a binary class that neural networks with one hidden layer are enough to classify convex class boundaries, and that two hidden layers are enough to capture even non-convex class boundaries.

MIS 541 2012/2013 Final
5. (20 pts) The following table consists of training data from an employee database. The predicted variable is status; age, salary and department are inputs. Design a multilayer feedforward neural network for the given data. Label the nodes in the input, hidden and output layers. Describe how you encode the input and output variables, and specify the parameters of the network that can be changed by the backpropagation algorithm.

Department  Status  Age    Salary
Sales       Senior  31-35  46K-50K
Sales       Junior  26-30  26K-30K
Sales       Junior  31-35  31K-35K
Systems     Junior  21-25  46K-50K
Systems     Senior  31-35  66K-70K
Systems     Junior  26-30  46K-50K
Systems     Senior  41-45  66K-70K
Marketing   Senior  36-40  46K-50K
Marketing   Junior  31-35  41K-45K
Secretary   Senior  46-50  36K-40K
Secretary   Junior  26-30  26K-30K

Midterm 2007/2008 Spring (20 pnt)
The MIS department has a number of criteria for choosing graduate students, such as GPA, ALES score and interview points. Some students may fail to complete the program; most others graduate successfully. Considering this outcome as a binary variable, describe how the best weighting of the entrance criteria could be determined. (Assume enough data is available in our database.)

Final 2003/2004 Spring
4. Consider the network topology shown below. There is one input X and two outputs Y1 and Y2. The activation function in nodes 2, 3 and 4 is the hyperbolic tangent, $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$. Currently all weights and biases are zero.
a) Derive the backpropagation rule for the tanh activation function for hidden and output units.
b) Perform one iteration of the algorithm when the data point (X = 0, Y1 = 1, Y2 = -1) is presented to the network.

[Figure: network topology; node labels 1, 2, 3, 4]

Midterm 2010/2011 Spring
2. (25 points) Consider a prediction problem with one continuous input and one output that is solved by the following network topology: there are n layers, and in each layer there is only one node, with a logistic transfer function and no bias (constant) term. The nodes in the layers are indexed from 1 (the output node) to n (the node to which the input is sent). wi is the weight for layer i, so w1 is the weight for the output node and wn is the weight applied to the input before sending it to node n (the first node). Derive the back propagation weight update formula for weight wi (i = 1 for the output weight and i > 1 for the hidden node weights). Note that the nodes are indexed in reverse order (from output to input) for convenience. The derivative of the logistic function y = 1/(1+exp(-x)) is y*(1-y).

Accuracy measures
For class balance or imbalance problems
Output variables with ordinal scale
How do you modify the accuracy measure for an ordinal output variable with three different values?
Give an example of such a variable.

Midterm 2008/2009 Spring
2. (20) Consider a classification problem in which customers taking consumer credit from a bank are classified into three risk groups. The input variables are age (discretized into 4 groups), income (4 groups), education (4 groups), gender, the number of months the customer has been dealing with the bank, the average delay of payments in months, and the current value of the account balance. The output variable has 3 categories (risky, normal or highly risky), calculated by some procedure and provided to the data miner.
Design an encoding scheme for the input and output variables so that the problem can be solved by a neural network. Show a typical topology of a feedforward network architecture.

Midterm 2008/2009 Spring
3. (20 points) Consider a classification problem solved by a decision tree. There are two categorical input variables A and B, each having two distinct values. The output variable C has two distinct classes. Suppose the data set is suitable for decision trees. Does the order in which the variables are selected affect the classification error? Support your answer by generating data sets pictorially. (The stopping condition is that either a pure class is obtained or no variables remain to be tested.)

Midterm 2012/2013 Fall (MIS)
3. (20 pts) A data mining study for a targeted marketing problem reveals that the only variable (X) explaining the buying behavior is previous spending (continuous). The probability of a customer responding to the mail offer is P(buy) = 0.1*X, where 0 <= X <= 10. Suppose there are 100 customers whose X values are uniformly distributed between 0 and 10. Suppose the cost of dealing with a customer is c and the revenue from a buyer is r (r > c). What is the break-even point in terms of the previous spending X? That is, for what values of X should the company treat a new customer?

Midterm 2012/2013 Fall (MIS)
5. (20 pts) Consider a classification problem solved by k-NN. Suppose all inputs in your data set are continuous variables. Why do you need to apply data transformations? Which data transformation is applied? Suppose the variables are to be weighted after the transformations. Devise a method for determining optimal weights for the variables as well as the optimal k value, considering that k-NN is a supervised learning method.

BIS 541 2011/2012 Final
4. Construct a particular node of a decision tree. There are 6 data points at that node. The output is a categorical variable with two distinct values.
Generate a data set of three variables, one being the output (Y) and the others the inputs (X1 and X2), such that X1 reduces entropy as much as possible (maximum information gain) whereas X2 does not reduce it at all (zero information gain).

BIS 541 2011/2012 Final
3. Generate data sets for a supervised learning problem solved by neural networks.
a) There are two continuous independent variables X1 and X2 and a class variable with two different values, such as yes and no. On the same artificially generated data set, illustrate the concept of overfitting by neural networks.
b) Illustrate the behavior of the training and test errors as the complexity of the network increases.

BIS 541 2011/2012 Final
4. Consider a classification problem to be solved by the k-NN method. The output is whether the customer will buy a product or not. The inputs are the income, age, education level and profession of the customer (profession having three distinct values).
a) Describe the data transformations needed in the preprocessing step to prepare the data set to be classified by k-NN.
b) How do these data transformations differ from those needed to solve the same problem by neural networks?

BIS 541 2012/2013 Final
Based on a sample of 30 observations, the population regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is estimated. The least squares estimate of the intercept is 10.0. The sums of the values of the dependent and independent variables are 450 and 150, respectively. The estimated variance of the dependent variable is 25, and the variance of the residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not have a significant effect on Y at a confidence level of 95%. The critical value is $F_{0.05}(1, 28) = 4.20$.

BIS 541 2012/2013 Final
Evaluate the four classification methods (decision trees, neural networks, Bayesian classification and k-NN) in terms of:
a) accuracy
b) speed of model development and use
c) understandability and interpretability of the output
d) handling of outliers, if not handled in the preprocessing step

Frequent Pattern Mining
Association rules
Apriori, FP-Growth
Multilevel rules
Quantitative variables
Interestingness measures
Constraint-based association rule mining
Sequential pattern mining

MIS 541 2012/2013 Final
3. (20 pts) Consider association rules X → Y, where X is a categorical variable with more than two values and Y is originally continuous but discretized into categories. Give example variables for X and Y. Illustrate that confidence as an interestingness measure may be misleading. Suggest a modification to the classical confidence that eliminates its drawback for this type of variable.

MIS 541 2012/2013 Final
4. (20 pts) The price of each item is nonnegative. For the following cases, indicate the type of constraint (monotone, anti-monotone, tough, strongly convertible or succinct):
a) The sum of the prices of the items is less than or equal to 10.
b) The average price of the items is less than or equal to 20.

BIS 541 2012/2013 Final
Questions about constraint-based association rule mining. The price of each item is nonnegative. For the following cases, indicate the type of constraint (monotonic, anti-monotonic or none):
a) The sum of the prices of the items is less than or equal to 10.
b) The average price of the items is less than or equal to 20.

MIS 542 midterm S06 association constraint
The price of each item in a store is nonnegative.
For the following cases, indicate the type of constraint (such as monotone, anti-monotone, tough, strongly convertible or succinct):
a) Containing at least one Nintendo game.
b) The average price of the items is between 100 and 500.

Tips for the exam
Data description
For single variables: ordinal, nominal, continuous
For two variables:
One categorical, the other continuous
Both are continuous – correlation coefficient
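The two-variable descriptions named in the tips can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original slides: the function names `pearson_r` and `group_means` and all data values are hypothetical. It computes the sample Pearson correlation coefficient for two continuous variables, and per-category means for one categorical and one continuous variable.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def group_means(categories, values):
    """Mean of a continuous variable within each level of a categorical one."""
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    return {c: mean(vs) for c, vs in groups.items()}


# Hypothetical data, invented for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 90]
departments = ["Sales", "Sales", "Systems", "Systems", "Systems"]

print(pearson_r(heights, weights))       # both variables continuous
print(group_means(departments, weights)) # one categorical, one continuous
```

A value of the coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship; comparing group means is one simple way to describe a categorical variable against a continuous one.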