Data Mining and Knowledge Acquisition
Chapter 7: Data Mining Overview and Exam Questions
2014/2015 Summer

Outline
Methodology - Overview
Introduction
Data Description – Preprocessing
OLAP
Clustering
Classification
Numerical Prediction - Regression
Frequent Pattern Mining
Recent BIS Exams
Unclassified Questions

Methodology and Overview
KDD Methodology
Functionalities

KDD Methodology
Methodology
  Problem definition
  Data set selection
  Preprocessing transformations
Functionalities
  Classification/numerical prediction
  Clustering
  Frequent pattern mining: association, sequential analysis
  Others

KDD Methodology (cont.)
Algorithms
  For classification you can use decision trees; ID3, C4.5 and CHAID are algorithms
  For clustering you can use partitioning methods (k-means, k-medoids), hierarchical methods (AGNES) and probabilistic methods (EM is an algorithm)
Presenting results
  Back transformations
  Reports
Taking action

Data Description
Single variables
  Categorical: ordinal, nominal
  Continuous: interval, ratio
  Frequency plots, tables, pie charts
  Five-number summary, central tendency, spread
  Examine the probability distribution
For two variables
  Both categorical: cross tabulation
  One categorical, the other continuous
  Both continuous: correlation coefficient, scatter plots

Preprocessing
Missing values
Inconsistencies
Redundant data
Outliers
Data transformations
Data reduction
  Attribute elimination
  Attribute combination
  Sampling
  Histograms

Functionalities
Styles of data mining
Descriptive - OLAP
Classification
Numerical prediction
Clustering
Frequent pattern mining

Two basic styles of data mining
Descriptive: cross tabulations, OLAP, attribute-oriented induction, clustering, association
Predictive: classification, numerical prediction
Difference between classification and numerical prediction
Questions answered by these styles
Supervised vs. unsupervised

Descriptive - OLAP
Concept of data cube
Fact table
  Measures, calculated measures
  Keys
Dimensions
Schemas: star, snowflake
Concept hierarchies
  Set grouping, such as price or age
  Parent-child
  Attributes not suitable for concept hierarchies

Classification
Methods: decision trees, neural networks, Bayesian, k-NN or memory-based reasoning
Advantages and disadvantages
Given a problem, which data processing techniques are required?
Given a problem, which classification method or algorithm is more appropriate?

Classification (cont'd)
Accuracy of the model
Measures for classification/numerical prediction
How to better estimate: holdout, cross validation, bootstrapping
How to improve: bagging, boosting
For unbalanced classes
What to do with models: lift charts

Numerical Prediction
Learning is supervised
Output variable is continuous
Methods: regression (simple, multiple)
Most methods for classification can be used for numerical prediction as well
Accuracy: root mean square error, mean absolute deviation

Clustering
Distance measures
  Dissimilarity or similarity
  For different types of variables: ordinal, binary, nominal, ratio, interval
  Why the data need to be transformed
Partitioning methods: k-means, k-medoids
  Advantages and disadvantages
Hierarchical
Density based
Probabilistic

Frequent Pattern Mining
Association analysis: Apriori or FP-Growth
How to measure the strength of rules
  Support and confidence
  Other measures of interestingness; critique of support and confidence
Multiple levels
Constraints
Sequential pattern mining

Introduction
Defining problems: given a short description of an environment, define data mining problems fitting different functionalities, and possible preprocessing problems peculiar to the environment.
Basic functionalities: given a short description of a data mining problem, with which functionality is the problem solved?

Big University Library
1. Suppose that a data warehouse for Big-University Library consists of the following three dimensions: users, books, time, and each dimension has four levels not including the all level. There are three measures. You are asked to perform a data mining study on that warehouse. (25 pnt)
Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem.
What is the advantage of the data being organized as OLAP cubes compared to relational table organisation?

Big University Library (cont.)
In the data preprocessing stage of the KDD:
What are the reasons for missing values, and how do you handle them?
What are possible data inconsistencies?
Do you make any discretization?
Do you make any data transformations?
Do you apply any data reduction strategies?

Big University Library (cont.)
Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer.
Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer.
Describe the association task in detail, specifying the algorithm, interestingness measures or constraints, if any.

Data mining on MIS
A data warehouse for the MIS department consists of the following four dimensions: student, course, instructor, semester, and each dimension has five levels including the all level. There are two measures: count and average grade. At the lowest level, average grade is the actual grade of a student. You are asked to perform a data mining study on that warehouse. (25 pnt)

Data mining on MIS (cont.)
Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively.
Clearly state the importance of each problem.
What is the advantage of the data being organized as OLAP cubes compared to relational table organisation?
In the data preprocessing stage of the KDD:
What are the reasons for missing values, and how do you handle them?
What are possible data inconsistencies?
Do you make any discretization?
Do you make any data transformations?
Do you apply any data reduction strategies?

Data mining on MIS (cont.)
Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer.
Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer.
Describe the association task in detail, specifying the algorithm, interestingness measures or constraints, if any.

Data Description
How to describe single variables: categorical and continuous
How to describe the association between two variables:
  both continuous
  both categorical
  one continuous, one categorical

Preprocessing
What to do as preprocessing? Which techniques are applied? For what reason?

MIS 542 Midterm 2011/2012 Fall: PCA
5. (10 points) Consider two continuous variables X and Y. Generate data sets
a) where PCA (principal component analysis) cannot reduce the dimensionality from two to one
b) where, although the two variables are related (a functional relationship exists between them), PCA is not able to reduce the dimensionality from two to one

MIS 542 Final 2011/2012 Fall: outliers
1. (20 points) Give two examples of outliers:
a) where the outliers are useful and essential patterns to be mined
b) where the outliers are useless, stemming from error or noise
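The PCA question above (MIS 542 Midterm 2011/2012 Fall, question 5) can be explored numerically. The following is only a minimal sketch, not a model answer: the data sets, the seed, the helper name `explained_variance_ratio` and the choice to standardize the variables first are all assumptions made for illustration. It compares the share of variance on the first principal component for independent variables, for the nonlinear relation Y = X^2 with X symmetric around zero, and for a linear relation as a contrast.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance carried by each principal component.

    Assumption: variables are standardized first (PCA on the correlation
    matrix), so that a difference in scale cannot favor one variable.
    """
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    covariance = np.cov(standardized, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]  # largest first
    return eigenvalues / eigenvalues.sum()

rng = np.random.default_rng(0)  # fixed seed; all data here are hypothetical

# Case a) X and Y are independent: no direction dominates.
x_a = rng.normal(size=2000)
y_a = rng.normal(size=2000)
ratio_a = explained_variance_ratio(np.column_stack([x_a, y_a]))

# Case b) Y = X^2 with X symmetric around zero: an exact functional
# relationship, yet the linear correlation is zero.
x_b = np.linspace(-1.0, 1.0, 2001)
y_b = x_b ** 2
ratio_b = explained_variance_ratio(np.column_stack([x_b, y_b]))
corr_b = np.corrcoef(x_b, y_b)[0, 1]

# Contrast: a linear relationship, where one component carries everything.
y_lin = 3.0 * x_b + 1.0
ratio_lin = explained_variance_ratio(np.column_stack([x_b, y_lin]))

print("independent:", ratio_a)
print("Y = X^2:    ", ratio_b, "correlation:", corr_b)
print("linear:     ", ratio_lin)
```

Because PCA only uses the covariance (here correlation) structure, a relationship with zero linear correlation leaves both components with about half of the variance, so neither can be dropped without losing information.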
MIS 542 Final 2011/2012 Fall: transformations
2. (20 points) Considering the classification methods we cover in class, describe two distinct reasons why continuous input variables have to be normalized for classification problems (each reason 10 points).

OLAP
Concept of data cube
Fact table
  Measures, calculated measures
  Keys
Dimensions
Schemas: star, snowflake
Concept hierarchies
  Set grouping, such as price or age
  Parent-child
  Attributes not suitable for concept hierarchies

Data warehouse for library
A data warehouse is constructed for the library of a university to be used as a multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user, books, and time (time_ID, year, quarter, month, week, academic year, semester, day). "Week" is considered not to be less than "month". Each academic semester starts and ends at the beginning and end of a week respectively. Hence, week < semester.
Describe concept hierarchies for the three dimensions.
Construct meaningful attributes for each of the dimension tables above.
Describe at least two meaningful measures in the fact table.
Each dimension can be looked at at its ALL level as well. What is the total number of cuboids for the library cube?
Describe three meaningful OLAP queries and write SQL expressions for one of them.

Big University
2. (Han page 100, 2.4) Suppose that the data warehouse for the Big-University consists of the following dimensions: student, course, instructor, semester, and two measures: count and average_grade. At the lowest conceptual level (for a given student, instructor, course, and semester) the average_grade measure stores the actual grade of the student. At higher conceptual levels average_grade stores the average grade for the given combination.
(For example, when student is MIS, semester is 2005 all terms, course is MIS 541 and instructor is Ahmet Ak, average_grade is the average of the students' grades in that course by that instructor over all semesters in 2005.)

Big University (cont.)
a) Draw a snowflake schema diagram for that warehouse. What are the concept hierarchies for the dimensions?
b) What is the total number of cuboids?

MIS 542 Final 2005/2006 Spring: OLAP
1. The MIS department wants to revise academic strategies for the following ten years. Relevant questions are: What portion of the courses are required or elective? What is the full-time/part-time distribution of instructors? What is the course load of instructors? What percent of technical or managerial courses are taught by part-time instructors? How have all these things changed over the years? You can add similar strategic questions of your own. Do not consider the student aspects of the problem for the time being.
Design an OLAP schema to be used as a strategic tool. You are free to decide the dimensions and the fact table. Describe the concept hierarchies, virtual dimensions and calculated members. Finally, show OLAP operations to answer three of such strategic questions.

MIS 54 Final 2012/2013: Hospital
2. (20 pts) Suppose that a data warehouse for a hospital consists of the following dimensions: time, doctor and patient, and the two measures count and charge, where charge is the fee a doctor charges a patient for a visit. Design a warehouse with star schema:
a) Fact table: design the fact table.
b) Dimension tables: for each dimension show a reasonable concept hierarchy.
c) State two questions that can be answered by that OLAP cube.
d) Show drill-down and roll-up operations related to one of these questions.

Human Resource cube
1. (25 points) In an organization a data warehouse is to be designed for evaluating the performance of employees.
To evaluate the performance of an employee, a survey questionnaire consisting of a set of questions on a 5-point Likert scale is answered by other employees in the same company at specified times. That is, the performance of employees is rated by other employees. Each employee has a set of characteristics including department, education, ... Each survey is conducted at a particular date and applied to some of the employees. Questions are aimed at evaluating broad categories of performance such as motivation, cooperation ability, ... Typically, a question in a survey, aiming to measure a specific attitude of an employee, is evaluated by another employee (rated from 1 to 5). Data is available at question level.

Human Resource cube (cont.)
Cube design: a star schema
Fact table: design the fact table; it should contain one calculated member. What are the measures and keys?
Dimension tables: Employee and Time are the two essential dimensions; include Survey and Question dimensions as well. For each dimension show a concept hierarchy.
State three questions that can be answered by that OLAP cube.
Show drill-down and roll-up operations related to these questions.

MIS Midterm 2008/2009 Spring: Shipment
1. (20 points) Consider a shipment company responsible for shipping items from one location to another on predetermined due dates. Design a star schema OLAP cube for this problem to be used by managers for decision-making purposes.
The dimensions are time, item to be shipped, person responsible for shipping the item, and location. For each of these dimensions determine three levels in the concept hierarchy.
Design the fact table with appropriate measures and keys (include two measures and at least one calculated member in the fact table).
Show one drill-down and one roll-up operation.
Show the SQL query of one of the cuboids.
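Several of the OLAP questions above ask for the total number of cuboids. A minimal sketch of the usual count follows, assuming each dimension has a single linear concept hierarchy: the total is the product over dimensions of (number of levels + 1), where the levels exclude the virtual ALL level. The function names and the toy hierarchy are hypothetical, introduced only for illustration; the level counts for the two warehouses are the ones stated in the questions.

```python
from itertools import product
from math import prod

def count_cuboids(levels_per_dimension):
    """Total cuboids = product over dimensions of (levels + 1).

    `levels` excludes the virtual ALL level; the +1 accounts for it.
    Assumes one linear concept hierarchy per dimension.
    """
    return prod(levels + 1 for levels in levels_per_dimension)

# Big University Library: 3 dimensions, 4 levels each, ALL not included.
library_total = count_cuboids([4, 4, 4])

# MIS warehouse: 4 dimensions, 5 levels each *including* ALL, so 4 without it.
mis_total = count_cuboids([5 - 1] * 4)

# Hypothetical toy cube, small enough to list every cuboid explicitly.
toy_hierarchies = {
    "time": ["day", "month", "year"],
    "item": ["item", "category"],
}

def enumerate_cuboids(hierarchies):
    """Return one tuple of levels per cuboid, with ALL added to each dimension."""
    options = [levels + ["ALL"] for levels in hierarchies.values()]
    return list(product(*options))

toy_cuboids = enumerate_cuboids(toy_hierarchies)
print(library_total, mis_total, len(toy_cuboids))  # 125 625 12
```

Each tuple in `toy_cuboids` names one group-by, for example `("month", "category")`, and `("ALL", "ALL")` is the apex cuboid. The formula does not apply directly when a dimension has a partial order of levels, as with week and month in the library time dimension.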
Outline: Clustering

Comparing clustering methods
Clustering methods: partitioning, hierarchical, density based, model based (probabilistic, EM)
Compare clustering methods by:
  Output
  Interpretation
  Sensitivity to outliers
  Speed of computation

Clustering
Construct simple data sets showing the inadequacies of k-means clustering (20 pnt): this algorithm is not suitable even for spherical clusters of different sizes.
What are the advantages and disadvantages of using k-means?

Clustering
1. Consider a delivery center location decision problem in a city, where a set of related products are to be delivered to markets located in the city. Design an algorithm for this location selection problem, extending an algorithm we cover in class. State clearly the algorithm and its extensions for this particular problem.

Clustering preferences
Consider a popular song competition. There are N competitors A1, A2, ..., AN. The number of voters is very large: a substantial fraction of the population of the country. Each voter is able to rank the competitors from best to worst, e.g. for voter 1 (A4 > A2 > A3 > A1), meaning that there are four competitors and A4 is the best for voter 1, A1 being the worst. Suppose preference data is available for a sample of n voters at the beginning of the competition.
Develop a distance measure between the preferences of two voters i and j.
Suppose you have the k-means algorithm available in a package. Describe how you can use the k-means algorithm to cluster voters according to their preferences.

MIS 542 Final 2005/2006 Spring
3. a) Describe how to modify the k-means algorithm so as to handle categorical variables (binary, ordinal, nominal).
b) What is a disadvantage of the agglomerative hierarchical clustering method in the case of large data?
Suggest a way of eliminating this disadvantage while keeping the advantages of agglomerative methods.

MIS 542 Midterm 2007/2008 Spring
Generate a data set of two continuous variables X and Y. Consider clustering based on density.
When clustered with one variable (either X or Y) there is one cluster.
When clustered with both variables there are two clusters.

MIS 542 Final 2011/2012 Fall
3.a (10 points) Generate data sets for two clustering problems with two continuous variables: there are two natural clusters under the notion of density-based clustering, but the quality of these clusters is low for a partitioning approach based on dissimilarity, such as k-means.
3.b (10 points) Consider the advantages and disadvantages of the partitioning and hierarchical agglomerative clustering approaches. Design a method for combining the two approaches to improve clustering quality. (In the end there are hierarchies of clusters.)

MIS Midterm 2011/2012 Fall
6. (25 points) A retail company is asked to segment its customers. The following variables are available for each customer: age, income, gender, number of children, occupation, house owner, has a car or not. There are 6 categories of goods sold by the company and total purchases from each category are available for each customer; in addition, average inter-purchase time is also included in the database.

MIS Midterm 2011/2012 Fall
a) What are the types and scales of these variables?
b) If your tool has only the k-means algorithm, which of these variables are more suitable for the segmentation problem?
c) What data transformations are to be applied?
d) How do you reduce the number of variables used in the analysis?
e) If you want to include categorical variables in your clustering, how would you treat them?

Midterm 2011/2012 Fall
In Questions 3-5 artificial data sets are generated for given situations.
3. (10 points) Consider a data set of two continuous variables X and Y.
There are two clusters (k=2). Considering the advantages and disadvantages of the partitioning clustering methods k-means and k-medoids, generate a two-dimensional data set that
a) (5 pnt) produces almost the same clusters by k-medoids and k-means
b) (5 pnt) produces different clusters by k-medoids and k-means

Outline: Classification
General
Decision trees
Neural networks
Bayesian
k-NN
Accuracy measures

Information gain
1. Consider a data set of two attributes A and B. A is continuous, whereas B is categorical, having two values, "y" and "n", which can be considered as the class of each observation. When attribute A is discretized into two equiwidth intervals no information is provided by the class attribute B, but when it is discretized into three equiwidth intervals there is perfect information provided by B. Construct a simple data set obeying these characteristics.

Decision tree
2. a) Construct a data set that generates the tree shown below. In addition the following conditions are satisfied:
Node 2: A=a1, decision is Y
Node 3: A=a2
Node 4: B=b1, decision is N
Node 5: B=b2, decision is Y

MIS 541 2012/2013 Final
1. (20 pts) Consider a decision tree with only two branches at each node, in which the attribute selection measure is entropy. Bearing in mind that each candidate input attribute may have more than two distinct values, how do you modify the ID3 algorithm to handle such a constraint on the number of branches of the tree?

MIS 542 Final 2005/2006 Spring
2. Given the training data set with missing values:

A(Size)  B(color)  C(shape)  Class
small    yellow    round     A
big      yellow    round     A
big      yellow    red       A
small    red       round     A
small    black     round     B
big      black     cube      B
big      yellow    cube      B
big      black     round     B
small    yellow    cube      B

MIS 542 Final 2005/2006 Spring (cont.)
a) Apply the C4.5 algorithm to construct a decision tree.
b) Given the new inputs X: size=small, color=missing, shape=round, and Y: size=big, color=yellow, shape=missing, what is the prediction of the tree for X and Y?
c) How do you classify the new data points given in part b) using Bayesian classification?
d) Analyse the possibility of pruning the tree. You can make a normal approximation to the binomial distribution even though the number of observations is low. The z value for the upper confidence limit of c=25% is 0.69.

MIS 542 Final S06: neural networks
4. Consider a classification problem with two classes, C1 and C2. There are two numerical input variables X1 and X2, taking values between 0 and infinity. All observations are of class C1 if they are above the X2 = 1/X1 curve (a hyperbola). All other observations are of class C2. Describe how multilayer perceptrons can separate such a boundary using as few hidden nodes as possible.

MIS 542 Midterm S08 2: classification
Consider a classification problem with two continuous variables X and Y and a categorical output with two distinct values C1 and C2. Generate data sets such that
A) Decision trees are appropriate for classification.
B) Decision trees are not appropriate for classification but a perceptron can classify the data successfully.
C) Even a single perceptron is not enough to classify the data.
D) How do you incorporate a perceptron into decision trees so that the cases in B and C can be classified by a hybrid approach of DTs and perceptrons?

Final 2010/2011 Spring
2. (30 pt.) Consider a prediction problem, e.g. predicting weight using height (a continuous variable) as input, solved by neural networks. Methods such as back propagation try to minimize the prediction error, but it is claimed that the magnitude of error depends on the weight: a prediction error of 0.5 for a baby with a short height should not be the same as for an adult with a height of 2.00 meters.
a) Make a scatter plot of such a hypothetical data set for a two-variable problem.
b) Plot the prediction error on another graph.
c) Do you need to modify the back propagation algorithm so as to handle such a situation? If so, explain your modification.

Final 2011/2012 Fall: overfitting
4. Illustrate the overfitting of neural networks for the following cases by generating data sets.
a) (10 points) For a binary classification problem with two continuous inputs.
b) (10 points) For a numerical prediction problem (output being continuous) with one continuous input variable.

Midterm 2011/2012 Fall
4. (10 points) Consider classification by a decision tree. Consider a categorical input variable A having two distinct values. The output variable B has two distinct classes as well. At a particular node of the tree there are N data objects. Generate a partitioning of the data by input variable A for the following cases:
a) A does not provide any information: it does not reduce entropy at all (zero information gain).
b) A provides perfect information: it reduces entropy as much as possible (maximum information gain).

MIS 541 2012/2013 Final
5. (20 pts) Consider a classification problem solved by k-NN. Suppose in your data set all inputs are continuous variables. Why do you need to apply any data transformations? What data transformation is applied? Suppose the variables are to be weighted after the transformations. Devise a method for determining optimal weights for the variables as well as determining the optimal k value, considering that k-NN is a supervised learning method.

MIS 541 2012/2013 Final
5. (20 pts) The following table consists of training data from an employee database. The predicted variable is Status. Age, Salary and Department are inputs. Design a multilayer feedforward neural network for the given data. Label the nodes in the input, hidden and output layers. Describe how you encode the input and output variables, and specify the parameters of the network that can be changed by the backpropagation algorithm.
Department  Status  Age    Salary
Sales       Senior  31-35  46K-50K
Sales       Junior  26-30  26K-30K
Sales       Junior  31-35  31K-35K
Systems     Junior  21-25  46K-50K
Systems     Senior  31-35  66K-70K
Systems     Junior  26-30  46K-50K
Systems     Senior  41-45  66K-70K
Marketing   Senior  36-40  46K-50K
Marketing   Junior  31-35  41K-45K
Secretary   Senior  46-50  36K-40K
Secretary   Junior  26-30  26K-30K

Accuracy measures
For class balance or imbalance problems
Output variables with ordinal scale
How do you modify the accuracy measure for an ordinal output variable with three different values? Give an example of such a variable.

BIS 541 2012/2013 Final II
5. Based on a sample of 30 observations, the population regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is estimated. The least squares estimate of the intercept is 10.0. The sums of the values of the dependent and independent variables are 450 and 150 respectively. The estimated variance of the dependent variable is 25; the variance of the residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not have a significant effect on Y at a confidence level of 95%. Critical value: $F_{0.05}(1, 28) = 4.20$.

BIS 541 2013/2014 Final
4. Based on a sample of 50 observations, the population regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is estimated to predict the number of automobile sales (dependent variable) from advertisement placements (independent variable). The least squares estimate of the slope is 2.0. The average of the values of the independent variable is 50. The sum of the values of the dependent variable is 5390.
The total sum of squares for the dependent variable is 9000. The variance of the residuals is 40.

BIS 541 2013/2014 Final
a) What is the least squares estimate of the intercept coefficient? Interpret the figure.
b) Interpret the slope coefficient.
c) What are the values of SSR and SSE?
d) Find and interpret the coefficient of determination.

MIS 214 Midterm 2012/2015 Summer
5. (20 pt) An analyst wants to estimate the dependence of the quantity demanded of a product (Y) on its price (X1) and the price of its substitute (X2) using linear regression, based on a large sample of data obtained from 50 weeks. Fill in the missing parts in the following regression outputs (from a to l). Do not report the entries marked "–", but you may need their values. Do not write on this table.

R-square: f
Adjusted R-square: g
Standard error of regression: h

            SS    d.f.  MS    F    p-value
Regression  a     c     d     e
Error       b     d     2.5
Total       400   e

MIS 214 Final 2013/2014 Spring
1. (20 pt) For the following four scenarios, each having two cases denoted by I and II, draw scatter plots of X (explanatory variable) and Y (dependent variable), showing the population regression model drawn as a line or curve as well. Use around 20-25 hypothetical points. Unless otherwise stated, the assumptions of least squares hold. In I and II the population slope and intercept are the same.
a) In II the variance of the error is higher than in I.
b) In II the coefficient of determination is higher than in I.
c) In II the spread of X is higher than in I.
d) In II the variance of the error term increases with higher values of X. In I, the variance of the error is homoscedastic.

Exercise
a) Suppose A -> B and B -> C are strong rules. Does this imply that A -> C is also a strong rule?
b) Suppose A -> C and B -> C are strong rules. Does this imply that A AND B -> C is also a strong rule?
c) Suppose A -> B and A -> C are strong rules. Does this imply that A -> B AND C is also a strong rule?
d) Suppose A -> B AND C is a strong rule. Does this imply that A -> B and A -> C are strong rules?
e) Suppose A AND B -> C is a strong rule. Does this imply that A -> C and B -> C are strong rules?

Exercise
a) Suppose {A,B,C} is a frequent 3-itemset. Does it imply that {A,B} and {A,C} are frequent 2-itemsets?
b) Suppose {A,B}, {A,C} and {B,C} are frequent 2-itemsets. Does it imply that {A,B,C} is a frequent 3-itemset?
c) Suppose {A,B} is a frequent 2-itemset. Does it imply that A -> B and B -> A are strong rules?

Associations
In a particular database, A -> C and B -> C are strong association rules based on the support-confidence measure. A and B are independent items. Does this imply that A BC is also a strong rule based on the lift measure?
A, B, C are items in a transaction database.
If A -> B and B -> C are strong, is A -> C a strong rule?
If A -> B and A -> C are strong, is B -> C a strong rule?

MIS 542 Midterm S06: association constraints
The price of each item in a store is nonnegative. For the following cases indicate the type of constraint (such as monotone, antimonotone, tough, strongly convertible or succinct):
a) Containing at least one Nintendo game.
b) The average price of items is between 100 and 500.

BIS 541 2012/2013 Final II
4. Questions about constraint-based association rule mining. The price of each item is nonnegative. For the following cases indicate the type of constraint (monotonic, anti-monotonic or none):
a) the sum of prices of items is less than or equal to 10
b) the average price of items is less than or equal to 20

MIS 214 Final 2013/2015 Spring
(15 pt) Given that L4: {(1,2,3,4), (2,4,5,6)}, where 1, 2, ..., 6 are IDs of items.
a) Write an L3 consisting of five 3-itemsets.
b) Write a C3 of seven 3-itemsets.

BIS 541 2011/2012 Final
1. For each of the following problems identify the relevant data mining tasks:
a) A weather analyst is interested in calculating the likely change in temperature for the coming days.
b) A marketing analyst is looking for groups of customers so as to apply different CRM strategies to each group.
c) A medical doctor must decide whether a set of symptoms is an indication of a particular disease.
d) An educational psychologist would like to determine exceptional students to suggest them for special educational programs.

BIS 541 2011/2012 Final
2. Develop a data warehouse for an insurance company using a fact constellation schema. The company holds insurance premiums paid by its customers for different types of policies as well as the payments to its customers in case of accidents. There are two fact tables, for premiums and payments respectively. The dimensions are customer, time, policy and accident; some are shared by the two fact tables.
a) Design the fact tables: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Show one roll-up and one drill-down operation.

BIS 541 2011/2012 Final
3. Consider a customer segmentation problem to be solved with the k-means algorithm. The following variables are available in the data set: gender, member card information, total spending in TL and education level.
a) What are the scales of these variables?
b) How would you transform the data before applying clustering?
c) How do you find the similarity/dissimilarity between two customers?

BIS 541 2011/2012 Final
4. Construct a particular node of a decision tree. There are 6 data points at that node.
The output is a categorical variable with two distinct values. Generate a data set of three variables, one being the output (Y) and the others inputs (X1 and X2), such that X1 reduces entropy as much as possible (maximum information gain) whereas X2 does not reduce entropy at all (zero information gain).

BIS 541 2011/2012 Final
1. Generate two different data sets of two continuous input variables X1 and X2 for a clustering problem:
a) one that would give almost the same clustering results when solved by k-means and k-medoids
b) one that would give different clusters when solved by k-means and k-medoids

BIS 541 2011/2012 Final
2. Develop a data warehouse for holding the academic performance of a university's faculty members. The dimensions are time (here the academic year is important, but the day of the publication is a bit too detailed), faculty member and paper. For an article published by a faculty member in a particular paper, the number of citations received and the impact factor of that paper are important. Papers can be journal articles or conference proceedings; journals can be in SCI or SSCI, and each such journal or conference has a prestige factor, a continuous variable.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) Describe in words five different types of queries that can be answered by the OLAP cube.
d) Show two roll-up and two drill-down operations.

BIS 541 2011/2012 Final
3. Generate data sets for a supervised learning problem solved by neural networks.
a) There are two continuous independent variables X1 and X2 and a class variable with two different values, such as yes and no. On the same artificially generated data set illustrate the concept of overfitting by neural networks.
b) Illustrate the behavior of training and test errors as the complexity of the network increases.

BIS 541 2011/2012 Final
4. Consider a classification problem to be solved by the k-NN method. The output is whether the customer will buy a product or not.
The inputs are income, age, education level of the customer, and profession of the customer (having here distinct values).
a) Describe the data transformations needed in the preprocessing step to prepare the data set to be classified by k-NN.
b) How do these data transformations differ from those needed to solve the same problem with neural networks?

BIS 541 2012/2013 Final II
1. For each of the following problems, identify the relevant data mining tasks with a brief explanation.
a) A weather analyst is interested in whether the temperature will go up or down on the coming day.
b) An insurance analyst intends to group policyholders according to the characteristics of customers and policies.
c) A medical researcher is looking for symptoms that occur together among a large set of patients.
d) An educational program director would like to determine the likely GPA of applicants to an MA program from their ALES scores, undergraduate GPAs, and entrance exam scores.

BIS 541 2012/2013 Final II
2. Develop a data warehouse, using the star schema, for a weather bureau that has many probes located all over a large region. These probes collect basic weather data such as temperature, air pressure, humidity, … every hour. All the data is sent to a central station to be processed.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by querying the warehouse.
d) Show one roll-up and one drill-down operation about one of these questions.

BIS 541 2012/2013 Final II
3. Evaluate the four classification methods (decision trees, neural networks, Bayesian classification, and k-NN) in terms of:
a) accuracy
b) speed of model development and use
c) understandability and interpretability of output
d) handling of outliers if not handled in the preprocessing step

BIS 541 2012/2013 Final II
4.
This question is about constraint-based association rule mining. The price of each item is nonnegative. For each of the following cases, indicate the type of constraint (monotonic, anti-monotonic, or none).
a) The sum of the prices of the items is less than or equal to 10.
b) The average price of the items is less than or equal to 20.

BIS 541 2012/2013 Final II
5. Based on a sample of 30 observations, the population regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is estimated. The least squares estimate of the intercept is 10.0. The sums of the values of the dependent and independent variables are 450 and 150, respectively. The estimated variance of the dependent variable is 25, and the variance of the residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not have a significant effect on Y at a confidence level of 95%. The critical value is $F_{0.05}(1, 28) = 4.20$.

BIS 541 2013/2014 Final
1. For each of the following problems, identify the relevant data mining tasks with a brief explanation.
a) A financial analyst is interested in whether the stock market index will go up or down on the coming day.
b) Cities in Turkey are grouped according to their voting characteristics after the election of the President of the Republic.
c) A security specialist is interested in determining whether mail messages are spam or not by looking at the words appearing in the messages.
d) A medical doctor is interested in which symptoms (binary variables) occur together for a specific type of cancer.

BIS 541 2013/2014 Final
2. Evaluate the four clustering methods (k-means, k-medoids, hierarchical, and model-based (probabilistic)) in terms of:
a) handling of non-spherical shapes
b) speed of model development
c) understandability and interpretability of output
d) sensitivity to outliers
For each of these aspects, mention only the remarkable methods (you need not mention all methods for all aspects).

BIS 541 2013/2014 Final
3.
Develop a data warehouse, using the star schema, for the election of the President of the Republic. There are many polling stations (sandık) located all over the country. Each polling station records the valid votes for each of the three candidates, the invalid votes, and the total number of voters. Each polling station has a set of location-related variables such as district, city, and some characteristics of cities. There is no time dimension in this version of the problem.
a) Design a warehouse with the star schema: the fact table with its keys, measures, and at least two calculated measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by querying the warehouse.
d) Show one roll-up and one drill-down operation about one of these questions.

BIS 541 2013/2014 Final
4. Based on a sample of 50 observations, the population regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is estimated to predict the number of automobile sales (dependent variable) from advertisement placements (independent variable). The least squares estimate of the slope is 2.0. The average of the values of the independent variable is 50. The sum of the values of the dependent variable is 5390. The total sum of squares for the dependent variable is 9000. The variance of the residuals is 40.
a) What is the least squares estimate of the intercept coefficient? Interpret the figure.
b) Interpret the slope coefficient.
c) What are the values of SSR and SSE?
d) Find and interpret the coefficient of determination.

Outline: Methodology - Overview, Introduction, Data Description - Preprocessing, OLAP, Clustering, Classification, Numerical Prediction - Regression, Frequent Pattern Mining, Recent BIS Exams, Unclassified Questions

5. (25 points) Consider a data set representing the interactions among a set of people.
The degree of interaction is a positive real number; a high value means that the two members are closely related (they have close interactions, such as heavy telephone call or mail traffic between them). In other words, rather than the coordinates of the variables, the similarity/dissimilarity matrix is given directly. This is a symmetric matrix. Develop an algorithm for clustering similar objects into the same clusters. Assume that the number of clusters (k) is given.

3. (25 points) Consider a data set of two continuous variables X and Y. X is right skewed and Y is left skewed. Both represent measures of the same quantity (sales categories, exam grades, …).
a) Draw typical distributions of X and Y separately.
b) Draw box plots of X and Y separately.
c) Draw q-plots (quantile plots) of X and Y separately.
d) Draw the q-q plot of X and Y.

4. (25 points) A strategy for clustering high-dimensional data of continuous variables is: first apply principal components to reduce the dimensionality of the data set, then apply clustering to the reduced form of the data. Discuss the drawback(s) of this approach.

MIS 541 2012/2013 Final
1. (20 pts) Consider a data set of two continuous variables X and Y. X and Y have the same mean, and both have no skewness (they are symmetric). X has a higher variance than Y. Both represent measures of the same quantity (sales categories, exam grades, …).
a) Draw typical distributions of X and Y on the same graph.
b) Draw box plots of X and Y separately.

MIS 541 2012/2013 Final
2. (20 pts) Illustrate, with plots of two continuous inputs and a binary class, that neural networks with one hidden layer are enough to classify convex class boundaries, and that two hidden layers are enough to capture even non-convex class boundaries.

MIS 541 2012/2013 Final
3. (20 pts) Consider association rules X → Y, where X is a categorical variable with more than two values and Y is originally continuous but discretized into categories. Give example variables for X and Y.
Illustrate that confidence as an interestingness measure may be misleading. Suggest a modification to the classical confidence so as to eliminate its drawback for this type of variable.

MIS 541 2012/2013 Final
4. (20 pts) The price of each item is nonnegative. For each of the following cases, indicate the type of constraint (monotone, anti-monotone, tough, strongly convertible, or succinct).
a) The sum of the prices of the items is less than or equal to 10.
b) The average price of the items is less than or equal to 20.

Midterm 2008/2009 Spring
2. (20 points) Consider a classification problem in which customers taking consumer credits from a bank are classified into three risk groups. The input variables are age (discretized into four groups), income (four groups), education (four groups), gender, the number of months the customer has been dealing with the bank, the average delay of payments in months, and the current value of the account balance. The output variable has three categories (risky, normal, or highly risky), calculated by some procedure and provided to the data miner.
Design an encoding schema for the input and output variables so that the problem can be solved by a neural network.
Show a typical topology of a feedforward network architecture.

Midterm 2008/2009 Spring
3. (20 points) Consider a classification problem to be solved by a decision tree. There are two categorical input variables A and B, each having two distinct values. The output variable C has two distinct classes. Suppose the dataset is suitable for using decision trees. Does the order in which the variables are selected affect the classification error? Support your answer by generating data sets pictorially. (The stopping condition is that either a pure class is obtained or no variables remain to be tested.)

Midterm 2008/2009 Spring
4. (20 points) Principal components analysis (PCA) is used for dimensionality reduction and may then be followed by cluster analysis (say, for segmentation purposes). Consider a problem with two continuous variables.
Using scatter plots:
a) Generate a data set where PCA reduces the dimensionality from two to one.
b) Generate a data set where, although there is a relation between the two variables, PCA is not able to reduce the dimensionality to one.
c) Generate a data set where there are natural clusters and PCA can reduce the dimensionality.
d) Generate a data set where there are natural clusters but PCA is not the appropriate method for reducing the dimensionality.
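
One possible way to explore the four cases numerically before sketching the scatter plots is shown below. This is only a sketch under stated assumptions, not a model answer: it assumes NumPy is available, and the data-generating parameters, the helper names (explained_variance_ratio, separation_on_first_component), and the ad hoc separation score are all hypothetical choices made for illustration.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of the total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    # Eigenvalues of the covariance matrix are the variances along the principal axes.
    eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    eigenvalues = eigenvalues[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

def separation_on_first_component(data, n_first):
    """Ad hoc score: gap between two cluster means after projecting onto the
    first principal component, divided by the pooled standard deviation.
    The first n_first rows are assumed to belong to cluster 1."""
    centered = data - data.mean(axis=0)
    _, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    projected = centered @ eigenvectors[:, -1]  # last column = largest eigenvalue
    first, second = projected[:n_first], projected[n_first:]
    pooled_std = np.sqrt((first.var() + second.var()) / 2.0)
    return abs(first.mean() - second.mean()) / pooled_std

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
n = 250

# Case a): y is almost a linear function of x, so one component suffices.
x = rng.uniform(-1.0, 1.0, size=2 * n)
linear = np.column_stack([x, 2.0 * x + rng.normal(0.0, 0.05, size=2 * n)])

# Case b): x and y are related (points on a circle), but not linearly.
angle = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
circle = np.column_stack([np.cos(angle), np.sin(angle)])

# Case c): two clusters separated along the direction of largest variance.
along = np.vstack([
    rng.normal([-5.0, 0.0], [1.0, 0.2], size=(n, 2)),
    rng.normal([5.0, 0.0], [1.0, 0.2], size=(n, 2)),
])

# Case d): two elongated clusters separated along the low-variance direction.
across = np.vstack([
    rng.normal([0.0, -1.0], [5.0, 0.2], size=(n, 2)),
    rng.normal([0.0, 1.0], [5.0, 0.2], size=(n, 2)),
])

ratio_linear = explained_variance_ratio(linear)
ratio_circle = explained_variance_ratio(circle)
sep_along = separation_on_first_component(along, n)
sep_across = separation_on_first_component(across, n)

print("a) variance shares:", ratio_linear)
print("b) variance shares:", ratio_circle)
print("c) separation after projection:", sep_along)
print("d) separation after projection:", sep_across)
```

In this sketch, a first-component share close to 1 suggests that one dimension is enough (case a), while roughly equal shares suggest it is not (case b). A large separation score means the clusters survive the projection (case c), and a score near zero means the projection merges them (case d).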