Name _____________________ Grade: ____+____+____+____+____+____+____=____

BIS 541 Data Mining and Knowledge Acquisition
Midterm/Final Questions

Solve each question on a different side of a page. Show all your work. No exchange of calculators.

1. Given the data set:

Income   Age     Class
low      young   Y
high     young   Y
high     young   Y
low      mid     Y
low      old     N
high     old     N
high     young   N
high     mid     N
low      young   N

where Y stands for yes and N for no.
a) Given the new input X: income = low, age = young, how do you classify the new data point using Bayesian classification?
b) Describe a feedforward neural network structure for this classification problem, indicating how the input and output variables are encoded, the hidden nodes, and the transfer functions in the hidden nodes.

2. A retail company is aiming at performing a segmentation study of its customers.
a) Suppose the tool has a k-means algorithm for the segmentation purpose. What are the advantages and disadvantages of having this algorithm at hand? What data preprocessing actions are required?
b) Suppose the number of data points is too large. If a hierarchical clustering is required by the management, describe how to combine the k-means and hierarchical algorithms so as to eliminate the disadvantages of both methods.

3. Construct a data set that generates the tree shown below by the ID3 algorithm. Here N is the number of data points.

[Tree figure]
Node 2: A = a1, decision Yes, N = 4
Node 3: A = a2, N = 4
Node 4: B = b1, decision No, N = 2
Node 5: B = b2, decision Yes, N = 2

2. (35 points) Consider a classification problem in which customers taking consumer credits from a bank are classified into three risk groups. The input variables are age (discretized into 4 groups), income (4 groups), education (4 groups), gender, number of months the customer has been dealing with the bank, average delay of payments in months, and current value of the account balance. The output variable has 3 categories (risky, normal or highly risky), calculated by some procedure and provided to the data miner.
Design an encoding schema for the input and output variables so that the problem can be solved by a neural network. Show a typical topology of a feedforward network architecture.

3. (35 points) Consider a shipment company responsible for shipping items from one location to another on predetermined due dates. Design a star schema OLAP cube for this problem to be used by managers for decision making purposes. The dimensions are time, item to be shipped, person responsible for shipping the item, and location. For each of these dimensions determine three levels in the concept hierarchy. Design the fact table with appropriate measures and keys (include two measures and at least one calculated member in the fact table). Show one drill-down and one roll-up operation.

1. A data warehouse is constructed for the library of a university to be used as a multipurpose DSS. Suppose this warehouse consists of the following dimensions: user, books, and time (time_ID, year, quarter, month, week, academic year, semester, day). “Week” is considered not to be less than “month”. Each academic semester starts and ends at the beginning and end of a week respectively. Hence, week < semester.
a. Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. Describe at least two meaningful measures in the fact table. Each dimension can be looked at its ALL level as well.
b. Describe three meaningful OLAP queries and write SQL expressions for one of them.

2. An e-commerce company is aiming at performing a segmentation study of its visitors so as to force visitors to stay and make orders. There is a concept hierarchy for products that can be followed by visitors. At the end of the hierarchy there are products that can be looked at or ordered. Each session may end up with an order. Suppose the tool has a k-means algorithm for the segmentation purpose.
What are the advantages and disadvantages of having this algorithm at hand? What data preprocessing actions are required (missing value handling, data transformations)?

3. A churn model is to be developed for the customers of a telecommunication company, aiming at predicting whether a customer will be a churner in the next month or not. Customers voluntarily leave the company, and 1% of the customers churn in each month. Relevant data is available in three tables: a customer table holding personal information about customers, a billing table holding payment characteristics of customers, and a calling table holding a summary of calling patterns.
a) Propose two variables from each table that you think are important in explaining churn.
b) What are the characteristics of these variables?
c) What are the reasons for missing values, and how do you handle them?
d) What are possible data inconsistencies?
e) Do you make any discretization?
f) Do you make any data transformation? Do you apply any data reduction strategies?
g) Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer.
h) Which functionality of data mining is appropriate for this problem, and which algorithm would you suggest? What are the advantages and disadvantages of the algorithm you propose?
a. Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer.

1. (Han, page 328, problem 7.6) The following table consists of training data from an employee database. For a given row entry, count represents the number of data tuples having the values for department, status, age and salary given in that row.
The predicted variable is status; age, salary and department are the inputs.

Department   Status   Age     Salary
Sales        Senior   31-35   46K-50K
Sales        Junior   26-30   26K-30K
Sales        Junior   31-35   31K-35K
Systems      Junior   21-25   46K-50K
Systems      Senior   31-35   66K-70K
Systems      Junior   26-30   46K-50K
Systems      Senior   41-45   66K-70K
Marketing    Senior   36-40   46K-50K
Marketing    Junior   31-35   41K-45K
Secretary    Senior   46-50   36K-40K
Secretary    Junior   26-30   26K-30K

a) Solve this classification problem using the ID3 algorithm, where status is the output variable.
b) Design a multilayer feedforward neural network for the given data. Label the nodes in the input and output layers. Describe how you encode the input and output variables, and specify the parameters of the network that can be changed.

2. For each of the following problems, identify the relevant data mining tasks.
a) A stock market analyst is asked to calculate the likely change in stock price for a set of companies with similar price/earnings ratios.
b) A political strategist is seeking the best groups to canvass for donations in a particular country.
c) A defense computer must immediately decide whether a blip on the radar is a flock of geese or an incoming nuclear missile.
d) A homeland security official would like to determine whether a certain sequence of financial or residential moves implies a tendency for terrorist acts.

3. A retail company is asked to segment its customers. The following variables are available for each customer: age, income, gender, number of children, occupation, house owner or not, has a car or not. There are 6 categories of goods sold by the company, and total purchases from each category are available for each customer; in addition, the average inter-purchase time is also included in the database.
a) What are the types of these variables?
b) If your tool has only the k-means algorithm, which of these variables are more suitable for the segmentation problem?
c) What data transformations are applied?
d) How do you reduce the number of variables used in the analysis?
e) How do you determine the number of customer segments?
f) How do you measure similarity between occupation and gender?
g) If you want to include categorical variables in your clustering, how would you treat them?

1. (20 pts) For each of the following problems, identify the relevant data mining tasks with a brief explanation.
a) A weather analyst is interested in whether the temperature will be up or down for the coming day.
b) An insurance analyst intends to group policy holders according to characteristics of customers and policies.
c) A medical researcher is looking for symptoms that occur together among a large set of patients.
d) An educational program director would like to determine the likely GPA of applicants to an MA program from their ALES scores, undergraduate GPAs and entrance exam scores.

2. (20 pts) Develop a data warehouse, using a star schema, for a weather bureau that has many probes located all over a large region. These probes collect basic weather data such as temperature, air pressure, humidity, ... at each hour. All the data is sent to a central station to be processed.
a) Design the fact table: keys and measures.
b) Design the dimension tables and their concept hierarchies.
c) State two questions that can be answered by querying the warehouse.
d) Show one roll-up and one drill-down operation about one of these questions.

3. (20 pts) Evaluate the four classification methods (decision trees, neural networks, Bayesian classification and k-NN) in terms of
a) accuracy
b) speed of model development and use
c) understandability and interpretability of output
d) handling of outliers if not handled in the preprocessing step

4. (20 pts) These questions are about constraint-based association rule mining. The price of each item is nonnegative. For the following cases indicate the type of constraint (monotonic, anti-monotonic or none).
a) The sum of the prices of items is less than or equal to 10.
b) The average price of items is less than or equal to 20.

5. (20 pts) Based on a sample of 30 observations, the population regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ is estimated. The least squares estimate of the intercept is 10.0. The sums of the values of the dependent and independent variables are 450 and 150 respectively. The estimated variance of the dependent variable is 25, and the variance of the residuals is 4.
a) What is the least squares estimate of the slope coefficient? Interpret the figure.
b) What are the values of SSR and SSE?
c) Find and interpret the coefficient of determination.
d) Test the null hypothesis that the explanatory variable X does not have a significant effect on Y at a confidence level of 95%. Critical value: $F_{0.05}(1, 28) = 4.20$.
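
The arithmetic behind the regression question can be sketched as follows. This is only an illustration, not an official solution. It assumes that the "estimated variance of the dependent variable" is the sample variance SST / (n - 1) and that the "variance of the residuals" is the residual mean square SSE / (n - 2); the question does not state which divisors were used. The function name `regression_summary` is hypothetical.

```python
def regression_summary(n, b0, sum_y, sum_x, var_y, var_resid):
    """Recover slope, sums of squares, R^2 and F for a simple linear regression."""
    y_bar = sum_y / n
    x_bar = sum_x / n
    # The fitted line passes through (x_bar, y_bar): b0 = y_bar - b1 * x_bar
    b1 = (y_bar - b0) / x_bar
    # Assumption: var_y is the sample variance of Y, i.e. SST / (n - 1)
    sst = var_y * (n - 1)
    # Assumption: var_resid is the residual mean square, i.e. SSE / (n - 2)
    sse = var_resid * (n - 2)
    ssr = sst - sse
    r2 = ssr / sst
    # F statistic with 1 and n - 2 degrees of freedom
    f_stat = (ssr / 1) / (sse / (n - 2))
    return {"b1": b1, "sst": sst, "sse": sse, "ssr": ssr, "r2": r2, "f_stat": f_stat}


# Values taken from the question text
result = regression_summary(
    n=30, b0=10.0, sum_y=450.0, sum_x=150.0, var_y=25.0, var_resid=4.0
)
print(result)
```

Under these assumptions the slope is 1.0, SSE is 112, SSR is 613, the coefficient of determination is about 0.85, and the F statistic is about 153, which is well above the critical value of 4.20. A different divisor convention for either variance would change the sums of squares accordingly.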