Download 2015-2016 advanced data mining mscda1

National College of Ireland

MSCDA – MSc in Data Analytics – Year 1 – MSCDADJANI
Post Graduate Diploma in Science in Data Analytics – Year 1 – PGDSPJANI

Examinations – 2015/16
Semester Two August/Repeat

13 August 2016
Saturday 10.00am – 1.00pm
______________________________________________________________________

Advanced Data Mining

Dr. Geraldine Gray
Dr. Simon Caton
Dr. Jer Hayes

Answer question 1 AND any 3 of the remaining 7 questions
Duration of exam: 3 hours
Attachments: none

Answer Q1 and any THREE other questions.

1. Foundations of Data Mining [25 Marks]

a) Explain what the 'bias-variance trade-off' is. Give two examples of how variance can be decreased when building models. [10 Marks]

b) What is the problem of overfitting? List some potential sources for this problem when building models. [8 Marks]

c) There are 3 commonly accepted data mining methodologies: KDD, CRISP-DM and SEMMA. Using 2 example contexts to justify your answer, discuss when you would use any 2 of these based upon their methodology stages/steps. [7 Marks]

2. Text Analytics [25 Marks]

You are a newly hired data scientist for 'Caton's Cracker Biscuits' in the aftermath of a new EU-wide product launch. Consider the problem of discovering whether a product has been receiving positive or negative feedback. Assume that users of the product have posted reviews to third party websites that cater for customers in the EU and specific EU countries. Assume also that there is no API that will provide the reviews directly and that you must collect the data directly from the pages of the third party websites.

- Describe how you would go about analysing these reviews.
- Note what tools and what external data (if any) you would use for this task.
- Note and evaluate any assumptions you make.
- Given the data you have collected, how would you build a model to classify future reviews?

(25 marks)

3. Genetic Algorithms [25 Marks]

(A) The Knapsack Problem is an example of a combinatorial optimization problem, which seeks to maximize the benefit of objects in a knapsack without exceeding its capacity. The problem can be stated like this: you have a collection of N objects of different weights, w1, w2, …, wn, and different values, v1, v2, …, vn, and a knapsack that can only hold a certain maximum combined weight W. You would like to get a set of objects of maximal value into the knapsack.

Demonstrate in principle how to use a genetic algorithm to solve this problem using the following data:

Name  Weight  Value
A     45      3
B     40      5
C     50      8
D     90      10

...and a knapsack that can support a maximum weight of 100.

In your answer cover the following:

- What is a genome?
- What is a cost function in this context?
- Outline the basic steps in applying a genetic algorithm to this problem.

(19 Marks)

(B) Describe how you could use a genetic algorithm approach to feature selection. (6 marks)

4. Data Mining Methods Comparison [25 Marks]

Your company needs to assess accounts data to determine if any suspicious or fraudulent activity has taken place. You have been charged with developing possible models to assess the accounts data. You have been provided with a sample of selected training data, but have not been told how this sample has been curated. You should assume that the data has not been cleaned and that there are missing values.
You have been provided with:

- account holder demographic details
- a historical csv of different account types, whether a customer has each account, their balances, monthly in- and out-goings, how long the account has been open, and a label with 2 values:
  1) fraudulent activity is suspected
  2) fraudulent activity not suspected

Evaluate the following possible methods of tackling this problem and mention specific algorithms in your answer where appropriate (15 marks):

a) ANN
c) Decision Tree
d) Association Rule Mining

You should evaluate each model using the following criteria / restrictions (10 marks):

- accusations of fraudulent activity are to be binned into sub categories of possible, likely, very likely
- any “very likely” accusations of fraudulent activity are highly reliable
- computation is not an issue, but you may receive and have to accommodate additional sources or volume of data at any time
- any predictions are highly interpretable

5. Clustering [25 Marks]

a. Given the data plotted below, discuss how a standard k-means clustering algorithm is likely to partition the data. Assume k=2. (10 marks)

b. What other clustering technique would you recommend for the data set in part (a)? Describe the steps involved in finding a cluster based on that technique. (8 marks)

c. With many different methods to form clusters, how can we evaluate the 'validity' of a cluster? In your answer note internal and external evaluation. (7 marks)

6. Support Vector Machines [25 Marks]

You have 3 independent problems to solve using Support Vector Machines:

1- a regression-style problem;
2- a multi-class classification problem: recognise images of handwritten numbers;
3- a binary classification problem on a wide, but sparse data set.

a) Which kernel(s) would you select for each of these problems, and why? [12 Marks]

b) How could you potentially speed up the creation of SVM models, and in which problem(s) listed above is this most likely to be an issue: discuss. [5 Marks]

c) Describe what the curse of dimensionality is and then, highlighting any assumptions you make and discussing their implications, how you would handle the curse of dimensionality for one of the problems listed above. [8 Marks]

7. Association Rules & Nearest Neighbours

Part (A)

(i) With respect to Association Rules, what are support and confidence? (4 marks)

(ii) Consider the following transaction database:

Transaction ID  Items
1               Tiling Cement; Tiles
2               Paint; White Spirit
3               Paint; Wallpaper; Plaster
4               Paint; Plaster; Tiling Cement; Tiles

What is the support, confidence and lift for:

a. (Paint => White Spirit)
b. (Paint & Plaster => Wallpaper)

(6 marks)

(iii) There are a number of algorithms that can be used for discovering association rules. List 2 and compare their relative strengths and weaknesses. (6 marks)

Part B

(i) “Lazy learning is not really learning at all”. Discuss this statement with respect to k-NN (nearest neighbour). (9 marks)
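
For reference on Q7 Part (A), the following is a minimal sketch of how support, confidence and lift are conventionally defined, applied to the four-transaction database given in the question. It is an illustration and not part of the exam paper or a model answer: the function names `support`, `confidence` and `lift` are hypothetical helpers, and the definitions assumed are the standard ones, with support as the fraction of transactions containing an itemset, confidence as support(X ∪ Y) / support(X), and lift as confidence(X => Y) / support(Y).

```python
from fractions import Fraction

# Transaction database as listed in Q7 Part (A)(ii).
TRANSACTIONS = [
    {"Tiling Cement", "Tiles"},
    {"Paint", "White Spirit"},
    {"Paint", "Wallpaper", "Plaster"},
    {"Paint", "Plaster", "Tiling Cement", "Tiles"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """support(X and Y together) / support(X).

    Assumes the antecedent occurs in at least one transaction;
    otherwise this raises ZeroDivisionError.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions=TRANSACTIONS):
    """confidence(X => Y) / support(Y); values above 1 suggest positive association."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


if __name__ == "__main__":
    rules = [
        ({"Paint"}, {"White Spirit"}),
        ({"Paint", "Plaster"}, {"Wallpaper"}),
    ]
    for lhs, rhs in rules:
        print(sorted(lhs), "=>", sorted(rhs),
              "support:", support(lhs | rhs),
              "confidence:", confidence(lhs, rhs),
              "lift:", lift(lhs, rhs))
```

Exact fractions are used instead of floats so that the ratios can be checked by hand against the transaction table.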
