National College of Ireland
MSCDA – MSc in Data Analytics –
Year 1 – MSCDADJANI
Post Graduate Diploma in Science in Data
Analytics – Year 1 – PGDSPJANI
Examinations – 2015/16
Semester Two August/Repeat
Saturday 13 August 2016
10.00am – 1.00pm
______________________________________________________________________
Advanced Data Mining
Dr. Geraldine Gray
Dr. Simon Caton
Dr. Jer Hayes
Answer question 1 AND any 3 of the remaining 7 questions
Duration of exam: 3 hours
Attachments: none
Answer Q1 and any THREE other questions.
1. Foundations of Data Mining [25 Marks]
a) Explain what the 'bias-variance trade-off' is. Give two examples of how variance can be
decreased when building models. [10 Marks]
b) What is the problem of overfitting? List some potential sources for this problem when
building models. [8 Marks]
c) There are 3 commonly accepted data mining methodologies: KDD, CRISP-DM and
SEMMA. Using 2 example contexts to justify your answer, discuss when you would use any
2 of these based upon their methodology stages/steps. [7 Marks]
2. Text Analytics [25 Marks]
You are a newly hired data scientist for 'Caton's Cracker Biscuits' in the aftermath of a new
EU-wide product launch. Consider the problem of discovering whether a product has been
receiving positive or negative feedback. Assume that users of the product have posted
reviews to third party websites that cater for customers in the EU and specific EU
countries. Assume also that there is no API that will provide the reviews directly and that
you must collect the data directly from the pages of the third party websites.

Describe how you go about analysing these reviews.

Note what tools and what external data (if any) you would use for this task.

Note and evaluate any assumptions you make.

Given the data you have collected how would you build a model to classify future
reviews?
(25 marks)
3. Genetic Algorithms [25 Marks]
(A) The Knapsack Problem is an example of a combinatorial optimization problem, which
seeks to maximize the benefit of objects in a knapsack without exceeding its capacity. The
problem can be stated like this: you have a collection of N objects of different weights, w1, w2,
…, wn, and different values, v1, v2, …, vn, and a knapsack that can only hold a certain
maximum combined weight W. You would like to get a set of objects of maximal value into the
knapsack.
Demonstrate in principle how to use a genetic algorithm to solve this problem using the
following data:
Name    Weight    Value
A       45        3
B       40        5
C       50        9
D       90        10
...and a knapsack that can support a maximum weight of 100.
In your answer cover the following:
- What is a genome.
- What is a cost function in this context.
- Outline the basic steps in applying a genetic algorithm to this problem. (19 Marks)
(B) Describe how you could use a genetic algorithm approach to feature selection. (6 marks)
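One way to picture what part (A) is asking for, as a minimal sketch rather than a model answer: the genome is taken to be a list of four bits (one per object, 1 meaning "packed"), and the cost function is written here as a fitness to maximise, namely the total value, with overweight genomes scored 0. The values for C and D (9 and 10) are read from a damaged table and are an assumption. The function names and all parameter settings (population size, mutation rate, seed) are also illustrative assumptions.

```python
import random

# Item data as (name, weight, value). The values for C and D are an
# assumption reconstructed from a damaged table.
ITEMS = [("A", 45, 3), ("B", 40, 5), ("C", 50, 9), ("D", 90, 10)]
CAPACITY = 100


def fitness(genome):
    """Total value of the selected items, or 0 if the knapsack is overloaded."""
    weight = sum(w for bit, (_, w, _) in zip(genome, ITEMS) if bit)
    value = sum(v for bit, (_, _, v) in zip(genome, ITEMS) if bit)
    return value if weight <= CAPACITY else 0


def tournament(population, rng, size=3):
    """Selection: return the fittest of `size` randomly drawn genomes."""
    return max(rng.sample(population, size), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover of two parent genomes."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutate(genome, rng, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]


def run_ga(pop_size=20, generations=30, seed=0):
    """Evolve a population of bit-list genomes and return (best genome, fitness)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in ITEMS] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = []
        for _ in range(pop_size - 1):
            parent_a = tournament(population, rng)
            parent_b = tournament(population, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = children + [best]  # elitism: carry the best genome forward
        best = max(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    genome, value = run_ga()
    chosen = [name for bit, (name, _, _) in zip(genome, ITEMS) if bit]
    print(chosen, value)
```

Tournament selection, single-point crossover and elitism are choices made for this sketch. Other selection or repair schemes, such as penalising overweight genomes in proportion to the excess, would fit the question equally well.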
4. Data Mining Methods Comparison [25 Marks]
Your company needs to assess accounts data to determine if any suspicious or fraudulent
activity has taken place. You have been charged with developing possible models to assess the
accounts data. You have been provided with a sample of selected training data, but have not
been told how this sample has been curated. You should assume that the data has not been
cleaned and that there are missing values. You have been provided with:
- account holder demographic details
- a historical csv of different account types, whether a customer has each account, their
  balances, monthly in- and out-goings, how long the account has been open, and a label with 2
  values:
  1) fraudulent activity is suspected
  2) fraudulent activity not suspected
Evaluate the following possible methods of tackling this problem and mention specific
algorithms in your answer where appropriate (15 marks):
a) ANN
c) Decision Tree
d) Association Rule Mining
You should evaluate each model using the following criteria / restrictions (10 marks):
- accusations of fraudulent activity are to be binned into sub categories of possible,
  likely, very likely
- any “very likely” accusations of fraudulent activity are highly reliable
- computation is not an issue, but you may receive and have to accommodate additional
  sources or volume of data at any time
- any predictions are highly interpretable
5. Clustering [25 Marks]
a. Given the data plotted below discuss how a standard k-means clustering algorithm is
likely to partition the data. Assume k=2. (10 marks)
b. What other clustering technique would you recommend for the data set in part (a)?
Describe the steps involved in finding a cluster based on that technique (8 marks)
c. With many different methods to form clusters, how can we evaluate the 'validity' of a
cluster? In your answer note internal and external evaluation. (7 marks).
6. Support Vector Machines [25 Marks]
You have 3 independent problems to solve using Support Vector Machines:
1- a regression-style problem;
2- a multi-class classification problem: recognise images of handwritten numbers;
3- a binary classification problem on a wide, but sparse data set
a) Which kernel(s) would you select for each of these problems, and why? [12 Marks]
b) How could you potentially speed up the creation of SVM models and in which
problem(s) listed above is this most likely to be an issue: discuss. [5 Marks]
c) Describe what the curse of dimensionality is and then, highlighting any
assumptions you make and discussing their implications, how you would handle the
curse of dimensionality for one of the problems listed above. [8 Marks]
Q7. Association Rules & Nearest Neighbours
Part (A)
(i) With respect to Association Rules what are support and confidence? (4 marks)
(ii) Consider the following transaction database:
Transaction ID    Items
1                 Tiling Cement; Tiles
2                 Paint; White Spirit
3                 Paint; Wallpaper; Plaster
4                 Paint; Plaster; Tiling Cement; Tiles
What is the support, confidence and lift for:
a. (Paint => White Spirit)
b. (Paint & Plaster => Wallpaper)
(6 marks)
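The three measures named in (ii) can be checked mechanically. The sketch below assumes the usual definitions: the support of an itemset is the fraction of transactions containing it, the confidence of X => Y is support(X and Y together) divided by support(X), and lift is that confidence divided by support(Y). The function names are illustrative, not taken from any particular library.

```python
# Transactions as listed in the question's table.
TRANSACTIONS = [
    {"Tiling Cement", "Tiles"},
    {"Paint", "White Spirit"},
    {"Paint", "Wallpaper", "Plaster"},
    {"Paint", "Plaster", "Tiling Cement", "Tiles"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support (1.0 means independence)."""
    return confidence(antecedent, consequent) / support(consequent)


if __name__ == "__main__":
    rules = [
        ({"Paint"}, {"White Spirit"}),
        ({"Paint", "Plaster"}, {"Wallpaper"}),
    ]
    for lhs, rhs in rules:
        print(sorted(lhs), "=>", sorted(rhs),
              support(lhs | rhs), confidence(lhs, rhs), lift(lhs, rhs))
```

Representing each transaction as a Python set keeps the containment test (`itemset <= t`) close to the textual definition of support.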
(iii) There are a number of algorithms that can be used for discovering association
rules. List 2 and compare their relative strengths and weaknesses. (6 marks)
Part B
(i) “Lazy learning is not really learning at all”. Discuss this statement with respect to k-NN
(nearest neighbour). (9 marks)
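As background to the statement in Part B, the sketch below shows why k-NN is called "lazy": the fitting step only stores the data, and all computation is deferred to prediction time. The class name `KNNClassifier`, the Euclidean distance and the majority vote are assumptions made for this illustration, and it is not tied to any specific library.

```python
import math
from collections import Counter


class KNNClassifier:
    """Minimal k-nearest-neighbour classifier (illustrative sketch)."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # "Training" only memorises the data; no parameters are estimated.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the work happens at query time: rank stored points by distance,
        # then take a majority vote among the k closest.
        ranked = sorted(
            zip(self.points, self.labels),
            key=lambda pair: math.dist(pair[0], query),
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups in the plane.
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["a", "a", "a", "b", "b", "b"]
    model = KNNClassifier(k=3).fit(points, labels)
    print(model.predict((0.5, 0.5)), model.predict((5.5, 5.5)))
```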