Data mining:
Knowledge Discovery in Databases
LAPP-Top Computer Science, Pre University College Leiden
Peter van der Putten (putten-at-liacs.nl), February 2005
Lab Session I
Assignment 1: Animal Trees
In this assignment we use a data set of animals and their attributes. Using a
decision tree classifier the computer learns to classify animals into different
categories (mammals, fish, reptiles etc).
1.1 The data set can be found here. Without using the data mining tool, draw a
decision tree three to five levels deep that classifies each animal as a
mammal, bird, reptile, fish, amphibian, insect or invertebrate.
1.2 Now we are going to let the computer discover a decision tree itself. First
download this zip file with data sets to your desktop and unzip it. Open the
zoo.arff data set in WEKA (choose start menu – weka – weka-3-4 –
Weka Explorer – Open file).
1.2.1 How many attributes are known of each animal?
1.2.2 How many animals are there in the data set?
1.3 Let us build some classifiers. Go to the Classify tab. We will use 66% of
the animals to build the models and the remaining 34% to evaluate the
quality of the model, so select percentage split – 66%. First we will
build a ‘naïve’ model that just predicts the most frequently occurring class in
the data set for each animal. This corresponds to a decision tree of depth 0.
Click start to build a model.
1.3.1 What % of animals is correctly classified?
1.3.2 Into what category are all these animals classified and why?
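To see what the percentage split and the ‘naïve’ model of 1.3 amount to, here is a minimal Python sketch. It is not WEKA code: the ten animals are made-up toy rows rather than the contents of zoo.arff, the function names are invented for this example, and the split simply keeps the rows in their listed order (WEKA may shuffle them first). In WEKA a baseline of this kind is available as the ZeroR rule.

```python
from collections import Counter

# Hypothetical toy data (name, class); these rows are NOT taken from zoo.arff.
animals = [
    ("dog", "mammal"), ("cat", "mammal"), ("cow", "mammal"),
    ("horse", "mammal"), ("sparrow", "bird"), ("eagle", "bird"),
    ("trout", "fish"), ("frog", "amphibian"), ("whale", "mammal"),
    ("duck", "bird"),
]

def percentage_split(rows, train_fraction):
    """Use the first part of the rows for training and the rest for testing."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def majority_class(rows):
    """Return the class label that occurs most often in the given rows."""
    counts = Counter(label for _, label in rows)
    return counts.most_common(1)[0][0]

def accuracy(rows, predicted_label):
    """Fraction of rows whose true class equals the single predicted label."""
    hits = sum(1 for _, label in rows if label == predicted_label)
    return hits / len(rows)

# 66% of the rows build the model, the remaining rows evaluate it.
train, test = percentage_split(animals, 0.66)
baseline = majority_class(train)
baseline_accuracy = accuracy(test, baseline)

print(f"train={len(train)} test={len(test)}")
print(f"always predict: {baseline}")
print(f"accuracy on test rows: {baseline_accuracy:.0%}")
```

Note that the model never looks at any attribute: whatever animal it is given, it answers with the class that was most common among the training rows.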
1.4 Now build a decision tree of depth 1 (a.k.a. a decision stump - select
choose – trees – decision stump). Draw the discovered decision
tree.
1.4.1 What % of animals is correctly classified?
1.4.2 Give an example of an animal that would not be classified correctly by this
model.
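A decision tree of depth 1 asks a single question about one attribute. The sketch below illustrates the idea in plain Python under stated assumptions: the seven animals and their attribute values are a toy table made up for this example, and the stump is chosen by simply counting training errors per attribute. WEKA's decision stump may select its split in a different way, so treat this as an illustration of the concept only.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows; attribute values are illustrative, not copied from zoo.arff.
rows = [
    {"name": "dog",      "milk": True,  "aquatic": False, "eggs": False, "type": "mammal"},
    {"name": "cat",      "milk": True,  "aquatic": False, "eggs": False, "type": "mammal"},
    {"name": "dolphin",  "milk": True,  "aquatic": True,  "eggs": False, "type": "mammal"},
    {"name": "platypus", "milk": True,  "aquatic": True,  "eggs": True,  "type": "mammal"},
    {"name": "sparrow",  "milk": False, "aquatic": False, "eggs": True,  "type": "bird"},
    {"name": "duck",     "milk": False, "aquatic": True,  "eggs": True,  "type": "bird"},
    {"name": "trout",    "milk": False, "aquatic": True,  "eggs": True,  "type": "fish"},
]

def build_stump(rows, attribute, target="type"):
    """Map each value of one attribute to the majority class seen for that value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attribute]][row[target]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}

def stump_accuracy(rows, attribute, stump, target="type"):
    """Fraction of rows for which the stump predicts the true class."""
    hits = sum(1 for row in rows if stump.get(row[attribute]) == row[target])
    return hits / len(rows)

def best_stump(rows, attributes, target="type"):
    """Try every attribute and keep the stump with the highest accuracy."""
    best = None
    for attribute in attributes:
        stump = build_stump(rows, attribute, target)
        score = stump_accuracy(rows, attribute, stump, target)
        if best is None or score > best[2]:
            best = (attribute, stump, score)
    return best

best_attribute, stump, score = best_stump(rows, ["milk", "aquatic", "eggs"])
misclassified = [row["name"] for row in rows
                 if stump[row[best_attribute]] != row["type"]]

print(f"split on: {best_attribute}, rule: {stump}")
print(f"accuracy on these rows: {score:.0%}, misclassified: {misclassified}")
```

Because a split on a two-valued attribute can only predict two classes, any animal belonging to a third class is necessarily misclassified on this toy table.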
1.5 Now build a decision tree of any depth (a.k.a. a J48 tree). Draw the
discovered decision tree.
1.5.1 What % of animals is correctly classified?
1.5.2 Give an example of an animal that would not be classified correctly by this
model.
Assignment 2: Animal Rules
In this exercise you will use the association rule algorithm to discover interesting
regularities in the zoo data set.
2.1 The association rule algorithm to be used can only cope with non-numerical
(‘nominal’) attributes, so you first have to transform the numerical attribute
‘legs’ into discrete bins (for example 0, 2, 4, >4 legs). This type of data
preprocessing can be performed in the preprocess tab by applying the right
filter (select Discretize or PKIDiscretize and then Apply). Check
the results before and after application of the filter. Now run the association
rule algorithm. You can change the numrules option to get more rules if
needed.
2.1.1 List at least three interesting rules.
2.1.2 Give at least one example of a rule that is always true according to the
algorithm (hint: see the confidence).
2.1.3 Give a counterexample for a specific rule (an example for
which the rule is not correct).
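The two ideas used in this assignment, binning a numerical attribute and judging a rule by its confidence, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the seven animals are toy rows rather than zoo.arff, the bin edges in legs_bin are chosen by hand (the WEKA filters compute their own bins), and rule_stats is an invented helper, not a WEKA function. The confidence of a rule is the fraction of rows matching its left-hand side that also match its right-hand side.

```python
# Hypothetical toy rows (not the real zoo.arff): legs is numeric, the rest nominal.
rows = [
    {"name": "dog",     "legs": 4, "milk": "yes", "aquatic": "no"},
    {"name": "cat",     "legs": 4, "milk": "yes", "aquatic": "no"},
    {"name": "dolphin", "legs": 0, "milk": "yes", "aquatic": "yes"},
    {"name": "sparrow", "legs": 2, "milk": "no",  "aquatic": "no"},
    {"name": "trout",   "legs": 0, "milk": "no",  "aquatic": "yes"},
    {"name": "crab",    "legs": 8, "milk": "no",  "aquatic": "yes"},
    {"name": "frog",    "legs": 4, "milk": "no",  "aquatic": "yes"},
]

def legs_bin(legs):
    """Turn a leg count into a nominal label; the bin edges are hand-picked."""
    if legs == 0:
        return "0"
    if legs <= 2:
        return "1-2"
    if legs <= 4:
        return "3-4"
    return ">4"

# Replace the numeric 'legs' value by its bin label in a copy of every row.
nominal_rows = [{**row, "legs": legs_bin(row["legs"])} for row in rows]

def matches(row, conditions):
    """True if the row has the required value for every attribute in conditions."""
    return all(row[key] == value for key, value in conditions.items())

def rule_stats(rows, antecedent, consequent):
    """Return support, confidence and counterexamples for 'antecedent => consequent'."""
    covered = [row for row in rows if matches(row, antecedent)]
    correct = [row for row in covered if matches(row, consequent)]
    support = len(correct) / len(rows)
    confidence = len(correct) / len(covered) if covered else 0.0
    counterexamples = [row["name"] for row in covered if not matches(row, consequent)]
    return support, confidence, counterexamples

# Rule A: legs=3-4 => milk=yes     Rule B: legs=0 => aquatic=yes
support_a, confidence_a, counter_a = rule_stats(nominal_rows, {"legs": "3-4"}, {"milk": "yes"})
support_b, confidence_b, counter_b = rule_stats(nominal_rows, {"legs": "0"}, {"aquatic": "yes"})

print(f"A: support={support_a:.2f} confidence={confidence_a:.2f} counterexamples={counter_a}")
print(f"B: support={support_b:.2f} confidence={confidence_b:.2f} counterexamples={counter_b}")
```

A rule with confidence 1 has no counterexamples in the data it was derived from, which is different from being true for every animal that exists.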
Assignment 3: Mine Yourself
At the beginning of this lab session you answered some questions about
yourselves. In this exercise we will mine this survey of all Lapp-toppers to
discover interesting, surprising and counterintuitive patterns in the data.
3.1 Build a decision tree to predict whether someone watches RTL Boulevard or
the Journaal. What is the predictive power of the model? What are important
distinguishing characteristics?
3.2 Build classifiers for a selection of the other attributes. For each attribute note
the classification accuracy and some distinguishing characteristics. Which
attribute is easiest to predict and which one is hardest to predict?
3.3 Use the association rules algorithm to derive interesting rules from this data
set. Pick the three rules that you find most interesting (funniest, most trivial,
most counterintuitive). We will discuss some of the patterns found with the group.
Lab Session II
Assignment 4: Recommenders
4.1 List the top recommendations for your favourite book(s), movie(s)
or music using two out of the following list of recommenders (or any other
recommender you know):
Amazon, BOL, Proxis, [email protected], Centrale Discotheek
Rotterdam, GNOD (Music, Books, Movies, People), Internet Movie
Database, Reel.com
Assignment 5: Data Mining Case Projects
The zip file from assignment 1 contains a number of data sets from a variety of
areas. Most data sets contain a small description in the header – to read this
open the file in a text editor like Notepad. This exercise should be done in pairs.
Pick a data set that looks interesting and write it on the blackboard so that we
don’t get two teams working on the same data set.
For your data set / data mining case note:
1. The practical problem that is being solved here
2. The goal of the classifier: what needs to be predicted
3. A high level description of the data: kind of attributes available, number of
attributes / instances etc.
4. Examples of interesting patterns found by just analyzing individual
attributes
5. The classification accuracy for each classifier type – a decision stump, a
decision tree and optionally another type of classifier
6. The patterns discovered by at least one of the classifiers
7. One or more interesting association rules
8. A suggestion of how such a prediction can be used in practice
Create a small PowerPoint presentation discussing your most interesting results.
One of you should act like the domain expert and present the beginning and the
end; the other one should act like the data mining expert and present the data
mining approach and results. The rest of the group will ask questions after the
presentation. The presentations should be short – no more than 5 minutes.
The presentations will be posted to this website.