Data Mining and Knowledge Discovery
Knowledge Discovery and Knowledge Management in e-Science
Petra Kralj
[email protected]
Practice, 2007/11/15
Practice plan
• 2007/11/8: Classification
  – Decision trees
  – Naïve Bayes classifier
  – Evaluating classifiers (confusion matrix, classification accuracy)
  – Predictive data mining in Weka
• 2007/11/15: Numeric prediction and descriptive data mining
  – Models for numeric prediction
  – Association rules
  – Regression models and evaluation in Weka
  – Descriptive data mining in Weka
  – Discussion about seminars and exam
• 2007/11/29: Written examination and seminar proposal presentations
Numeric prediction
Baseline, Linear Regression, Regression tree, Model tree, KNN
Regression vs. Classification
• Data: attribute-value description (both)
• Target variable: continuous (regression) vs. categorical/nominal (classification)
• Evaluation: cross validation, separate test set, … (both)
• Error: MSE, MAE, RMSE, … (regression) vs. 1 − accuracy (classification)
• Algorithms: linear regression, regression trees, … (regression) vs. decision trees, Naïve Bayes, … (classification)
• Baseline predictor: mean of the target variable (regression) vs. majority class (classification)
Example
• Data about 80 people: Age and Height
[Scatter plot: Height vs. Age]
Test set
[Table: the four test examples used on the later slides. Age: 2, 10, 35, 70; Height: 0.85, 1.4, 1.7, 1.6]
Baseline numeric predictor
• Average of the target variable
[Plot: Height vs. Age with the constant average predictor overlaid]
Baseline predictor: prediction
The average of the target variable is 1.63, so the baseline predicts 1.63 for every test example.
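As a minimal sketch, the baseline simply memorizes the training mean and returns it for every query; the function name and the `train_heights` argument are mine, standing in for the 80 training heights from the slides:

```python
# Minimal baseline regressor: predict the training-set mean of the target
# for every example, ignoring the input entirely.
def make_baseline(train_heights):
    mean = sum(train_heights) / len(train_heights)  # 1.63 on this dataset
    return lambda age: mean                         # same answer for any age
```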
Linear Regression Model
Height = 0.0056 * Age + 1.4181
[Plot: Height vs. Age with the fitted regression line overlaid]
Linear Regression: prediction
Height = 0.0056 * Age + 1.4181
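Plugging the four test ages into the fitted line reproduces the linear-regression predictions that appear in the comparison table below; a quick check:

```python
# Fitted model from the slide: Height = 0.0056 * Age + 1.4181
def predict_height(age):
    return 0.0056 * age + 1.4181

for age in (2, 10, 35, 70):                  # the test-set ages
    print(age, round(predict_height(age), 2))
# prints 1.43, 1.47, 1.61, 1.81 -- the linear regression column below
```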
Regression tree
[Plot: Height vs. Age with the regression tree's piecewise-constant predictions overlaid]
Regression tree: prediction
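The extracted slide shows only the tree's predictions (1.39, 1.46, 1.71, 1.71 for ages 2, 10, 35, 70), not the tree itself. A hypothetical tree consistent with those outputs, with illustrative split points that are not from the slides, might look like:

```python
# Hypothetical regression tree: split thresholds are made up for illustration;
# the leaf values are the predictions shown on the evaluation slides.
def tree_predict(age):
    if age <= 5:
        return 1.39   # leaf = mean height of the matching training examples
    elif age <= 14:
        return 1.46
    else:
        return 1.71   # ages 35 and 70 fall into the same leaf
```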
Model tree
[Plot: Height vs. Age with the model tree's piecewise-linear predictions overlaid]
Model tree: prediction
[Model tree predictions for the test set: 1.20, 1.47, 1.71, 1.75; see the comparison table below]
KNN – K nearest neighbors
• Looks at the K closest examples (by age) and predicts the average of their target variable
• K = 3
[Plot: Height vs. Age with the KNN (K = 3) predictions overlaid]
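A minimal sketch of the predictor just described, assuming the training data is a list of (age, height) pairs like those tabulated on the next slides. Note that ties in distance (several equally close ages) mean the exact neighbors chosen, and hence the prediction, can differ slightly between implementations:

```python
def knn_predict(age, examples, k=3):
    """Predict height as the mean height of the k training examples
    whose age is closest to the query age."""
    nearest = sorted(examples, key=lambda ex: abs(ex[0] - age))[:k]
    return sum(h for _, h in nearest) / k

# e.g. knn_predict(35, examples) averages the three heights whose
# ages are nearest to 35 in the tables below
```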
KNN prediction
Age   Height
1     0.90
1     0.99
2     1.01
3     1.03
3     1.07
5     1.19
5     1.17
KNN prediction
Age   Height
8     1.36
8     1.33
9     1.45
9     1.39
11    1.49
12    1.66
12    1.52
13    1.59
14    1.58
KNN prediction
Age   Height
30    1.57
30    1.88
31    1.71
34    1.55
37    1.65
37    1.80
38    1.60
39    1.69
39    1.80
KNN prediction
Age   Height
67    1.56
67    1.87
69    1.67
69    1.86
71    1.74
71    1.82
72    1.70
76    1.88
Which predictor is the best?
Age   Height   Baseline   Linear regression   Regression tree   Model tree   kNN
2     0.85     1.63       1.43                1.39              1.20         1.01
10    1.4      1.63       1.47                1.46              1.47         1.51
35    1.7      1.63       1.61                1.71              1.71         1.67
70    1.6      1.63       1.81                1.71              1.75         1.81
Evaluating numeric prediction
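The slides list the measures by name on the next two tables, but the formulas did not survive extraction. The standard definitions, with p_i the predicted value, a_i the actual value, ā the mean of the actual values, and n the number of test examples:

```latex
\begin{align*}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i| &
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2 &
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{RAE}  &= \frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar{a}|} &
\mathrm{RSE}  &= \frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2} &
\mathrm{RRSE} &= \sqrt{\mathrm{RSE}}
\end{align*}
```

The correlation coefficient measures how closely predicted and actual values co-vary (1 is perfect agreement, 0 is none).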
Age   Height   Baseline   pi − ai   Linear regression   pi − ai
2     0.85     1.63        0.78     1.43                 0.58
10    1.4      1.63        0.23     1.47                 0.07
35    1.7      1.63       -0.07     1.61                -0.09
70    1.6      1.63        0.03     1.81                 0.21

Error measures:
mean-squared error
root mean-squared error
mean absolute error
relative squared error
root relative squared error
relative absolute error
correlation coefficient
Age   Height   Regression tree   pi − ai   Model tree   pi − ai   kNN    pi − ai
2     0.85     1.39               0.54     1.20          0.35     1.01    0.16
10    1.4      1.46               0.06     1.47          0.07     1.51    0.11
35    1.7      1.71               0.01     1.71          0.01     1.67   -0.03
70    1.6      1.71               0.11     1.75          0.15     1.81    0.21
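As a sketch, computing two of these measures from the tables above (all values copied from the slides) shows kNN doing best on these four test examples:

```python
actual = [0.85, 1.4, 1.7, 1.6]
predictions = {
    "baseline":          [1.63, 1.63, 1.63, 1.63],
    "linear regression": [1.43, 1.47, 1.61, 1.81],
    "regression tree":   [1.39, 1.46, 1.71, 1.71],
    "model tree":        [1.20, 1.47, 1.71, 1.75],
    "kNN":               [1.01, 1.51, 1.67, 1.81],
}

for name, preds in predictions.items():
    errs = [p - a for p, a in zip(preds, actual)]          # pi - ai
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = (sum(e * e for e in errs) / len(errs)) ** 0.5
    print(f"{name}: MAE={mae:.3f}  RMSE={rmse:.3f}")
# e.g. baseline RMSE is about 0.41, kNN about 0.14 -- the lowest of the five
```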
Discussion
• Can KNN be used for classification tasks?
• Similarities between KNN and Naïve Bayes.
• Similarities and differences between decision trees and regression trees.
Association Rules
Association rules
• Rules X → Y, where X and Y are conjunctions of items
• Task: find all association rules that satisfy the minimum support and minimum confidence constraints
  – Support: sup(X → Y) = #XY / #D ≅ p(XY)
  – Confidence: conf(X → Y) = #XY / #X ≅ p(XY) / p(X) = p(Y|X)
(#XY is the number of transactions containing both X and Y; #D is the number of transactions in the database.)
Association rules – algorithm
1. Generate frequent itemsets with a minimum support constraint.
2. Generate rules from the frequent itemsets with a minimum confidence constraint.
* Data are in a transaction database.
Association rules – transaction database
Items: A=apple, B=banana, C=coca-cola, D=doughnut
• Client 1 bought: A, B, C, D
• Client 2 bought: B, C
• Client 3 bought: B, D
• Client 4 bought: A, C
• Client 5 bought: A, B, D
• Client 6 bought: A, B, C
Frequent itemsets
• Generate frequent itemsets with support at least 2/6
Frequent itemsets algorithm
Items in an itemset should be sorted alphabetically.
• Generate all 1-itemsets with the given minimum support.
• Use the 1-itemsets to generate 2-itemsets with the given minimum support.
• From 2-itemsets generate 3-itemsets with the given minimum support as unions of 2-itemsets with the same item at the beginning.
• …
• From n-itemsets generate (n+1)-itemsets as unions of n-itemsets with the same (n−1) items at the beginning.
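A minimal, runnable sketch of this level-wise procedure on the transaction database from the earlier slide (the function names are mine, not from the course). Keeping each itemset as an alphabetically sorted tuple makes the join step a simple prefix comparison:

```python
transactions = [
    {"A", "B", "C", "D"},  # client 1
    {"B", "C"},            # client 2
    {"B", "D"},            # client 3
    {"A", "C"},            # client 4
    {"A", "B", "D"},       # client 5
    {"A", "B", "C"},       # client 6
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def frequent_itemsets(min_count=2):
    """Level-wise generation; itemsets are alphabetically sorted tuples."""
    items = sorted({i for t in transactions for i in t})
    level = [(i,) for i in items if support_count((i,)) >= min_count]
    while level:
        for itemset in level:
            yield itemset, support_count(itemset)
        # join step: combine itemsets that share the same first n-1 items
        level = [a + (b[-1],)
                 for a in level for b in level
                 if a[:-1] == b[:-1] and a[-1] < b[-1]
                 and support_count(a + (b[-1],)) >= min_count]

for itemset, count in frequent_itemsets():
    print("&".join(itemset), f"{count}/6")
```

With minimum support 2/6 this prints A 4/6, B 5/6, C 4/6, D 3/6, then the five frequent pairs and the two frequent triples listed on the lattice slide below (C&D drops out with support 1/6).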
Frequent itemsets lattice
Frequent itemsets:
• A&B, A&C, A&D, B&C, B&D
• A&B&C, A&B&D
Rules from itemsets
• A&B is a frequent itemset with support 3/6
• Two possible rules:
  – A → B: confidence = #(A&B) / #A = 3/4
  – B → A: confidence = #(A&B) / #B = 3/5
• All the counts are in the itemset lattice!
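Continuing the frequent-itemsets sketch above, both confidences can be read off the same counts:

```python
# Reusing support_count from the frequent-itemsets sketch above
n_ab = support_count(("A", "B"))                              # 3
print("A->B confidence:", n_ab, "/", support_count(("A",)))   # 3 / 4
print("B->A confidence:", n_ab, "/", support_count(("B",)))   # 3 / 5
```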
Discussion
• Transformation of an attribute-value dataset to a transaction dataset.
• What would be the association rules for a dataset with two items A and B, each of them with support 80% and appearing in the same transactions as rarely as possible?
  – minSupport = 50%, min conf = 70%
  – minSupport = 20%, min conf = 70%
• What if we had 4 items: A, ¬A, B, ¬B?