KDD and Data Mining
Instructor: Dragomir R. Radev
Winter 2005
The big problem
Billions of records
A small number of interesting patterns
“Data rich but information poor”
Data mining
Knowledge discovery
Knowledge extraction
Data/pattern analysis
Types of source data
Relational databases
Transactional databases
Web logs
Textual databases
Association rules
Example: 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
Association rules: X → Y
Association analysis
IF 20 < age < 30 AND 20K < INCOME < 30K
THEN Buys ("CD player")
SUPPORT = 2%, CONFIDENCE = 60%
Basic concepts
Minimum support threshold
Minimum confidence threshold
Itemsets
Occurrence frequency of an itemset
Association rule mining
Step 1: Find all frequent itemsets (sketched in code below)
Step 2: Generate strong association rules from the frequent itemsets
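A brute-force sketch of step 1 in Python. This is not the optimized Apriori algorithm (though the early exit relies on the same Apriori property), and the toy transactions are invented:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Return every itemset whose support (fraction of transactions
    containing it) is at least min_sup, with its occurrence count."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate).issubset(t))
            if count / len(transactions) >= min_sup:
                frequent[candidate] = count
                found_any = True
        # Apriori property: if no itemset of this size is frequent,
        # no larger itemset can be frequent either.
        if not found_any:
            break
    return frequent

# Invented toy data
transactions = [{"beer", "pasta"}, {"beer", "pasta", "sauce"},
                {"pasta"}, {"beer", "sauce"}]
print(frequent_itemsets(transactions, min_sup=0.5))
```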
Support and confidence
Support(X) = fraction of transactions that contain itemset X
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
Example
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Example (cont’d)
Frequent itemset l = {I1, I2, I5}
I1 AND I2 → I5, C = 2/4 = 50%
I1 AND I5 → I2, C = 2/2 = 100%
I2 AND I5 → I1, C = 2/2 = 100%
I1 → I2 AND I5, C = 2/6 = 33%
I2 → I1 AND I5, C = 2/7 = 29%
I5 → I1 AND I2, C = 2/2 = 100%
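These confidences can be checked mechanically. A minimal Python sketch, with the transactions copied from the table on the previous slide:

```python
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """confidence(lhs -> rhs) = support(lhs ∪ rhs) / support(lhs)"""
    return support_count(lhs | rhs) / support_count(lhs)

for lhs, rhs in [({"I1", "I2"}, {"I5"}), ({"I1", "I5"}, {"I2"}),
                 ({"I2", "I5"}, {"I1"}), ({"I1"}, {"I2", "I5"}),
                 ({"I2"}, {"I1", "I5"}), ({"I5"}, {"I1", "I2"})]:
    print(sorted(lhs), "->", sorted(rhs), f"{confidence(lhs, rhs):.0%}")
```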
Example 2
TID    Date       Items
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/19/99   {C, A, B, E}
T400   10/22/99   {B, A, D}
min_sup = 60%, min_conf = 80%
Correlations
Corr(A, B) = P(A, B) / (P(A) P(B))
If Corr < 1: A discourages B (negative correlation)
If Corr > 1: A encourages B (positive correlation); if Corr = 1, A and B are independent
(This is the lift of the association rule A → B.)
Contingency table
         game    ^game   Sum
video    4,000   3,500   7,500
^video   2,000   500     2,500
Sum      6,000   4,000   10,000
Example
P({game}) = 0.60
P({video}) = 0.75
P({game, video}) = 0.40
P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89
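The same computation in Python, with the counts taken from the contingency table above:

```python
n = 10_000
n_game, n_video, n_both = 6_000, 7_500, 4_000

# corr(game, video) = P(game, video) / (P(game) * P(video))
corr = (n_both / n) / ((n_game / n) * (n_video / n))
print(f"{corr:.2f}")  # 0.89 < 1, so games and videos are negatively correlated
```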
Example 2
              hotdogs   ^hotdogs   Sum
hamburgers    2,000     500        2,500
^hamburgers   1,000     1,500      2,500
Sum           3,000     2,000      5,000
Classification using decision trees
Expected information need:
I(s1, s2, …, sm) = - Σi pi log2(pi), with pi = si/s
s = number of data samples; m = number of classes
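A direct Python transcription of this formula (base-2 logarithm, as is conventional for information measured in bits):

```python
from math import log2

def expected_information(*counts):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)) with p_i = s_i / s."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(expected_information(9, 5), 3))  # 0.940, as in the worked example below
```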
RID   Age      Income   Student   Credit      Buys?
1     <= 30    High     No        Fair        No
2     <= 30    High     No        Excellent   No
3     31..40   High     No        Fair        Yes
4     > 40     Medium   No        Fair        Yes
5     > 40     Low      Yes       Fair        Yes
6     > 40     Low      Yes       Excellent   No
7     31..40   Low      Yes       Excellent   Yes
8     <= 30    Medium   No        Fair        No
9     <= 30    Low      Yes       Fair        Yes
10    > 40     Medium   Yes       Fair        Yes
11    <= 30    Medium   Yes       Excellent   Yes
12    31..40   Medium   No        Excellent   Yes
13    31..40   High     Yes       Fair        Yes
14    > 40     Medium   No        Excellent   No
Decision tree induction
I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Entropy and information gain
E(A) = Σj [(s1j + … + smj) / s] × I(s1j, …, smj)
Entropy = expected information based on the partitioning into subsets by attribute A
Gain(A) = I(s1, s2, …, sm) - E(A)
Entropy
Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971
Entropy (cont’d)
E(age) = 5/14 × I(s11, s21) + 4/14 × I(s12, s22) + 5/14 × I(s13, s23) = 0.694
Gain(age) = I(s1, s2) - E(age) = 0.246
Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
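These gains can be reproduced from the training table. A minimal sketch; the data is transcribed from the table above, and the tuple layout and attribute names are my own:

```python
from math import log2
from collections import Counter

# Rows: (age, income, student, credit, buys)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """Expected information I(s1, ..., sm) for a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = I(s1, ..., sm) - E(A) for the attribute at index attr."""
    expected = 0.0
    for value in {row[attr] for row in data}:
        subset = [row[-1] for row in data if row[attr] == value]
        expected += len(subset) / len(data) * info(subset)
    return info([row[-1] for row in data]) - expected

for i, name in enumerate(["age", "income", "student", "credit"]):
    print(f"Gain({name}) = {gain(i):.3f}")  # age 0.246, income 0.029, ...
```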
Final decision tree
age?
  <= 30 → student?
    no → no
    yes → yes
  31..40 → yes
  > 40 → credit?
    excellent → no
    fair → yes
Other techniques
Bayesian classifiers
X: age <= 30, income = medium, student = yes, credit = fair
P(yes) = 9/14 = 0.643
P(no) = 5/14 = 0.357
Example
P (age <= 30 | yes) = 2/9 = 0.222
P (age <= 30 | no) = 3/5 = 0.600
P (income = medium | yes) = 4/9 = 0.444
P (income = medium | no) = 2/5 = 0.400
P (student = yes | yes) = 6/9 = 0.667
P (student = yes | no) = 1/5 = 0.200
P (credit = fair | yes) = 6/9 = 0.667
P (credit = fair | no) = 2/5 = 0.400
Example (cont’d)
P (X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P (X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P (X | yes) P (yes) = 0.044 × 0.643 = 0.028
P (X | no) P (no) = 0.019 × 0.357 = 0.007
Answer: yes, since 0.028 > 0.007
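A quick arithmetic check of this classification, with the probabilities copied from the slides (just the final multiply-and-compare step, not a full Naive Bayes implementation):

```python
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667  # ≈ 0.044
p_x_given_no = 0.600 * 0.400 * 0.200 * 0.400   # ≈ 0.019
p_yes, p_no = 9 / 14, 5 / 14

score_yes = p_x_given_yes * p_yes  # ≈ 0.028
score_no = p_x_given_no * p_no     # ≈ 0.007
print("yes" if score_yes > score_no else "no")  # prints "yes"
```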
Predictive models
Inputs (e.g., medical history, age)
Output (e.g., will the patient experience any side effects?)
Some models are better than others
Principles of data mining
Training/test sets
Error analysis and overfitting
[Plot: training error vs. test error as a function of input size]
Cross-validation (see the sketch below)
Supervised vs. unsupervised methods
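A minimal sketch of k-fold cross-validation, assuming nothing about the learner beyond a train/predict interface; the majority-class learner and toy data are invented stand-ins:

```python
import random

def k_fold_accuracy(examples, k, train, predict):
    """Average accuracy over k rounds, each holding out one fold for testing."""
    examples = examples[:]
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        scores.append(sum(predict(model, x) == y for x, y in test) / len(test))
    return sum(scores) / k

# Invented stand-in: always predict the majority class, ignoring the input
data = [(i, "yes" if i % 3 else "no") for i in range(30)]
train = lambda rows: max(set(y for _, y in rows), key=[y for _, y in rows].count)
predict = lambda model, x: model
print(k_fold_accuracy(data, k=5, train=train, predict=predict))
```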
Representing data
Vector space
[Scatter plot: customers as points in (salary, credit) space, labeled "pay off" or "default"]
Decision surfaces
[Plot: a decision surface separating "pay off" from "default" points in (salary, credit) space]
Decision trees
[Plot: axis-parallel decision-tree boundaries separating "pay off" from "default" in (salary, credit) space]
Linear boundary
[Plot: a single linear boundary separating "pay off" from "default" in (salary, credit) space]
kNN models
Assign each element to the class most common among its k nearest neighbors
Demos:
– http://www2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
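A minimal kNN classification sketch (Euclidean distance, majority vote among the k nearest training points); the (salary, credit) toy data echoes the earlier plots but is invented:

```python
from collections import Counter
from math import dist

def knn_classify(point, training, k=3):
    """Label a point by majority vote among its k nearest training examples."""
    nearest = sorted(training, key=lambda ex: dist(point, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((60, 700), "pay off"), ((80, 720), "pay off"), ((75, 680), "pay off"),
            ((20, 520), "default"), ((30, 500), "default"), ((25, 560), "default")]
print(knn_classify((70, 690), training))  # prints "pay off"
```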
Other methods
Decision trees
Neural networks
Support vector machines
Demos
– http://www.cs.technion.ac.il/~rani/LocBoost/
arff files
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
rules.ZeroR
bayes.NaiveBayes
trees.j48.J48
lazy.IBk
trees.DecisionStump
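These can be run from the command line in the same style as the kMeans example on the next slide. For instance (the method names above follow an older Weka package layout; in recent Weka releases the J48 class is weka.classifiers.trees.J48):

java weka.classifiers.trees.J48 -t data/weather.arff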
kMeans clustering
http://www.cc.gatech.edu/~dellaert/html/software.html
java weka.clusterers.SimpleKMeans -t data/weather.arff
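For intuition, a minimal k-means (Lloyd's algorithm) sketch in Python, independent of Weka, on invented one-dimensional data:

```python
import random

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9, 8.1]
print(sorted(kmeans(points, k=2)))  # two centroids, near 1.0 and 8.1
```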
More useful pointers
http://www.kdnuggets.com/
http://www.twocrows.com/booklet.htm
More types of data mining
Classification and prediction
Cluster analysis
Outlier analysis
Evolution analysis