Download 11 - CLAIR

KDD and Data Mining
Instructor: Dragomir R. Radev
Winter 2005
The big problem
• Billions of records
• A small number of interesting patterns
• “Data rich but information poor”
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 9/2
Data mining
• Knowledge discovery
• Knowledge extraction
• Data/pattern analysis
Types of source data
• Relational databases
• Transactional databases
• Web logs
• Textual databases
Association rules
• 65% of all customers who buy beer and tomato sauce also buy pasta and chicken wings
• Association rules: X → Y
Association analysis
• IF 20 < age < 30 AND 20K < income < 30K
• THEN
– Buys (“CD player”)
• SUPPORT = 2%, CONFIDENCE = 60%
Basic concepts
• Minimum support threshold
• Minimum confidence threshold
• Itemsets
• Occurrence frequency of an itemset
Association rule mining
• Find all frequent itemsets
• Generate strong association rules from the frequent itemsets
Support and confidence
• Support (X)
• Confidence (X → Y) = Support (X+Y) / Support (X)
Example
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Example (cont’d)
• Frequent itemset l = {I1, I2, I5}
• I1 AND I2 → I5: C = 2/4 = 50%
• I1 AND I5 → I2
• I2 AND I5 → I1
• I1 → I2 AND I5
• I2 → I1 AND I5
• I5 → I1 AND I2
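The confidence of each candidate rule can be checked directly against the transaction table. The sketch below is only an illustration and not part of the original slides: the helper names `support_count` and `confidence` are made up here, and Support(X+Y) is read as the number of transactions that contain every item of X and of Y.

```python
from itertools import combinations

# Transactions T100..T900 from the example table.
transactions = [
    {"I1", "I2", "I5"},
    {"I2", "I4"},
    {"I2", "I3"},
    {"I1", "I2", "I4"},
    {"I1", "I3"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    """Confidence(lhs -> rhs) = Support(lhs + rhs) / Support(lhs)."""
    return support_count(lhs | rhs) / support_count(lhs)

frequent = frozenset({"I1", "I2", "I5"})
rules = {}
# Every non-empty proper subset of the itemset can serve as a left-hand side.
for size in range(1, len(frequent)):
    for lhs in combinations(sorted(frequent), size):
        lhs = frozenset(lhs)
        rhs = frequent - lhs
        rules[(lhs, rhs)] = confidence(lhs, rhs)

for (lhs, rhs), conf in rules.items():
    print(" AND ".join(sorted(lhs)), "->", " AND ".join(sorted(rhs)), f"{conf:.0%}")
```

With this table, the three rules that have I5 on the left-hand side reach 100% confidence, while I1 AND I2 → I5 reaches 50%, the value given on the slide.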
Example 2
TID    date       items
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/19/99   {C, A, B, E}
T400   10/22/99   {B, A, D}
min_sup = 60%, min_conf = 80%
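One way to work this example is to enumerate every itemset by brute force, which is practical here only because there are six distinct items. The sketch below is an illustration under that assumption and is not the Apriori algorithm; all function and variable names are invented for this example.

```python
from itertools import combinations

# Transactions T100..T400 from the table (dates are not needed for mining).
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
MIN_SUP = 0.60   # minimum support, as a fraction of all transactions
MIN_CONF = 0.80  # minimum confidence

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))

# Step 1: find all frequent itemsets (brute force over every subset of items).
frequent = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if support(frozenset(c)) >= MIN_SUP
]

# Step 2: generate strong rules lhs -> rhs from each frequent itemset.
strong_rules = []
for itemset in frequent:
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = support(itemset) / support(lhs)
            if conf >= MIN_CONF:
                strong_rules.append((sorted(lhs), sorted(itemset - lhs), conf))

for lhs, rhs, conf in strong_rules:
    print(lhs, "->", rhs, f"confidence={conf:.2f}")
```

On this data the frequent itemsets are exactly the non-empty subsets of {A, B, D}. A rule such as D → A AND B passes the confidence threshold, while A → D (75%) does not.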
Correlations
• Corr (A, B) = P (A AND B) / (P (A) P (B))
• If Corr < 1: A discourages B (negative correlation)
• (lift of the association rule A → B)
Contingency table
          Game     ^Game    Sum
Video     4,000    3,500    7,500
^Video    2,000      500    2,500
Sum       6,000    4,000    10,000
Example
• P({game}) = 0.60
• P({video}) = 0.75
• P({game, video}) = 0.40
• P({game, video}) / (P({game}) × P({video})) = 0.40 / (0.60 × 0.75) = 0.89
Example 2
              hotdogs   ^hotdogs   Sum
hamburgers    2000      500        2500
^hamburgers   1000      1500       2500
Sum           3000      2000       5000
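The correlation (lift) for both contingency tables can be computed from the cell counts alone. The following is a minimal sketch; the function name `lift` and its parameter names are chosen for this illustration and do not come from the slides.

```python
def lift(both, a_total, b_total, n):
    """Corr(A, B) = P(A AND B) / (P(A) * P(B)), estimated from counts."""
    p_both = both / n
    p_a = a_total / n
    p_b = b_total / n
    return p_both / (p_a * p_b)

# Game / video table: 4,000 buy both, 6,000 buy games, 7,500 buy videos.
game_video = lift(both=4000, a_total=6000, b_total=7500, n=10000)

# Hotdogs / hamburgers table: 2,000 buy both, 3,000 hotdogs, 2,500 hamburgers.
hotdog_hamburger = lift(both=2000, a_total=3000, b_total=2500, n=5000)

print(f"game/video: {game_video:.2f}")
print(f"hotdogs/hamburgers: {hotdog_hamburger:.2f}")
```

A value below 1, as for game and video, indicates negative correlation. A value above 1, as for hotdogs and hamburgers, indicates that the two items are bought together more often than independence would predict.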
Classification using decision trees
• Expected information need
• $I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)$, where $p_i = s_i / s$
• s = number of data samples
• m = number of classes
RID   Age        Income   Student   Credit      Buys?
1     <= 30      High     No        Fair        No
2     <= 30      High     No        Excellent   No
3     31 .. 40   High     No        Fair        Yes
4     > 40       Medium   No        Fair        Yes
5     > 40       Low      Yes       Fair        Yes
6     > 40       Low      Yes       Excellent   No
7     31 .. 40   Low      Yes       Excellent   Yes
8     <= 30      Medium   No        Fair        No
9     <= 30      Low      Yes       Fair        Yes
10    > 40       Medium   Yes       Fair        Yes
11    <= 30      Medium   Yes       Excellent   Yes
12    31 .. 40   Medium   No        Excellent   Yes
13    31 .. 40   High     Yes       Fair        Yes
14    > 40       Medium   No        Excellent   No
Decision tree induction
• $I(s_1, s_2) = I(9, 5) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$
Entropy and information gain
• $E(A) = \sum_{j} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$
Entropy = expected information based on the partitioning into subsets by A
$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
Entropy
• Age <= 30: $s_{11} = 2$, $s_{21} = 3$, $I(s_{11}, s_{21}) = 0.971$
• Age in 31 .. 40: $s_{12} = 4$, $s_{22} = 0$, $I(s_{12}, s_{22}) = 0$
• Age > 40: $s_{13} = 3$, $s_{23} = 2$, $I(s_{13}, s_{23}) = 0.971$
Entropy (cont’d)
• $E(\text{age}) = \frac{5}{14} I(s_{11}, s_{21}) + \frac{4}{14} I(s_{12}, s_{22}) + \frac{5}{14} I(s_{13}, s_{23}) = 0.694$
• $\mathrm{Gain}(\text{age}) = I(s_1, s_2) - E(\text{age}) = 0.246$
• Gain (income) = 0.029, Gain (student) = 0.151, Gain (credit) = 0.048
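The gain values can be recomputed from the 14-row training table. The sketch below is an illustration, not the instructor's code: the tuple encoding of the rows and the function names `info`, `expected_info`, and `gain` are assumptions made for this example, and logarithms are taken base 2.

```python
from collections import Counter
from math import log2

# Training data from the table: (age, income, student, credit, buys).
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = {"age": 0, "income": 1, "student": 2, "credit": 3}
CLASS = 4  # column index of the class label (buys?)

def info(labels):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)) over the class counts."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def expected_info(column):
    """E(A): information of each subset induced by A, weighted by subset size."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[column], []).append(row[CLASS])
    return sum(len(s) / len(rows) * info(s) for s in subsets.values())

def gain(column):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info([row[CLASS] for row in rows]) - expected_info(column)

for name, column in ATTRIBUTES.items():
    print(f"Gain({name}) = {gain(column):.3f}")
```

The results agree with the slide's figures up to rounding in the third decimal place, and age has the largest gain, which is why it is chosen as the root of the tree.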
Final decision tree
[Figure: decision tree]
age
  <= 30: student
    no: no
    yes: yes
  31 .. 40: yes
  > 40: credit
    excellent: no
    fair: yes
Other techniques
• Bayesian classifiers
• X: age <= 30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357
Example
• P (age <= 30 | yes) = 2/9 = 0.222
• P (age <= 30 | no) = 3/5 = 0.600
• P (income = medium | yes) = 4/9 = 0.444
• P (income = medium | no) = 2/5 = 0.400
• P (student = yes | yes) = 6/9 = 0.667
• P (student = yes | no) = 1/5 = 0.200
• P (credit = fair | yes) = 6/9 = 0.667
• P (credit = fair | no) = 2/5 = 0.400
Example (cont’d)
• P (X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P (X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P (X | yes) P (yes) = 0.044 × 0.643 = 0.028
• P (X | no) P (no) = 0.019 × 0.357 = 0.007
• Answer: yes/no?
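The Bayesian classifier example can be reproduced from the 14-row training table. This is a minimal sketch, assuming the attributes are conditionally independent given the class and using raw frequency estimates with no smoothing; the row encoding and the function name `score` are invented for this illustration.

```python
# Training data from the table: (age, income, student, credit, buys).
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def score(x, label):
    """P(X | label) * P(label), with attributes treated as independent given the class."""
    in_class = [row for row in rows if row[-1] == label]
    result = len(in_class) / len(rows)          # prior P(label)
    for i, value in enumerate(x):
        matches = sum(1 for row in in_class if row[i] == value)
        result *= matches / len(in_class)       # P(attribute_i = value | label)
    return result

# X: age <= 30, income = medium, student = yes, credit = fair
x = ("<=30", "medium", "yes", "fair")
scores = {label: score(x, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The score for "yes" is roughly four times the score for "no", so this classifier predicts that X buys.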
Predictive models
• Inputs (e.g., medical history, age)
• Output (e.g., will patient experience any side effects)
• Some models are better than others
Principles of data mining
• Training/test sets
• Error analysis and overfitting
• Cross-validation
• Supervised vs. unsupervised methods
[Figure: error plotted against input size, with separate curves for the test and training sets]
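Cross-validation splits the data into k folds and uses each fold once as the test set while training on the rest. The sketch below shows only the index bookkeeping; the function name `k_fold_splits` is hypothetical, the sizes 14 and 4 are arbitrary choices, and a real run would normally shuffle the samples first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_splits(n_samples=14, k=4):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold, so averaging the test error over the folds uses every sample for evaluation without ever testing on data the model was trained on.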
Representing data
• Vector space
[Figure: pay off and default cases plotted on credit and salary axes]
Decision surfaces
[Figure: decision surfaces over the pay off and default cases; axes: credit, salary]
Decision trees
[Figure: decision tree boundaries over the pay off and default cases; axes: credit, salary]
Linear boundary
[Figure: linear boundary over the pay off and default cases; axes: credit, salary]
kNN models
• Assign each element to the closest cluster
• Demos:
– http://www2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
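kNN is usually described as a vote among the k nearest training points, which is one reading of the slide's "closest cluster" wording. The sketch below follows that reading; the training points, the salary and credit scales, and the function name `knn_predict` are all made up for illustration, and the features are left unscaled for brevity.

```python
from collections import Counter
from math import dist

# Hypothetical training points: (salary in $1000s, credit score) -> outcome.
training = [
    ((30, 580), "default"),
    ((35, 600), "default"),
    ((40, 560), "default"),
    ((70, 700), "pay off"),
    ((80, 720), "pay off"),
    ((90, 690), "pay off"),
]

def knn_predict(point, k=3):
    """Return the majority label among the k training points closest to point."""
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((75, 710)))
print(knn_predict((32, 590)))
```

In practice the features would usually be rescaled to comparable ranges first, because Euclidean distance is otherwise dominated by whichever feature has the largest numeric spread.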
Other methods
• Decision trees
• Neural networks
• Support vector machines
• Demos
– http://www.cs.technion.ac.il/~rani/LocBoost/
arff files
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
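The layout of an arff file (a relation name, attribute declarations, then comma-separated data rows) is simple enough to read with a few lines of code. The sketch below is a toy reader written for this illustration, not Weka's own parser; the function name `parse_arff` is hypothetical, and it ignores quoted values, missing values, and the sparse format.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Header and first three data rows of the weather file.
sample = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
"""
relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes)
print(rows)
```

All values come back as strings; converting the `real` columns to numbers is left out to keep the sketch short.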
Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
rules.ZeroR
bayes.NaiveBayes
trees.j48.J48
lazy.IBk
trees.DecisionStump
kMeans clustering
• http://www.cc.gatech.edu/~dellaert/html/software.html
• java weka.clusterers.SimpleKMeans -t data/weather.arff
More useful pointers
• http://www.kdnuggets.com/
• http://www.twocrows.com/booklet.htm
More types of data mining
• Classification and prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis