SI 654
Database Application Design
Winter 2003
Dragomir R. Radev
© 2002 by Prentice Hall
Data Mining
(continued)
arff files
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
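To make the file layout concrete, here is a minimal Python sketch of a reader for this subset of ARFF (only @relation, @attribute, and @data, with `real` and nominal attributes). The function name `parse_arff` and the shortened `SAMPLE` string are illustrative assumptions; this is not Weka's own loader and it ignores the rest of the ARFF format.

```python
def parse_arff(text):
    """Parse a minimal ARFF document: @relation, @attribute, @data only."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                # Numeric attributes become floats; nominal ones stay strings.
                row[name] = float(value) if kind == "real" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# First three rows of the weather file, used here only as a short sample.
SAMPLE = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))
```

The header declares each column once, so every data row can be read positionally against the attribute list.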
Predictive models
• Inputs (e.g., medical history, age)
• Output (e.g., will the patient experience any side effects)
• Some models are better than others
Operating curves
[Figure: operating curves (optimal, practical, random); one axis runs from failure to success, the other from most likely to least likely]
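One common reading of such a chart is a cumulative gains curve: cases are ranked from most likely to least likely, and the curve tracks the fraction of all successes found so far. The sketch below assumes that reading; the function name `gains_curve` and the scores and outcomes are hypothetical values made up for illustration.

```python
def gains_curve(scores, outcomes):
    """Fraction of all successes found as cases are taken from most to least likely."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total = sum(outcomes)
    found, curve = 0, []
    for _, outcome in ranked:
        found += outcome
        curve.append(found / total)
    return curve


# Hypothetical model scores and true outcomes (1 = success, 0 = failure).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]

practical = gains_curve(scores, outcomes)
# A perfect model would rank every success first.
optimal = gains_curve(outcomes, outcomes)
# A random ordering finds successes in proportion to the cases examined.
random_baseline = [(i + 1) / len(outcomes) for i in range(len(outcomes))]

print(practical)
print(optimal)
print(random_baseline)
```

A useful model produces a curve between the random diagonal and the optimal curve; the closer to optimal, the better the ranking.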
Principles of data mining
• Training/test sets
• Error analysis and overfitting
  [Figure: error versus input size, with curves for the training and test sets]
• Cross-validation
• Supervised vs. unsupervised methods
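As a rough illustration of training/test splits and cross-validation, the Python sketch below runs k-fold cross-validation with a majority-class predictor (the same idea as Weka's rules.ZeroR) on the play labels of the weather data. The function names are illustrative assumptions, and the folds are taken in file order for reproducibility; in practice the data would usually be shuffled first.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_error(labels, k):
    """Error rate of a majority-class predictor over all held-out items."""
    errors = 0
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        # Train only on the items outside the current fold.
        train = [y for i, y in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        # Test only on the held-out fold.
        errors += sum(1 for i in test_idx if labels[i] != majority)
    return errors / len(labels)


# The "play" column of the weather data, in file order.
PLAY = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

cv_error = cross_validated_error(PLAY, k=7)
print(round(cv_error, 3))
```

Every item is used for testing exactly once and never appears in the training set of its own fold, which is what keeps the error estimate honest.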
Representing data
• Vector space
[Figure: plot with axes salary and credit; points labeled pay off or default]
Decision surfaces
[Figure: plot with axes salary and credit; a decision surface separates pay off from default]
Decision trees
[Figure: plot with axes salary and credit; decision-tree splits separate pay off from default]
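The simplest decision tree is a single split on one attribute (a decision stump). The Python sketch below searches for the best threshold on the humidity column of the weather data from this deck; the function name `best_stump` is an illustrative assumption, and this is not the algorithm used by any particular Weka class.

```python
def best_stump(values, labels):
    """Find the threshold on one numeric attribute with the fewest training errors.

    Returns (errors, threshold, label_if_below_or_equal, label_if_above).
    """
    best = None
    candidates = sorted(set(values))
    # Try a threshold halfway between each pair of adjacent distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        for below, above in (("yes", "no"), ("no", "yes")):
            errors = sum(
                1 for v, y in zip(values, labels)
                if (below if v <= threshold else above) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best


# Humidity and play columns of the weather data, in file order.
HUMIDITY = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91]
PLAY = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

stump = best_stump(HUMIDITY, PLAY)
print(stump)
```

A full decision tree repeats this search recursively on each side of the split, over all attributes, which produces the axis-parallel regions shown in the figure.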
Linear boundary
[Figure: plot with axes salary and credit; a straight line separates pay off from default]
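One way to obtain a linear boundary is the perceptron rule, sketched below in Python. The perceptron is a swapped-in example technique, not necessarily the method behind the slide, and the (salary, credit) points are hypothetical values in arbitrary units chosen to be linearly separable.

```python
def train_perceptron(points, labels, epochs=20):
    """Learn weights w and bias b so that sign(w.x + b) matches labels (+1 or -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def predict(w, b, x):
    """Return +1 (pay off) or -1 (default) for one (salary, credit) point."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical (salary, credit) pairs; +1 = pay off, -1 = default.
POINTS = [(6, 1), (7, 2), (8, 1), (2, 5), (3, 6), (1, 4)]
LABELS = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(POINTS, LABELS)
print(w, b)
print([predict(w, b, x) for x in POINTS])
```

The learned boundary is the line where w.x + b = 0; points on one side are predicted to pay off and points on the other to default.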
kNN models
• Assign each element the majority class of its k nearest labeled neighbors
• Demos:
– http://www2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
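A minimal Python sketch of k-nearest-neighbor classification, using the temperature and humidity columns of the weather data from this deck as the feature vector. The function name `knn_predict`, the choice of Euclidean distance, and the query point (70, 72) are illustrative assumptions; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label the query with the majority class among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# ((temperature, humidity), play) for each row of the weather data.
TRAIN = [
    ((85, 85), "no"), ((80, 90), "no"), ((83, 86), "yes"), ((70, 96), "yes"),
    ((68, 80), "yes"), ((65, 70), "no"), ((64, 65), "yes"), ((72, 95), "no"),
    ((69, 70), "yes"), ((75, 80), "yes"), ((75, 70), "yes"), ((72, 90), "yes"),
    ((81, 75), "yes"), ((71, 91), "no"),
]

# Hypothetical new day: temperature 70, humidity 72.
prediction = knn_predict(TRAIN, (70, 72), k=3)
print(prediction)
```

There is no training step: the model is the stored data itself, which is why Weka files this method under "lazy" learners.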
Other methods
• Decision trees
• Neural networks
• Support vector machines
• Demos
– http://www.cs.technion.ac.il/~rani/LocBoost/
Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
rules.ZeroR
bayes.NaiveBayes
trees.j48.J48
lazy.IBk
trees.DecisionStump
kMeans clustering
• http://www.cs.mcgill.ca/~bonnef/project.html
• http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/
• http://www2.cs.cmu.edu/~dellaert/software/
• java weka.clusterers.SimpleKMeans -t data/weather.arff
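For intuition about what a k-means run does, here is a minimal Python sketch of the standard alternating procedure (Lloyd's algorithm) on the temperature and humidity columns of the weather data. It is not Weka's SimpleKMeans implementation; the function name `k_means`, the use of the first two rows as starting centroids, and the fixed iteration count are assumptions made to keep the example deterministic.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids.

    Returns the final centroids and the clusters from the last assignment step.
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# (temperature, humidity) for each row of the weather data; class labels are not used.
POINTS = [(85, 85), (80, 90), (83, 86), (70, 96), (68, 80), (65, 70), (64, 65),
          (72, 95), (69, 70), (75, 80), (75, 70), (72, 90), (81, 75), (71, 91)]

centroids, clusters = k_means(POINTS, centroids=[POINTS[0], POINTS[1]])
print(centroids)
print([len(c) for c in clusters])
```

Unlike the classifiers above, this is an unsupervised method: the play column plays no role in forming the clusters.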
More useful pointers
• http://www.kdnuggets.com/
• http://www.twocrows.com/booklet.htm