CSE 711:
DATA MINING
Sargur N. Srihari
E-mail: [email protected]
Phone: 645-6164, ext. 113
CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
Input for Data Mining/Machine
Learning
• Concepts
• result of learning process
• intelligible
• operational
• Instances
• Attributes
Concept Learning
• Four styles of learning in data mining:
• classification learning (supervised)
• association learning (associations between features)
• clustering
• numeric prediction
Iris Data–Clustering Problem
No.   Sepal Length   Sepal Width   Petal Length   Petal Width
1     5.1            3.5           1.4            0.2
2     4.9            3.0           1.4            0.2
3     4.7            3.2           1.3            0.2
4     4.6            3.1           1.5            0.2
5     5.0            3.6           1.4            0.2
…
51    7.0            3.2           4.7            1.4
52    6.4            3.2           4.5            1.5
53    6.9            3.1           4.9            1.5
54    5.5            2.3           4.0            1.3
55    6.5            2.8           4.6            1.5
…
101   6.3            3.3           6.0            2.5
102   5.8            2.7           5.1            1.9
103   7.1            3.0           5.9            2.1
104   6.3            2.9           5.6            1.8
105   6.5            3.0           5.8            2.2
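Not from the slides: a minimal sketch of attacking this clustering task in Python with k-means, using scikit-learn's bundled copy of the 150-instance iris data (the library choice and parameters are illustrative assumptions).

# Hypothetical sketch: cluster the iris measurements with k-means.
# The class labels are ignored; clustering is unsupervised.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                 # 150 x 4 matrix of the four measurements
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5])            # cluster assignments of instances 1-5
print(kmeans.cluster_centers_)       # one 4-D centroid per cluster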
Weather Data–Numeric Class
Outlook    Temperature   Humidity   Windy   Play-time
sunny      85            85         false   5
sunny      80            90         true    0
overcast   83            86         false   55
rainy      70            96         false   40
rainy      68            80         false   65
rainy      65            70         true    45
overcast   64            65         true    60
sunny      72            95         false   0
sunny      69            70         false   70
rainy      75            80         false   45
sunny      75            70         true    50
overcast   72            90         true    55
overcast   81            75         false   75
rainy      71            91         true    10
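Not from the slides: a minimal numeric-prediction sketch for this table, fitting a linear model from the four attributes to play-time with pandas and scikit-learn (an illustrative choice of learner, not the method the slides prescribe).

# Hypothetical sketch: numeric prediction of play-time (minutes).
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                "overcast", "rainy"],
    "temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "humidity": [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy": [False, True, False, False, False, True, True, False, False,
              False, True, True, False, True],
    "play_time": [5, 0, 55, 40, 65, 45, 60, 0, 70, 45, 50, 55, 75, 10],
})
X = pd.get_dummies(df.drop(columns="play_time"))    # one-hot encode outlook
model = LinearRegression().fit(X, df["play_time"])
print(model.predict(X.iloc[[0]]))                   # prediction for instance 1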
Instances
• Input to a machine learning scheme is a set of instances
• A matrix of examples versus attributes is a flat file
• Input data as instances is common, but it is restrictive in representing relationships between objects
Family Tree Example
[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F); Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M); Pam = Ian, with children Anna (F) and Nikki (F).]
Two ways of expressing sister-of relation
(a) exhaustive list of pairs:

First Person   Second Person   Sister-of?
Peter          Peggy           no
Peter          Steven          no
…              …               …
Steven         Peter           no
Steven         Graham          no
Steven         Pam             yes
Steven         Grace           no
…              …               …
Ian            Pippa           yes
…              …               …
Anna           Nikki           yes
…              …               …
Nikki          Anna            yes

(b) positive examples plus a closed-world default:

First Person   Second Person   Sister-of?
Steven         Pam             yes
Graham         Pam             yes
Ian            Pippa           yes
Brian          Pippa           yes
Anna           Nikki           yes
Nikki          Anna            yes
All the rest                   no
Family Tree As Table
Name     Gender   Parent1   Parent2
Peter    male     ?         ?
Peggy    female   ?         ?
Steven   male     Peter     Peggy
Graham   male     Peter     Peggy
Pam      female   Peter     Peggy
Ian      male     Grace     Ray
Sister-of As Table
(combines the two tables above)

First Person                          Second Person                         Sister-of?
Name    Gender   Parent1   Parent2    Name    Gender   Parent1   Parent2
Steven  male     Peter     Peggy      Pam     female   Peter     Peggy      yes
Graham  male     Peter     Peggy      Pam     female   Peter     Peggy      yes
Ian     male     Grace     Ray        Pippa   female   Grace     Ray        yes
Brian   male     Grace     Ray        Pippa   female   Grace     Ray        yes
Anna    female   Pam       Ian        Nikki   female   Pam       Ian        yes
Nikki   female   Pam       Ian        Anna    female   Pam       Ian        yes
All the rest                                                                no
Rule for sister-of relation
If second person's gender = female
and first person's parent1 = second person's parent1
then sister-of = yes
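Not from the slides: a minimal Python reading of this rule over the flattened representation (the dictionary layout is an illustrative assumption).

# Hypothetical sketch: the sister-of rule on flattened instances.
def sister_of(first, second):
    # second person must be female and share a known parent1 with first
    return (second["gender"] == "female"
            and first["parent1"] is not None
            and first["parent1"] == second["parent1"])

steven = {"name": "Steven", "gender": "male", "parent1": "Peter", "parent2": "Peggy"}
pam = {"name": "Pam", "gender": "female", "parent1": "Peter", "parent2": "Peggy"}
print(sister_of(steven, pam))   # True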
Denormalization
• Relationships between different nodes of a tree are recast into a set of independent instances
• Two records are joined and made into one by a process of flattening
• Relationships among more than two objects would produce a combinatorially large number of instances
Denormalization can produce
spurious discoveries
• Supermarket database
• customers and products-bought relation
• products and suppliers relation
• suppliers and their addresses relation
• Denormalizing produces a flat file (sketched below)
• each instance has:
• customer, product, supplier, supplier address
• Database mining tool discovers:
• customers that buy beer also buy chips
• supplier address can be "discovered" from supplier, a regularity built in by the join!
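Not from the slides: a minimal sketch of that denormalizing join in Python with pandas (the tables and column names are illustrative).

# Hypothetical sketch: three relations joined into one flat file.
import pandas as pd

purchases = pd.DataFrame({"customer": ["c1", "c1", "c2"],
                          "product": ["beer", "chips", "beer"]})
suppliers = pd.DataFrame({"product": ["beer", "chips"],
                          "supplier": ["s1", "s2"]})
addresses = pd.DataFrame({"supplier": ["s1", "s2"],
                          "address": ["addr1", "addr2"]})

flat = purchases.merge(suppliers, on="product").merge(addresses, on="supplier")
print(flat)   # supplier -> address repeats in every row, ripe for "rediscovery"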
Relations need not be finite
• The relation ancestor-of involves arbitrarily long paths through the tree
• Inductive logic programming learns rules such as:
If person-1 is a parent of person-2
then person-1 is an ancestor of person-2
If person-1 is a parent of person-2
and person-2 is an ancestor of person-3
then person-1 is an ancestor of person-3
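Not from the slides: the two ancestor-of rules as a recursive Python function (the parent map holds a few illustrative entries from the family tree).

# Hypothetical sketch: ancestor-of as recursion over a parent map.
parent_of = {"Pam": ["Peter", "Peggy"], "Nikki": ["Pam", "Ian"]}

def ancestor_of(p1, p2):
    parents = parent_of.get(p2, [])
    if p1 in parents:                  # base case: p1 is a parent of p2
        return True
    # recursive case: p1 is an ancestor of one of p2's parents
    return any(ancestor_of(p1, p) for p in parents)

print(ancestor_of("Peter", "Nikki"))   # True: Peter -> Pam -> Nikki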
Inductive Logic Programming can learn
recursive rules from a set of relation instances
First Person                        Second Person                        Ancestor-of?
Name   Gender   Parent1   Parent2   Name    Gender   Parent1   Parent2
Peter  male     ?         ?         Steven  male     Peter     Peggy      yes
Peter  male     ?         ?         Pam     female   Peter     Peggy      yes
Peter  male     ?         ?         Anna    female   Pam       Ian        yes
Peter  male     ?         ?         Nikki   female   Pam       Ian        yes
Pam    female   Peter     Peggy     Nikki   female   Pam       Ian        yes
Grace  female   ?         ?         Ian     male     Grace     Ray        yes
Grace  female   ?         ?         Nikki   female   Pam       Ian        yes
Other examples here                                                       yes
All the rest                                                              no
Drawbacks of such techniques: they do not cope with noisy data and are so slow as to be unusable on realistic problems; not covered in the text.
Summary of Data-mining Input
• Input is a table of independent instances of the concept to be learned (file mining!)
• Relational data is more complex than a flat file
• A finite set of relations can be recast into a single table
• Denormalization can result in spurious discoveries
Attributes
• Each instance is characterized by a set of predefined features, e.g., the iris data
• different instances may have different features
• if instances are transportation vehicles:
• number of wheels is useful for land vehicles but not for ships
• number of masts is applicable to ships but not to land vehicles
• one feature may depend on the value of another
• e.g., spouse's name depends on married/unmarried
• use an "irrelevant value" flag
Attribute Values
• Nominal
• outlook = sunny, overcast, rainy
• Ordinal
• temperature = hot, mild, cool
• hot > mild > cool
• Interval
• ordered and measured in fixed units, e.g., temperature in °F
• differences are meaningful, but sums are not
• Ratio
• measurement inherently defines a zero point, e.g., distance between points
• real numbers; all mathematical operations apply
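Not from the slides: a small Python illustration of which operations each attribute type supports (the encodings are illustrative assumptions).

# Hypothetical sketch: legal operations per attribute type.
rank = {"cool": 0, "mild": 1, "hot": 2}   # ordinal: hot > mild > cool
print(rank["hot"] > rank["mild"])         # True: comparisons are meaningful

f1, f2 = 80.0, 40.0                       # interval: temperature in deg F
print(f1 - f2)                            # differences are meaningful,
# but 80 F is not "twice as hot" as 40 F: no true zero, so ratios are not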
Preparing the Input
• Denormalization
• Integrate data from different sources
• marketing study: sales dept., billing dept., service dept.
• Each source may have varying conventions, errors, etc.
• Enterprise-wide database integration is data warehousing
ARFF File for Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
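Not from the slides: loading this file in Python, assuming it is saved as weather.arff; SciPy's ARFF reader is an illustrative choice (Weka itself consumes the format directly).

# Hypothetical sketch: parsing the ARFF file with SciPy.
from scipy.io import arff

data, meta = arff.loadarff("weather.arff")
print(meta)      # attribute names and types parsed from the header
print(data[0])   # first instance; nominal values come back as bytes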
Simple Disjunction
[Figure: decision tree encoding the simple disjunction (a and b) or (c and d): yes/no tests on a, b, c, d, with the subtree testing c and d repeated on more than one branch.]
Exclusive-Or Problem
[Figure: left, the four points (x, y) in {0, 1}² labeled a or b; right, a decision tree testing x = 1? and then y = 1? on each branch.]

If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 0 and y = 0 then class = b
If x = 1 and y = 1 then class = b
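Not from the slides: the four rules collapse to a one-line test, class a when exactly one attribute is 1.

# Hypothetical sketch: the four exclusive-or rules as one predicate.
def xor_class(x, y):
    return "a" if x != y else "b"   # a when exactly one of x, y is 1

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_class(x, y))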
Replicated Subtree
If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
Otherwise class = b

[Figure: the corresponding decision tree. Attributes x, y, z, w take values 1, 2, 3; the subtree testing z and then w is replicated under every branch that does not already yield class a.]
New Iris Flower
Sepal Length   Sepal Width   Petal Length   Petal Width   Type
5.1            3.5           2.6            0.2           ?
Rules for Iris Data
Default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
       and petal-width < 1.75
       then Iris-versicolor
       except if petal-length ≥ 4.95 and petal-width < 1.55
              then Iris-virginica
       else if sepal-length < 4.95 and sepal-width ≥ 2.45
              then Iris-virginica
else if petal-length ≥ 3.35
       then Iris-virginica
       except if petal-length < 4.85 and sepal-length < 5.95
              then Iris-versicolor
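Not from the slides: one possible Python reading of these nested exception rules (the nesting interpretation is an assumption); applied to the new flower above, it yields Iris-versicolor.

# Hypothetical sketch: the ripple-down iris rules as nested conditionals.
def classify(sl, sw, pl, pw):
    if 2.45 <= pl < 5.355 and pw < 1.75:
        # exceptions to the versicolor rule
        if pl >= 4.95 and pw < 1.55:
            return "Iris-virginica"
        if sl < 4.95 and sw >= 2.45:
            return "Iris-virginica"
        return "Iris-versicolor"
    if pl >= 3.35:
        # exception to the virginica rule
        if pl < 4.85 and sl < 5.95:
            return "Iris-versicolor"
        return "Iris-virginica"
    return "Iris-setosa"   # default rule

print(classify(5.1, 3.5, 2.6, 0.2))   # Iris-versicolor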
The Shapes Problem
[Figure: assorted shapes; shaded shapes are standing, unshaded shapes are lying.]
Training Data for Shapes Problem

Width   Height   Sides   Class
2       4        4       standing
3       6        4       standing
4       3        4       lying
7       8        3       standing
7       6        3       lying
2       9        4       standing
9       1        4       lying
10      2        3       lying
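Not from the slides: the deck does not state the learned rule, but one simple hypothesis consistent with all eight instances is "standing when height > width"; a quick check in Python:

# Hypothetical sketch: test the height > width hypothesis on the table.
data = [(2, 4, 4, "standing"), (3, 6, 4, "standing"), (4, 3, 4, "lying"),
        (7, 8, 3, "standing"), (7, 6, 3, "lying"), (2, 9, 4, "standing"),
        (9, 1, 4, "lying"), (10, 2, 3, "lying")]

def predict(width, height, sides):
    return "standing" if height > width else "lying"

print(all(predict(w, h, s) == c for w, h, s, c in data))   # True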
CPU Performance Data
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX
      + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

(a) linear regression

[Figure (b): regression tree. Internal nodes test CHMIN, CACH, MMAX, MMIN, CHMAX, and MYCT against thresholds (e.g., CHMIN ≤ 7.5 vs. > 7.5); each leaf predicts a PRP value, annotated with its instance count and error, e.g., 64.6 (24/19.2%), 19.3 (28/8.7%), …, 783 (5/359%).]
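Not from the slides: model (a) as a Python function (the example attribute values are illustrative, not drawn from the CPU dataset).

# Hypothetical sketch: the linear regression model (a) as a function.
def prp(myct, mmin, mmax, cach, chmin, chmax):
    return (-56.1 + 0.049 * myct + 0.015 * mmin + 0.006 * mmax
            + 0.630 * cach - 0.270 * chmin + 1.46 * chmax)

print(prp(myct=480, mmin=512, mmax=8000, cach=32, chmin=0, chmax=0))  # ~43.3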
CPU Performance Data
[Figure (c): model tree. Internal nodes test CHMIN, CACH, and MMAX against thresholds; each leaf holds a linear model LM1–LM6, annotated with its instance count and error, e.g., LM1 (65/7.32%).]

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.012 MMIN
LM4: PRP = 10.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Partitioning Instance Space
Ways to Represent Clusters