DM.Lab in University of Seoul
The iDA Dataset : DeerHunter
Wildlife Management
Data Mining Laboratory
May 22nd, 2008
by Sungjick Lee
The iDA Dataset : DeerHunter (1/3)
Information about 6059 individual deer hunters

Attributes
Attribute   Description
wtdeer      Survey weighting variable; should not be used
state       U.S. state, excluding DC, coded as a number from 1 to 51
urban       Size of the place of residence:
              a big city or urban area (=3)
              a small city or town (=2)
              a rural area
race        Race: white = 1, other races = 0
retire      Employed = 0, retired = 1, in school = 2, homemaker = 3, other = 4
employ      Employed = 1, not employed = 0
educ        Years taken to complete schooling
The iDA Dataset : DeerHunter (2/3)
Attributes (cont.)
Attribute   Description
married     Married = 1, not married = 0
income      Income, coded by category midpoint as follows:
              5,000: under $10,000
              15,000: between $10,000 and $19,900
              22,500: between $20,000 and $24,900
              27,500: between $25,000 and $29,900
              40,000: between $30,000 and $49,900
              62,500: between $50,000 and $74,900
              85,000: over $75,000, 10000
gender      Female = 1, male = 0
age         Respondent's age
huntexp     ?
agehunt     Age at first deer hunt
trips       Number of deer trips taken in 1991
bagdeer     "Did you bag a deer in 1991?" Yes = 1, No = 0
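The income coding above can be sketched as a small lookup. This is only an illustration: the function name `code_income` and the exact treatment of bracket edges are assumptions, since the slide lists only the bracket labels and their coded values.

```python
# Hypothetical helper: map a raw annual income (USD) to the coded value
# listed on the slide. Bracket edges are an assumption, not from the source.
INCOME_BRACKETS = [
    (10_000, 5_000),    # under $10,000
    (20_000, 15_000),   # $10,000 to $19,900
    (25_000, 22_500),   # $20,000 to $24,900
    (30_000, 27_500),   # $25,000 to $29,900
    (50_000, 40_000),   # $30,000 to $49,900
    (75_000, 62_500),   # $50,000 to $74,900
]
TOP_CODE = 85_000       # over $75,000


def code_income(raw_income: float) -> int:
    """Return the coded income value for a raw income."""
    for upper_bound, code in INCOME_BRACKETS:
        if raw_income < upper_bound:
            return code
    return TOP_CODE


coded = [code_income(v) for v in (8_000, 22_000, 49_900, 120_000)]
```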
The iDA Dataset : DeerHunter (3/3)
Attributes (cont.)
Attribute   Description
numbag      "How many deer did you bag during 1991?"
bagbuck     "Did you bag a buck in 1991?" Yes = 1, No = 0
avgcost     Average cost of a hunting trip
totcost     Total cost of hunting trips
a           9, 27, 45, 69, 99, 139, 202, 289, 491, 953
yes         Response to the contingent valuation question: "Would you have taken
            any trips during 1991…if the total cost of all of your trips was $A more
            than the amount you just reported (TOTCOST)?" Yes = 1, No = 2
Unsupervised clustering (Instance Similarity: 65)
Rules for clusters
For a real-valued attribute, a rule gives a range using inequalities; for a categorical attribute, it names a specific category using an equals sign.

Cluster 1 (2898 instances, Total Percent Coverage: 99.97%)
Rule                                                                 Accuracy  Coverage
12.00 <= educ <= 21.00                                               82.79%    57.42%
bagdeer = 1                                                          92.34%    99.86%
1.00 <= numbag <= 41.00                                              93.42%    98.00%
bagbuck = 1                                                          99.30%    77.81%
bagbuck = 1 and 1.00 <= numbag <= 41.00                              99.29%    77.36%
bagbuck = 1 and bagdeer = 1                                          99.30%    77.81%
1.00 <= numbag <= 41.00 and bagdeer = 1                              93.42%    98.00%
1.00 <= numbag <= 41.00 and 12.00 <= educ <= 21.00                   97.48%    56.11%
bagdeer = 1 and 12.00 <= educ <= 21.00                               97.25%    57.32%
bagbuck = 1 and 1.00 <= numbag <= 41.00 and bagdeer = 1              99.29%    77.36%
1.00 <= numbag <= 41.00 and bagdeer = 1 and 12.00 <= educ <= 21.00   97.48%    56.11%
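The slides do not define the two percentages attached to each rule. A minimal sketch, assuming the usual reading: accuracy is the share of instances matching the rule that belong to the cluster, and coverage is the share of the cluster's instances that match the rule. The function name `rule_stats` and the toy records are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]


def rule_stats(records: List[Record], cluster_of: List[int], cluster: int,
               rule: Callable[[Record], bool]) -> Tuple[float, float]:
    """Return (accuracy, coverage) of `rule` for `cluster`, as fractions."""
    fires = [rule(r) for r in records]
    in_cluster = [c == cluster for c in cluster_of]
    both = sum(1 for f, m in zip(fires, in_cluster) if f and m)
    n_fires = sum(fires)
    n_cluster = sum(in_cluster)
    accuracy = both / n_fires if n_fires else 0.0      # assumed definition
    coverage = both / n_cluster if n_cluster else 0.0  # assumed definition
    return accuracy, coverage


# Toy data, not taken from the DeerHunter dataset.
records = [
    {"bagdeer": 1, "numbag": 2},
    {"bagdeer": 1, "numbag": 1},
    {"bagdeer": 0, "numbag": 0},
    {"bagdeer": 1, "numbag": 1},
    {"bagdeer": 0, "numbag": 0},
]
clusters = [1, 1, 1, 1, 3]

# Categorical rule (equality) and real-valued rule (range), as on the slide.
acc_cat, cov_cat = rule_stats(records, clusters, 1, lambda r: r["bagdeer"] == 1)
acc_rng, cov_rng = rule_stats(records, clusters, 1,
                              lambda r: 1.0 <= r["numbag"] <= 41.0)
```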
Unsupervised clustering (Instance Similarity: 65)
Rules for clusters (cont.)

Cluster 3 (2845 instances, Total Percent Coverage: 93.11%)
Rule                                      Accuracy  Coverage
bagdeer = 0                               89.20%    91.70%
0.00 <= numbag <= 0.00                    87.74%    93.11%
bagdeer = 0 and 0.00 <= numbag <= 0.00    89.20%    91.70%

Cluster 2 (26 instances)
Cluster 4 (290 instances)
No rule.
Unsupervised clustering (Instance Similarity: 65)
Important categorical attributes of Cluster 1 & 3

Cluster 1
Attribute   Value   Frequency   Predictability   Predictiveness
bagdeer     1       2894        1.00             0.92
bagdeer     0       4           0.00             0.00
bagbuck     1       2255        0.78             0.99
bagbuck     0       643         0.22             0.17

Cluster 3
Attribute   Value   Frequency   Predictability   Predictiveness
bagdeer     1       236         0.08             0.08
bagdeer     0       2609        0.92             0.89
bagbuck     1       16          0.01             0.01
bagbuck     0       2829        0.99             0.75
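The Predictability and Predictiveness columns are not defined on the slide. A minimal sketch, assuming the common reading: predictability is the fraction of a cluster's instances that have a given attribute value, and predictiveness is the fraction of all instances with that value that fall in the cluster. Function names and the toy data are hypothetical.

```python
from typing import Hashable, List, Tuple

Pair = Tuple[int, Hashable]  # (cluster id, attribute value)


def predictability(pairs: List[Pair], cluster: int, value: Hashable) -> float:
    """Assumed definition: P(value | cluster)."""
    values_in_cluster = [v for c, v in pairs if c == cluster]
    if not values_in_cluster:
        return 0.0
    return values_in_cluster.count(value) / len(values_in_cluster)


def predictiveness(pairs: List[Pair], cluster: int, value: Hashable) -> float:
    """Assumed definition: P(cluster | value)."""
    clusters_with_value = [c for c, v in pairs if v == value]
    if not clusters_with_value:
        return 0.0
    return clusters_with_value.count(cluster) / len(clusters_with_value)


# Toy (cluster, bagdeer) pairs, not taken from the DeerHunter dataset.
pairs = [(1, 1), (1, 1), (1, 1), (1, 0),
         (3, 0), (3, 0), (3, 0), (3, 1), (3, 1)]

p_ability = predictability(pairs, cluster=1, value=1)
p_iveness = predictiveness(pairs, cluster=1, value=1)
```

Under this reading, the slide's Cluster 1 count of 2894 instances with bagdeer = 1 out of 2898 gives 2894 / 2898 ≈ 0.999, which rounds to the listed predictability of 1.00.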
Unsupervised clustering (Instance Similarity: 65)
Important real-valued attributes of Cluster 1 & 3

Cluster 1
Name     Mean     Standard Deviation
educ     15.76    5.785
numbag   1.686    1.791

Cluster 3
Name     Mean     Standard Deviation
educ     16.514   6.034
numbag   0.071    0.264
Unsupervised clustering (Instance Similarity: 66)
Class Resemblance Statistics
                  1       2       3       4       5       6       Domain
Res. Score        0.645   0.886   0.72    0.862   0.775   0.653   0.60
No. of Inst.      2252    85      166     104     48      3404    6059
Cluster Quality   0.07    0.47    0.19    0.43    0.29    0.08
Unsupervised clustering (Instance Similarity: 67)
Class Resemblance Statistics
                  1       2       3       4       5       Domain
Res. Score        0.65    0.784   0.751   0.654   0.741   0.60
No. of Inst.      2151    43      91      3479    295     6059
Cluster Quality   0.08    0.30    0.25    0.08    0.23
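The slides do not say how Cluster Quality is derived. The listed values look consistent with the relative gain of a cluster's resemblance score over the domain resemblance score, so the sketch below assumes that formula. The function name `cluster_quality` is hypothetical, and because the domain score is printed only as 0.60, the results agree with the slide only to within rounding.

```python
from typing import List


def cluster_quality(class_res: float, domain_res: float) -> float:
    """Assumed formula: relative gain of the class score over the domain score."""
    return (class_res - domain_res) / domain_res


# Values from the Instance Similarity 67 table.
domain_res = 0.60
res_scores: List[float] = [0.65, 0.784, 0.751, 0.654, 0.741]
reported: List[float] = [0.08, 0.30, 0.25, 0.08, 0.23]

computed = [cluster_quality(s, domain_res) for s in res_scores]
max_gap = max(abs(c - r) for c, r in zip(computed, reported))
```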
Kohonen Neural Network(unsupervised clustering)
Run after designating the categorical attributes as real-numbered attributes.
Learning Rate
• Set to a value between 0.1 and 0.9.
• A lower value makes training iterate more; a higher value returns a result faster, but raises the chance of missing the optimal answer.
Epochs
• The total number of times the full training set passes through the network structure.
Instances
• The number of instances to use for training.
• Once training ends, the network weight values are fixed, and the test data is used for the final clustering.
Clusters
• The number of clusters the network generates.
• When set to n, the algorithm keeps only the n most populated output layer nodes for network testing.
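How the four parameters above interact can be sketched as follows. This is a simplified winner-take-all version of Kohonen training (neighbourhood updates are left out), not the implementation used in the slides; all function names are hypothetical and inputs are assumed to be scaled to [0, 1].

```python
import random
from collections import Counter
from typing import List, Sequence


def nearest(weights: List[List[float]], x: Sequence[float]) -> int:
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))


def train_kohonen(data: List[Sequence[float]], n_nodes: int,
                  learning_rate: float, epochs: int,
                  seed: int = 0) -> List[List[float]]:
    """Winner-take-all training; neighbourhood updates omitted for brevity."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(epochs):          # one epoch = one pass over the training set
        for x in data:
            j = nearest(weights, x)  # winning output node
            weights[j] = [w + learning_rate * (v - w)
                          for w, v in zip(weights[j], x)]
    return weights                   # weights stay fixed after training


def assign_clusters(data: List[Sequence[float]], weights: List[List[float]],
                    n_clusters: int) -> List[int]:
    """Keep the n most populated nodes, then assign each instance to one."""
    counts = Counter(nearest(weights, x) for x in data)
    kept = [weights[j] for j, _ in counts.most_common(n_clusters)]
    return [nearest(kept, x) for x in data]


# Toy data, not taken from the DeerHunter dataset.
data = [[0.10, 0.10], [0.15, 0.05], [0.90, 0.95], [0.85, 0.90]]
weights = train_kohonen(data, n_nodes=4, learning_rate=0.5, epochs=20)
labels = assign_clusters(data, weights, n_clusters=2)
```

The returned labels index the kept nodes in order of population, so with `n_clusters=2` every instance receives label 0 or 1.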
Kohonen Neural Network(unsupervised clustering)
Java execution screen
Total epochs
• Does not increase at a constant rate.
RMS
• We expect that tracking how it changes will help in choosing a suitable Total epochs value.
Results of Kohonen Neural Network (Various epochs 1/3)
All instances were used for both training and testing, Clusters was set to 2, and all attributes were used as input.

Epochs   1st Cluster               2nd Cluster               RMS
100      3270 instances (53.97%)   2789 instances (46.03%)   0.241
125      2971 instances (49.03%)   3088 instances (50.97%)   0.247
150      2980 instances (49.18%)   3079 instances (50.82%)   0.247
175      3518 instances (58.06%)   2541 instances (41.94%)   0.242
200      3355 instances (55.37%)   2704 instances (44.62%)   0.240
225      3482 instances (57.47%)   2577 instances (42.53%)   0.243
226      3482 instances (57.47%)   2577 instances (42.53%)   0.243
227      3482 instances (57.47%)   2577 instances (42.53%)   0.243
228      2925 instances (48.28%)   3134 instances (51.72%)   0.240
231      2925 instances (48.28%)   3134 instances (51.72%)   0.240
237      2925 instances (48.28%)   3134 instances (51.72%)   0.242
250      2925 instances (48.28%)   3134 instances (51.72%)   0.241
500      3134 instances (51.72%)   2925 instances (48.28%)   0.245
1000     2925 instances (48.28%)   3134 instances (51.72%)   0.275
10000    All instances classified into a single cluster      0.275
Results of Kohonen Neural Network (Various epochs 2/3)
Epoch 227 (figure panels: bagdeer, numbag, bagbuck)
Results of Kohonen Neural Network (Various epochs 3/3)
Epoch 228 (figure panels: bagdeer, numbag, bagbuck)
Scatter plots (figures 1/5 to 5/5)
Correlation coefficient analysis using the CORREL function

Attribute   Correlation
state        0.036295
urban        0.016366
race         0.016004
retire       0.044582
employ      -0.06122
educ        -0.01407
married     -0.00829
income      -0.09656
gender      -0.0389
age          0.050576
huntexp      0.014289
agehunt      0.080026
trips       -0.19052
bagdeer     -0.07195
numbag      -0.1184
bagbuck     -0.09078
avgcost     -0.06964
totcost     -0.18792
a            0.336645

Result: no attribute has a coefficient close to -1 or 1.
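Excel's CORREL returns the Pearson correlation coefficient of two ranges. A minimal sketch of the same calculation follows; the function name `correl` simply mirrors the spreadsheet function, and the toy columns are not from the dataset. The slide does not state which column each attribute was correlated against, so none is assumed here.

```python
import math
from typing import Sequence


def correl(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy columns: perfectly positively and negatively related.
r_pos = correl([1, 2, 3], [2, 4, 6])
r_neg = correl([1, 2, 3], [3, 2, 1])
```

Values near -1 or 1 indicate a strong linear relationship; values near 0, like most of those listed above, indicate a weak one.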
Backpropagation Neural Network
Learning Rate
• A value from 0.1 to 0.9.
• A lower value makes training iterate more; a higher value makes training faster.
Convergence
• The maximum root mean squared error used as the criterion for ending training.
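The role of the two parameters can be sketched with a small backpropagation loop that stops once the training RMS error falls to the convergence value. This is an illustrative one-hidden-layer network, not the tool used in the slides; the function name `train_backprop`, the `max_epochs` cap, and the toy data are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_backprop(data: List[Example], n_hidden: int, learning_rate: float,
                   convergence: float, max_epochs: int,
                   seed: int = 0) -> Tuple[int, float]:
    """Online backpropagation; returns (epochs run, last training RMS)."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Each weight row carries a trailing bias weight.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
           for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    epochs_run, rms = 0, float("inf")
    for epoch in range(1, max_epochs + 1):
        epochs_run, sq_err = epoch, 0.0
        for x, target in data:
            xin = list(x) + [1.0]
            hid = [sigmoid(sum(w * v for w, v in zip(row, xin))) for row in w_h]
            hin = hid + [1.0]
            out = sigmoid(sum(w * v for w, v in zip(w_o, hin)))
            err = target - out
            sq_err += err * err
            delta_o = err * out * (1.0 - out)
            # Hidden deltas use the output weights before they are updated.
            delta_h = [delta_o * w_o[j] * hid[j] * (1.0 - hid[j])
                       for j in range(n_hidden)]
            w_o = [w + learning_rate * delta_o * v for w, v in zip(w_o, hin)]
            for j in range(n_hidden):
                w_h[j] = [w + learning_rate * delta_h[j] * v
                          for w, v in zip(w_h[j], xin)]
        rms = math.sqrt(sq_err / len(data))  # error accumulated over this epoch
        if rms <= convergence:               # convergence criterion reached
            break
    return epochs_run, rms


# Toy data (logical AND), not taken from the DeerHunter dataset.
and_data = [([0, 0], 0.0), ([0, 1], 0.0), ([1, 0], 0.0), ([1, 1], 1.0)]
epochs_run, final_rms = train_backprop(and_data, n_hidden=2, learning_rate=0.5,
                                       convergence=0.1, max_epochs=5000)
```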
Results of Backpropagation NN
Hidden Layer: 5-3, Training Instances: 3000, Test Instances: 3059

Epochs   Test Data RMS   Test Data MAE
100      0.498           0.485
350      0.453           0.398
400      0.453           0.399
500      0.449           0.396
550      0.455           0.397
600      0.455           0.391
625      0.451           0.392
650      0.455           0.387
675      0.455           0.397
700      0.457           0.391
750      0.452           0.392
1000     0.460           0.393
5000     0.469           0.395
10000    0.492           0.401
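The two error measures reported in the table can be computed as in the following sketch. The function names and the toy values are illustrative only; how the tool scales targets and outputs is not stated in the slides.

```python
import math
from typing import Sequence


def rms_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy values, not taken from the DeerHunter dataset.
actual = [1.0, 0.0, 1.0, 1.0]
predicted = [0.5, 0.5, 0.5, 0.5]
rms = rms_error(actual, predicted)
mae = mean_abs_error(actual, predicted)
```

Because RMS squares each error before averaging, it is never smaller than MAE on the same data, which matches the pattern in the table.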
Results of Backpropagation NN
Hidden Layer: 5-0, Training Instances: 3000, Test Instances: 3059

Epochs   Test Data RMS   Test Data MAE
100      0.449           0.391
350      0.446           0.389
400      0.447           0.377
500      0.450           0.388
550      0.452           0.385
600      0.451           0.378
625      0.451           0.390
650      0.453           0.384
675      0.453           0.381
700      0.447           0.377
750      0.452           0.384
1000     0.457           0.377
5000     0.458           0.385
10000    0.468           0.387
Results of Backpropagation NN
Hidden Layer: 2-0, Training Instances: 3000, Test Instances: 3059

Epochs   Test Data RMS   Test Data MAE
100      0.447           0.395
350      0.446           0.390
400      0.447           0.391
500      0.447           0.391
550      0.446           0.388
600      0.448           0.390
625      0.445           0.387
650      0.445           0.387
675      0.448           0.389
700      0.445           0.386
750      0.447           0.390
1000     0.444           0.385
5000     0.445           0.386
10000    0.446           0.386
Results of Backpropagation NN: RMS (figure)
Results of Backpropagation NN: MAE (figure)
Supervised classification with ESX
75 → 50
A lower Minimum Correctness Value yields more rules.
Confusion Matrix

Minimum Correctness Value = 75, Percent Correct: 58.0%
      1      2
1     688    1011
2     262    1098

Minimum Correctness Value = 50, Percent Correct: 58.0%
      1      2
1     688    1011
2     262    1098

Minimum Correctness Value = 25, Percent Correct: 56.0%
      1      2
1     785    914
2     407    953
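Percent Correct can be recomputed from each matrix as the diagonal total divided by the grand total. The sketch below does this for the listed matrices; the function name `percent_correct` is hypothetical. The results are about 58.4% and 56.8%, which match the reported 58.0% and 56.0% only if the tool drops the fractional part; that is an observation, not something the slides state.

```python
from typing import List


def percent_correct(matrix: List[List[int]]) -> float:
    """Share of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Matrices as listed for Minimum Correctness Value = 75 (and 50) and = 25.
matrix_75 = [[688, 1011], [262, 1098]]
matrix_25 = [[785, 914], [407, 953]]

pc_75 = percent_correct(matrix_75)
pc_25 = percent_correct(matrix_25)
n_test = sum(sum(row) for row in matrix_75)  # number of test instances
```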