The iDA Dataset: DeerHunter (Wildlife Management)
Data Mining Laboratory (DM.Lab), University of Seoul
May 22, 2008, Sungjick Lee

The iDA Dataset: DeerHunter (1/3)
Information about 6,059 individual deer hunters.

Attribute   Description
wtdeer      Survey weighting variable; should not be used.
state       U.S. state (excluding DC), coded 1-51.
urban       Size of place of residence: a big city or urban area (=3), a small city or town (=2), a rural area.
race        Race: white = 1, other races = 0.
retire      Employed = 0, retired = 1, student = 2, keeping house = 3, other = 4.
employ      Employed = 1, not employed = 0.
educ        Years spent completing education.

The iDA Dataset: DeerHunter (2/3)
Attributes (cont.)

Attribute   Description
married     Married = 1, not married = 0.
income      Income, coded as the midpoint of each category:
            5,000: under $10,000
            15,000: between $10,000 and $19,900
            22,500: between $20,000 and $24,900
            27,500: between $25,000 and $29,900
            40,000: between $30,000 and $49,900
            62,500: between $50,000 and $74,900
            85,000: over $75,000
gender      Female = 1, male = 0.
age         Age of the respondent.
huntexp     ?
agehunt     Age at which the respondent first hunted deer.
trips       Number of deer trips taken in 1991.
bagdeer     "Did you bag a deer in 1991?" Yes = 1, No = 0.

The iDA Dataset: DeerHunter (3/3)
Attributes (cont.)

Attribute   Description
numbag      "How many deer did you bag during 1991?"
bagbuck     "Did you bag a buck in 1991?" Yes = 1, No = 0.
avgcost     Average cost of a hunting trip.
totcost     Total cost of hunting trips.
a           9, 27, 45, 69, 99, 139, 202, 289, 491, 953.
yes         Response to the contingent valuation question: "Would you have taken any trips during 1991 … if the total cost of all of your trips was $A more than the amount you just reported (TOTCOST)?" Yes = 1, No = 2.

Unsupervised clustering (Instance Similarity: 65)
Rules for clusters
Real-valued attributes are bounded to a range with inequalities; categorical attributes are fixed to a specific category with an equality (a computational sketch follows the rule list below).

Cluster 1 (2,898 instances, total percent coverage: 99.97%), rules listed as (Accuracy / Coverage):
• 12.00 <= educ <= 21.00 (82.79% / 57.42%)
• bagdeer = 1 (92.34% / 99.86%)
• 1.00 <= numbag <= 41.00 (93.42% / 98.00%)
• bagbuck = 1 (99.30% / 77.81%)
• bagbuck = 1 and 1.00 <= numbag <= 41.00 (99.29% / 77.36%)
• bagbuck = 1 and bagdeer = 1 (99.30% / 77.81%)
• 1.00 <= numbag <= 41.00 and bagdeer = 1 (93.42% / 98.00%)
• 1.00 <= numbag <= 41.00 and 12.00 <= educ <= 21.00 (97.48% / 56.11%)
• bagdeer = 1 and 12.00 <= educ <= 21.00 (97.25% / 57.32%)
• bagbuck = 1 and 1.00 <= numbag <= 41.00 and bagdeer = 1 (99.29% / 77.36%)
• 1.00 <= numbag <= 41.00 and bagdeer = 1 and 12.00 <= educ <= 21.00 (97.48% / 56.11%)
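The accuracy and coverage figures can be checked directly against the data, assuming the usual reading of the iDA/ESX output: accuracy is the percentage of instances satisfying the rule's preconditions that belong to the cluster, and coverage is the percentage of the cluster's instances that satisfy the preconditions. A minimal pandas sketch; the file name and the `cluster` label column are assumptions, not part of the dataset:

```python
import pandas as pd

def rule_stats(df, rule_mask, cluster_id):
    """Accuracy = % of rule-covered instances that belong to the cluster;
    coverage = % of the cluster's instances that the rule covers."""
    in_cluster = df["cluster"] == cluster_id        # hypothetical cluster-label column
    both = (rule_mask & in_cluster).sum()
    accuracy = 100.0 * both / rule_mask.sum()
    coverage = 100.0 * both / in_cluster.sum()
    return accuracy, coverage

# Hypothetical usage with the attribute names from the slides:
df = pd.read_csv("deerhunter.csv")                  # assumed file name
mask = (df["bagbuck"] == 1) & df["numbag"].between(1.0, 41.0)
acc, cov = rule_stats(df, mask, cluster_id=1)
print(f"Accuracy: {acc:.2f}%  Coverage: {cov:.2f}%")
```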
Unsupervised clustering (Instance Similarity: 65)
Rules for clusters (cont.)

Cluster 3 (2,845 instances, total percent coverage: 93.11%), rules listed as (Accuracy / Coverage):
• bagdeer = 0 (89.20% / 91.70%)
• 0.00 <= numbag <= 0.00 (87.74% / 93.11%)
• bagdeer = 0 and 0.00 <= numbag <= 0.00 (89.20% / 91.70%)
Cluster 2 (26 instances): no rule.
Cluster 4 (290 instances): no rule.

Unsupervised clustering (Instance Similarity: 65)
Important categorical attributes of Clusters 1 & 3

Cluster 1
Attribute   Value   Frequency   Predictability   Predictiveness
bagdeer     1       2894        1.00             0.92
bagdeer     0       4           0.00             0.00
bagbuck     1       2255        0.78             0.99
bagbuck     0       643         0.22             0.17

Cluster 3
Attribute   Value   Frequency   Predictability   Predictiveness
bagdeer     1       236         0.08             0.08
bagdeer     0       2609        0.92             0.89
bagbuck     1       16          0.01             0.01
bagbuck     0       2829        0.99             0.75

Unsupervised clustering (Instance Similarity: 65)
Important real-valued attributes of Clusters 1 & 3

Cluster 1
Name     Mean     Standard Deviation
educ     15.76    5.785
numbag   1.686    1.791

Cluster 3
Name     Mean     Standard Deviation
educ     16.514   6.034
numbag   0.071    0.264

Unsupervised clustering (Instance Similarity: 66)
Class Resemblance Statistics

Cluster           1      2      3      4      5      6      Domain
Res. Score        0.645  0.886  0.72   0.862  0.775  0.653  0.60
No. of Inst.      2252   85     166    104    48     3404   6059
Cluster Quality   0.07   0.47   0.19   0.43   0.29   0.08

Unsupervised clustering (Instance Similarity: 67)
Class Resemblance Statistics

Cluster           1      2      3      4      5      Domain
Res. Score        0.65   0.784  0.751  0.654  0.741  0.60
No. of Inst.      2151   43     91     3479   295    6059
Cluster Quality   0.08   0.30   0.25   0.08   0.23

Kohonen Neural Network (unsupervised clustering)
Run after declaring the categorical attributes as real-numbered attributes.
• Learning Rate: a value between .1 and .9. A lower value makes learning repeat more iterations; a higher value returns a result faster but increases the chance of missing the best answer.
• Epochs: the total number of passes of the entire training data set through the network structure.
• Instances: the number of instances used for training. Once training finishes, the network weight values are fixed and the test data are used for the final clustering.
• Clusters: the number of clusters produced by the network. If n is specified, the algorithm keeps only the n most populated output-layer nodes for network testing.
(A minimal sketch of this scheme follows the epoch experiments below.)

Kohonen Neural Network (unsupervised clustering)
Java run screen [screenshot]
• Total epochs: the rate of increase is not constant.
• RMS: by watching how the RMS changes, one could presumably choose an appropriate number of total epochs.

Results of Kohonen Neural Network (Various epochs 1/3)
All instances were used for both training and testing, Clusters was set to 2, and every attribute was used as input.

Epochs   1st Cluster               2nd Cluster               RMS
100      3270 instances (53.97%)   2789 instances (46.03%)   0.241
125      2971 instances (49.03%)   3088 instances (50.97%)   0.247
150      2980 instances (49.18%)   3079 instances (50.82%)   0.247
175      3518 instances (58.06%)   2541 instances (41.94%)   0.242
200      3355 instances (55.37%)   2704 instances (44.62%)   0.240
225      3482 instances (57.47%)   2577 instances (42.53%)   0.243
226      3482 instances (57.47%)   2577 instances (42.53%)   0.243
227      3482 instances (57.47%)   2577 instances (42.53%)   0.243
228      2925 instances (48.28%)   3134 instances (51.72%)   0.240
231      2925 instances (48.28%)   3134 instances (51.72%)   0.240
237      2925 instances (48.28%)   3134 instances (51.72%)   0.242
250      2925 instances (48.28%)   3134 instances (51.72%)   0.241
500      3134 instances (51.72%)   2925 instances (48.28%)   0.245
1000     2925 instances (48.28%)   3134 instances (51.72%)   0.275
10000    classified into a single cluster                    0.275

Results of Kohonen Neural Network (Various epochs 2/3)
Epoch 227: [cluster plots of bagdeer, numbag, bagbuck]

Results of Kohonen Neural Network (Various epochs 3/3)
Epoch 228: [cluster plots of bagdeer, numbag, bagbuck]
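To make the Kohonen parameters above concrete (a learning rate that decays over the epochs, a fixed number of output nodes, winner-take-all weight updates, and weights frozen before the final assignment), here is a minimal NumPy sketch of a Kohonen-style clusterer. It is not the iDA implementation; the file name and the numeric recoding of categorical attributes are assumptions:

```python
import numpy as np

def kohonen_cluster(data, n_nodes=2, epochs=250, lr=0.3, seed=0):
    """Minimal Kohonen-style clustering: one layer of output nodes,
    winner-take-all updates, learning rate decayed linearly to zero."""
    rng = np.random.default_rng(seed)
    # Scale every attribute to [0, 1] so no single attribute dominates the distance.
    lo, hi = data.min(axis=0), data.max(axis=0)
    x = (data - lo) / np.where(hi > lo, hi - lo, 1.0)
    weights = rng.random((n_nodes, x.shape[1]))

    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)           # decaying learning rate
        for i in rng.permutation(len(x)):
            d = np.linalg.norm(weights - x[i], axis=1)
            bmu = int(np.argmin(d))                  # best-matching output node
            weights[bmu] += rate * (x[i] - weights[bmu])

    # Weights are now fixed; assign every instance to its nearest node.
    assign = np.array([int(np.argmin(np.linalg.norm(weights - xi, axis=1))) for xi in x])
    rms = np.sqrt(np.mean((x - weights[assign]) ** 2))   # crude quantization error
    return assign, rms

# Hypothetical usage: a numeric DeerHunter matrix with categorical columns recoded as numbers.
# data = np.loadtxt("deerhunter_numeric.csv", delimiter=",", skiprows=1)  # assumed file
# labels, rms = kohonen_cluster(data, n_nodes=2, epochs=228)
# print(np.bincount(labels), rms)
```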
Scatter plots (1/5) - (5/5)
[Figure-only slides: scatter plots of the DeerHunter attributes; no accompanying text.]

Correlation coefficient analysis using the CORREL function

state     0.036295    urban     0.016366    race      0.016004    retire    0.044582    employ   -0.06122
educ     -0.01407     married  -0.00829     income   -0.09656     gender   -0.0389      age       0.050576
huntexp   0.014289    agehunt   0.080026    trips    -0.19052     bagdeer  -0.07195     numbag   -0.1184
bagbuck  -0.09078     avgcost  -0.06964     totcost  -0.18792     a         0.336645

Result: no attribute has a correlation coefficient close to -1 or 1.

Backpropagation Neural Network
• Learning Rate: a value from .1 to .9. A lower value makes learning repeat more iterations; a higher value makes learning faster.
• Convergence: the maximum root mean squared error used to terminate learning.

Results of Backpropagation NN
Hidden Layer: 5-3, Training Instances: 3000, Test Instances: 3059

Epochs   Test Data RMS   Test Data MAE
100      0.498           0.485
350      0.453           0.398
400      0.453           0.399
500      0.449           0.396
550      0.455           0.397
600      0.455           0.391
625      0.451           0.392
650      0.455           0.387
675      0.455           0.397
700      0.457           0.391
750      0.452           0.392
1000     0.460           0.393
5000     0.469           0.395
10000    0.492           0.401

Results of Backpropagation NN
Hidden Layer: 5-0, Training Instances: 3000, Test Instances: 3059

Epochs   Test Data RMS   Test Data MAE
100      0.449           0.391
350      0.446           0.389
400      0.447           0.377
500      0.450           0.388
550      0.452           0.385
600      0.451           0.378
625      0.451           0.390
650      0.453           0.384
675      0.453           0.381
700      0.447           0.377
750      0.452           0.384
1000     0.457           0.377
5000     0.458           0.385
10000    0.468           0.387

Results of Backpropagation NN
Hidden Layer: 2-0, Training Instances: 3000, Test Instances: 3059

Epochs   Test Data RMS   Test Data MAE
100      0.447           0.395
350      0.446           0.390
400      0.447           0.391
500      0.447           0.391
550      0.446           0.388
600      0.448           0.390
625      0.445           0.387
650      0.445           0.387
675      0.448           0.389
700      0.445           0.386
750      0.447           0.390
1000     0.444           0.385
5000     0.445           0.386
10000    0.446           0.386

Results of Backpropagation NN
RMS: [plot of test-data RMS versus epochs]

Results of Backpropagation NN
MAE: [plot of test-data MAE versus epochs]

Supervised classification with ESX
[Screenshots of the ESX classification runs, including minimum correctness values of 75 and 50.]

Supervised classification with ESX
A lower Minimum Correctness Value produces more rules.

Supervised classification with ESX
Confusion Matrix

Minimum Correctness Value = 75, Percent Correct: 58.0%
         1      2
1        688    1011
2        262    1098

Minimum Correctness Value = 50, Percent Correct: 58.0%
         1      2
1        688    1011
2        262    1098

Minimum Correctness Value = 25, Percent Correct: 56.0%
         1      2
1        785    914
2        407    953
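The backpropagation runs above (for example, the 5-3 hidden layer with 3,000 training and 3,059 test instances) and the percent-correct figure reported with the confusion matrices could be reproduced in spirit with scikit-learn. This is a hedged sketch rather than the iDA/ESX tool itself; the file name, the choice of "yes" as the target column, the learning rate, and the preprocessing are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, mean_absolute_error, mean_squared_error

# Hypothetical setup: predict the contingent-valuation answer "yes" (1 = yes, 2 = no)
# from the remaining attributes, excluding the weighting variable wtdeer.
df = pd.read_csv("deerhunter.csv")                       # assumed file name
X = df.drop(columns=["yes", "wtdeer"]).to_numpy(dtype=float)
y = df["yes"].to_numpy()

# 3,000 training instances and 3,059 test instances, as in the slides.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=3000, shuffle=False)

net = MLPClassifier(hidden_layer_sizes=(5, 3),           # the "5-3" hidden layer
                    learning_rate_init=0.3,              # assumed learning rate
                    max_iter=700)                        # roughly the epoch range explored above
net.fit(X_tr, y_tr)

pred = net.predict(X_te)
rms = np.sqrt(mean_squared_error(y_te, pred))            # test-data RMS
mae = mean_absolute_error(y_te, pred)                    # test-data MAE
cm = confusion_matrix(y_te, pred)
pct_correct = 100.0 * np.trace(cm) / cm.sum()            # percent correct from the confusion matrix
print(f"RMS {rms:.3f}  MAE {mae:.3f}  Percent correct {pct_correct:.1f}%")
```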