Data Reduction
송상옥, Dept. of Applied Chemistry

Outline
– Why data reduction is needed
– The role and forms of dimension reduction
– Specific methods of dimension reduction

Why Is It Needed?
– Too much data
  – exceeds the capacity of the prediction program
  – delays the time needed to find a solution
– The appropriate amount of data
  – depends on the complexity of the concepts in the data (model complexity)
  – cannot be known before mining
  – e.g., random data

The Role of Dimension Reduction

Forms of Dimension Reduction
– Delete a column (feature)
– Delete a row (case)
– Reduce the number of values in a column (smooth a feature)
– Transform to a new data set (PCA)

Best Feature Selection
– Exhaustive search is impossible!
  – search space, computational time
– Approximation
  – promising subsets
  – simple distance measures
  – using only the training error

Mean and Variance
– Cases: a sample from some distribution
– Spreadsheet mean and variance
– But the distribution is unknown → heuristic guidance for feature selection

Independent Features
– Classification problem: significance test for a feature between classes A and B
  – se(A, B) = sqrt(var(A)/n1 + var(B)/n2)
  – sig = |mean(A) − mean(B)| / se(A, B)
– k-class classification → k pairwise comparisons
– Regression = pseudo-classification

Distance-Based Selection
– Independent analysis + correlation analysis to detect redundancy
– Distance measure
  – DM = (M1 − M2)^T (C1 + C2)^(−1) (M1 − M2)
  – for independent features: DM = Σ_i (m1_i − m2_i)^2 / (var1_i + var2_i)
– Branch-and-bound algorithm: DM(F) ≤ DM(F ∪ {i})

Heuristic Feature Selection
– Comparison measures
  – significance test
  – DM
  – F-test

Principal Components
– Merging features → a new set of fewer columns: S' = S·P, keeping the first k components
– First principal component: minimum euclidean distance
– A feature with a large variance has excellent chances for separating classes or groups of case values

Decision Trees
– Dynamic logic approach, coordinated with the search for a solution
– Advantageous in large feature spaces
– Recursive partitioning

Reducing the Number of Values
– A clustering problem

Rounding
– iy = int(ix / 10^k)
– if mod(ix, 10^k) ≥ 10^k / 2 then iy = iy + 1
– ix = iy · 10^k

K-Means Clustering

Class Entropy
– ent(k) = −Σ_i Pr(C_i) · log Pr(C_i)
– Err = Σ_k ent(k) · n(k) / N

How Many Cases?
– The appropriate sample size depends on complexity
– Closely tied to the prediction method
– An adequate solution in a short time → case reduction!
– Basic approach (random sampling)
  – incremental samples
  – average samples

A Single Sample

Incremental Samples

Average Samples
– Reduce variance error without introducing additional bias
– Best-solution approach

Specialized Techniques
– Sequential sampling over time
  – time-dependent data
  – trade-off between the sampling period and feature measurement
– Strategic sampling of key events
  – net change > threshold (regression)
– Adjusting prevalence
  – repeat cases for low-prevalence classes
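The "average samples" idea above — combining estimates from several independent random samples so that variance shrinks while no bias is added — can be sketched as follows. This is a minimal illustration, not the slides' own code: the population, the sample sizes, and the helper names are illustrative assumptions.

```python
import random
import statistics

# Hypothetical population: a noisy linear signal (an assumption for
# illustration; any population with a mean to estimate would do).
random.seed(0)
population = [2.0 * x + random.gauss(0, 5) for x in range(1000)]

def sample_estimate(pop, n):
    """Estimate the population mean from one random sample of n cases."""
    return statistics.mean(random.sample(pop, n))

def averaged_estimate(pop, n, k):
    """Average the estimates from k independent random samples.

    Each single-sample estimate is already unbiased, so averaging adds
    no bias -- it only shrinks the variance of the combined estimate.
    """
    return statistics.mean(sample_estimate(pop, n) for _ in range(k))

# Compare the spread of single-sample vs. averaged estimates over
# many repetitions.
singles = [sample_estimate(population, 30) for _ in range(200)]
averaged = [averaged_estimate(population, 30, 10) for _ in range(200)]

print(statistics.pstdev(singles))   # larger spread around the true mean
print(statistics.pstdev(averaged))  # noticeably smaller spread, same target
```

The same mechanism underlies the "best solution" claim on the Average Samples slide: the averaged estimator targets the same quantity as a single sample but with lower variance error.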