
4 Data Reduction
Song Sang-ok (송상옥), Division of Applied Chemistry

Presentation Outline
• The need for data reduction
• The role and forms of dimension reduction
• Specific methods of dimension reduction

Why Is It Needed?
• With too much data
  – the capacity of the prediction program is exceeded
  – finding a solution takes longer
• The appropriate amount of data
  – depends on the complexity of the concepts contained in the data (complexity of the model)
  – cannot be known before mining
  – Ex) random data

The Role of Dimension Reduction

Forms of Dimension Reduction
• Delete a column (feature)
• Delete a row (case)
• Reduce the number of values in a column (smooth a feature)
• Transformation to a new data set (PCA)

Best Features Selection
• Impossible!
  – search space
  – computational time
• Approximation
  – promising subsets
  – simple distance measure
  – using only training error

Mean and Variance
• Cases: a sample from some distribution
• Spreadsheet: mean and variance
• BUT the distribution is unknown
• Heuristic feature selection guidance

Independent Features
• Classification problem
  $se(A - B) = \sqrt{\frac{var(A)}{n_1} + \frac{var(B)}{n_2}}$
  $\frac{|mean(A) - mean(B)|}{se(A - B)} > sig$
• k-class classification
  – k pairwise comparisons
• Regression = pseudo-classification

Distance Based Selection
• Independent analysis + correlation analysis
• Detect redundancy
• Distance measure
  $DM = (M_1 - M_2)(C_1 + C_2)^{-1}(M_1 - M_2)^T$
  – Independent features
    $DM = \sum_i \frac{(m_1(i) - m_2(i))^2}{var_1(i) + var_2(i)}$
• Branch-and-Bound Algorithm
  $DM(F) \le DM(F, i)$

Heuristic Feature Selection
• Comparison measures
  – Significance test
  – DM
  – F-test

Principal Components
• Merging features
  – a new set of fewer columns: $S' = SP$ (first k components)
• First principal component
  – minimum Euclidean distance
• Feature with a large variance
  – excellent chances for separation of classes or groups of case values

Decision Trees
• Dynamic logic approach
  – coordinated with searching for a solution
• Advantageous in large feature spaces
• Recursive partitioning

Reducing Values Problem
• Clustering problem

Rounding
  $iy = int(ix / 10^k)$
  if $mod(ix, 10^k) \ge 10^k / 2$ then $iy = iy + 1$
  $ix = iy \cdot 10^k$

K-Means Clustering

Class Entropy
  $ent(k) = -\sum_{i}^{n(k)} \Pr(C_i) \log \Pr(C_i)$
  $Err = \sum_k ent(k) \cdot N(k)$

How Many Cases?
• Appropriate sample size
• Complexity
• Closely tied to the prediction method
• An adequate solution in a short time
• Case reduction!
• Basic approach (random sampling)
  – incremental samples
  – average samples

A Single Sample

Incremental Samples

Average Samples
• Can reduce variance error without adding bias
• Best solution approach

Specialized Techniques
• Sequential sampling over time
  – time-dependent data
  – optimize between the sampling period and feature measuring
• Strategic sampling of key events
  – net change > threshold (regression)
• Adjusting prevalence
  – repeat cases for low prevalence
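
The significance test on the Independent Features slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `feature_is_significant` is hypothetical, the default threshold `sig=2.0` is an assumed value (the slide does not give one), and the sample variance (n - 1 denominator) is used because the slide does not say which variance is meant.

```python
import math
from statistics import mean, variance


def feature_is_significant(values_a, values_b, sig=2.0):
    """Check whether one feature separates two classes.

    values_a, values_b: the feature's values for the cases of class A and B.
    sig: assumed threshold, in standard errors (not given on the slide).
    """
    n1, n2 = len(values_a), len(values_b)
    # se(A - B) = sqrt(var(A)/n1 + var(B)/n2), using the sample variance
    se = math.sqrt(variance(values_a) / n1 + variance(values_b) / n2)
    diff = abs(mean(values_a) - mean(values_b))
    if se == 0.0:
        # No spread in either class: any difference in means separates them.
        return diff > 0.0
    return diff / se > sig


if __name__ == "__main__":
    class_a = [1.0, 1.2, 0.9, 1.1]
    class_b = [3.0, 3.1, 2.9, 3.2]
    print(feature_is_significant(class_a, class_b))
```

A feature that fails this test for every pair of classes is a candidate for deletion; the test looks at each feature on its own, so it cannot detect redundancy between features.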
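
For the independent-feature form of the distance measure on the Distance Based Selection slide, a minimal Python sketch follows. The function name `independent_distance` and the row-per-case input layout are assumptions made for illustration, and the sample variance is used because the slide does not specify which variance is meant. The sketch does not cover the full covariance form or the branch-and-bound search.

```python
from statistics import mean, variance


def independent_distance(class1_rows, class2_rows):
    """Sum over features of (m1(i) - m2(i))^2 / (var1(i) + var2(i)).

    Each argument is a list of cases; each case is a sequence of feature
    values. Both classes must use the same feature order.
    """
    # Transpose rows (cases) into columns (features).
    cols1 = list(zip(*class1_rows))
    cols2 = list(zip(*class2_rows))
    total = 0.0
    for a, b in zip(cols1, cols2):
        # Raises ZeroDivisionError if a feature is constant in both classes.
        total += (mean(a) - mean(b)) ** 2 / (variance(a) + variance(b))
    return total


if __name__ == "__main__":
    class1 = [(1.0, 5.0), (3.0, 5.5)]
    class2 = [(5.0, 5.0), (7.0, 5.5)]
    print(independent_distance(class1, class2))
```

In this example only the first feature contributes to the distance, because the second feature has the same mean in both classes.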
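
The rounding rule on the Rounding slide translates almost directly into code. The sketch below assumes non-negative integer feature values; the function name `round_to_power_of_ten` is hypothetical.

```python
def round_to_power_of_ten(ix: int, k: int) -> int:
    """Round a non-negative integer to the nearest multiple of 10**k.

    Follows the slide: iy = int(ix / 10^k); add 1 to iy when the remainder
    is at least half of 10^k; the smoothed value is iy * 10^k.
    """
    unit = 10 ** k
    iy = ix // unit              # int(ix / 10^k) for non-negative ix
    if ix % unit >= unit / 2:    # remainder is at least half a unit
        iy += 1
    return iy * unit


if __name__ == "__main__":
    for value in (1449, 1450, 7, 4):
        print(value, round_to_power_of_ten(value, 2), round_to_power_of_ten(value, 1))
```

A larger k leaves fewer distinct values in the column, which is the intended reduction.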
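
The class-entropy error on the Class Entropy slide can be illustrated as follows. Several points are assumptions, since the slide does not define its symbols: each group k is taken to be one interval (bin) of a feature's values, Pr(C_i) is estimated as the fraction of cases in that bin belonging to class C_i, N(k) is taken to be the number of cases in bin k, and the logarithm is base 2. The function names are hypothetical.

```python
import math
from collections import Counter


def class_entropy(labels):
    """ent(k) = -sum_i Pr(C_i) * log2 Pr(C_i) for the labels in one bin."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def weighted_entropy_error(bins):
    """Err = sum_k ent(k) * N(k), with N(k) assumed to be the bin size."""
    return sum(class_entropy(labels) * len(labels) for labels in bins if labels)


if __name__ == "__main__":
    pure = [["a", "a"], ["b", "b"]]    # each bin holds a single class
    mixed = [["a", "b"], ["a", "b"]]   # each bin is an even mix
    print(weighted_entropy_error(pure), weighted_entropy_error(mixed))
```

Under these assumptions, a binning that keeps classes apart scores lower than one that mixes them, so the error can be used to compare candidate ways of reducing a feature's values.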
