Transcript
```4
Data Reduction
School of Applied Chemistry
송상옥
1
Outline
• Why data reduction is needed
• Roles and forms of dimension reduction
• Specific methods for dimension reduction
2
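Why Is It Needed?
• Too much data
– exceeds the capacity of the prediction program
– lengthens the time needed to find a solution
• The appropriate amount of data
– depends on the complexity of the concepts contained in the data (model complexity)
– cannot be known before mining
– Ex) random data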
3
Role of Dimension Reduction
4
Forms of Dimension Reduction
• Delete a column (feature)
• Delete a row (case)
• Reduce the number of values in a column (smooth a feature)
• Transform to a new data set (PCA)
5
Best Feature Selection
• Impossible!
– Search space
– Computational time
• Approximation
– promising subsets
– simple distance measure
– using only training error
6
Mean and Variance
• Cases: a sample from some distribution
• Spreadsheet → mean and variance
• BUT the distribution is unknown
→ Heuristic Feature Selection Guidance
7
Independent Features
• Classification problem
\mathrm{se}(A-B) = \sqrt{\frac{\mathrm{var}(A)}{n_1} + \frac{\mathrm{var}(B)}{n_2}}
\frac{|\mathrm{mean}(A) - \mathrm{mean}(B)|}{\mathrm{se}(A-B)} > \mathrm{sig}
• k-class classification
– k pairwise comparisons
• Regression = pseudo-classification
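Worked example of the significance test (all numbers are hypothetical, and sig = 2 is an assumed threshold, not one stated on the slide):
\mathrm{var}(A) = 4,\ n_1 = 100,\ \mathrm{var}(B) = 5,\ n_2 = 100
\mathrm{se}(A-B) = \sqrt{4/100 + 5/100} = \sqrt{0.09} = 0.3
\mathrm{mean}(A) = 10.0,\ \mathrm{mean}(B) = 9.1
|10.0 - 9.1| / 0.3 = 3.0 > 2
Under these assumptions the feature separates the two classes significantly and would be kept.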
8
Distance-Based Selection
• Independent analysis + correlation analysis → detect redundancy
• Distance measure
D_M = (M_1 - M_2)(C_1 + C_2)^{-1}(M_1 - M_2)^T
– Independent feature
\frac{(m_1(i) - m_2(i))^2}{\mathrm{var}_1(i) + \mathrm{var}_2(i)}
• Branch-and-Bound Algorithm
DM F  DM F , i
9
Heuristic Feature Selection
• Comparison measures
– Significance test
– D_M
– F-test
10
Principal Components
• Merging features
– a new set of fewer columns
S' = SP (first k components)
• First principal component
– minimum Euclidean distance
• Feature with a large variance
– excellent chances for separation of classes or groups of case values
11
Decision Trees
• Dynamic logic approach
– coordinated with searching for a solution
• Advantageous in large feature spaces
• Recursive partitioning
12
Reducing Values Problem
• Clustering problem
13
Rounding
iy = \mathrm{int}(ix / 10^k)
\text{if } \mathrm{mod}(ix, 10^k) \ge 10^k / 2 \text{ then } iy = iy + 1
ix = iy \times 10^k
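Worked example of the rounding rule iy = int(ix / 10^k), round up when mod(ix, 10^k) \ge 10^k / 2, then ix = iy \times 10^k (the values ix = 1278 and k = 2 are hypothetical):
iy = \mathrm{int}(1278 / 10^2) = 12
\mathrm{mod}(1278, 10^2) = 78 \ge 50 \Rightarrow iy = 13
ix = 13 \times 10^2 = 1300
With k = 2 every value collapses to a multiple of 100, so the feature takes far fewer distinct values.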
14
K-Means Clustering
15
Class Entropy
\mathrm{ent}(k) = -\sum_i \Pr(C_i) \log \Pr(C_i)
\mathrm{Err} = \sum_k \mathrm{ent}(k) \cdot \frac{n(k)}{N}
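Worked example of the entropy-based error, ent(k) = -\sum_i \Pr(C_i) \log \Pr(C_i) weighted by n(k)/N (two value groups and two classes with hypothetical counts; base-2 logarithm and 0 \log 0 = 0 are assumed):
k = 1:\ n(1) = 40,\ \Pr(C_1) = \Pr(C_2) = 0.5 \Rightarrow \mathrm{ent}(1) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1
k = 2:\ n(2) = 60,\ \Pr(C_1) = 1 \Rightarrow \mathrm{ent}(2) = 0
\mathrm{Err} = 1 \cdot \frac{40}{100} + 0 \cdot \frac{60}{100} = 0.4
A grouping of values whose groups are purer in class membership gives a lower Err.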
16
How Many Cases?
• The appropriate sample size depends on complexity
• Closely tied to the prediction method
• An adequate solution in a short time
• Case reduction!!
• Basic approach (random sampling)
– Incremental samples
– Average samples
17
A Single Sample
18
Incremental Samples
19
Average Samples
• Can reduce variance error without adding bias
• Best Solution Approach
20
Specialized Techniques
• Sequential sampling over time
– Time-dependent data
– Optimize the trade-off between the sampling period and feature measurement
• Strategic sampling of key events
– Net change > threshold (regression)