Variance Reduction for Stable Feature Selection
Presenter: Yue Han
Advisor: Lei Yu
Ph.D. Dissertation
4/26/2012
Outline
 Introduction and Motivation
 Background and Related Work
 Major Contributions
 Publications
 Theoretical Framework for Stable Feature Selection
 Empirical Framework: Margin Based Instance Weighting
 Empirical Study
   General Experimental Setup
   Experiments on Synthetic Data
   Experiments on Real-World Data
 Conclusion and Future Work
Feature Selection Applications
 Gene selection (microarray data)
 Pixel selection (image data)
 Word selection (text data; illustrative topic labels from the slide: Sports, Travel, Politics, Tech, Artist, Life, Science, Internet, Business, Health, Elections)
Feature Selection from High-dimensional Data
(Pipeline: High-Dimensional Data → Feature Selection Algorithms → Low-Dimensional Data → Learning Models)
Knowledge discovery on high-dimensional data:
p: # of features; n: # of samples; high-dimensional data: p >> n
Curse of dimensionality:
• Effects on distance functions
• In optimization and learning
• In Bayesian statistics
Feature selection:
• Alleviates the effect of the curse of dimensionality
• Enhances generalization capability
• Speeds up the learning process
• Improves model interpretability
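One facet of the curse of dimensionality listed above, the degradation of distance functions, is easy to see numerically. The following is a minimal illustration (not from the dissertation): as the number of features p grows, the gap between a query point's nearest and farthest neighbors shrinks relative to the distances themselves.

```python
# Illustration (not from the dissertation): distance concentration in high
# dimensions. As p grows, the relative contrast between the nearest and
# farthest neighbor of a query point shrinks toward zero.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of sample points
for p in [2, 10, 100, 1000, 10000]:
    X = rng.standard_normal((n, p))
    q = rng.standard_normal(p)
    d = np.linalg.norm(X - q, axis=1)
    # relative contrast: (d_max - d_min) / d_min, approaches 0 as p grows
    contrast = (d.max() - d.min()) / d.min()
    print(f"p={p:>6}: relative distance contrast = {contrast:.3f}")
```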
Stability of Feature Selection
(Diagram: multiple versions of the training data → the same feature selection method → multiple feature subsets. Consistent or not?)
Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.
(Diagram: multiple versions of the training data → the same learning algorithm → multiple learning models.)
The stability of learning algorithms was first examined by Turney in 1995; the stability of feature selection was relatively neglected until recently, when it attracted interest from data mining researchers.
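To make the stability issue concrete, here is a minimal sketch (assuming scikit-learn and a simple univariate scorer, not the dissertation's method): the same selector applied to two random halves of a small, high-dimensional sample returns noticeably different top-k subsets.

```python
# Minimal sketch: the same feature selector applied to two subsamples of a
# small high-dimensional dataset can return quite different feature subsets.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n, p, k = 60, 2000, 50
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)
X[:, :20] += y[:, None] * 0.8          # only the first 20 features are relevant

idx = rng.permutation(n)
subsets = []
for half in (idx[: n // 2], idx[n // 2:]):
    sel = SelectKBest(f_classif, k=k).fit(X[half], y[half])
    subsets.append(set(np.flatnonzero(sel.get_support())))

jaccard = len(subsets[0] & subsets[1]) / len(subsets[0] | subsets[1])
print(f"Jaccard overlap of the two top-{k} subsets: {jaccard:.2f}")  # well below 1
```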
Motivation for Stable Feature Selection
(Diagram: training data D1 and training data D2 are both drawn from sample space D.)
Given unlimited sample size, the feature selection results from D1 and D2 are the same.
Given limited sample size (n << p for high-dimensional data), the feature selection results from D1 and D2 are different.
Biologists care about:
• Prediction accuracy and consistency of feature subsets;
• Confidence for biological validation;
• Biomarkers to explain the observed phenomena.
Feature Selection Methods
(Flowchart: original set → subset generation → subset evaluation → goodness of subset → stopping criterion; if the criterion is not met, generate the next subset; if it is met, validate the result.)
Search strategies:
 Complete search
 Sequential search
 Random search
Evaluation criteria:
 Filter model
 Wrapper model
 Embedded model
Representative algorithms:
 Relief, SFS, MDLM, etc.
 FSBC, ELSA, LVW, etc.
 BBHFS, Dash-Liu's, etc.
Stable Feature Selection
Comparison of feature selection algorithms w.r.t. stability (Davis et al., Bioinformatics, vol. 22, 2006; Kalousis et al., KAIS, vol. 12, 2007)
 Quantify stability in terms of the consistency of subsets or weights;
 Algorithms vary in stability while performing equally well for classification;
 Choose the algorithm that is best in both stability and accuracy.
Bagging-based ensemble feature selection (Saeys et al., ECML 2007)
 Draw different bootstrap samples of the same training set;
 Apply a conventional feature selection algorithm to each sample;
 Aggregate the feature selection results.
Group-based stable feature selection (Yu et al., KDD 2008; Loscalzo et al., KDD 2009)
 Explore the intrinsic feature correlations;
 Identify groups of correlated features;
 Select relevant feature groups.
Margin-based Feature Selection
 Sample margin: how far an instance can travel before it hits the decision boundary.
 Hypothesis margin: how far the hypothesis can travel before it hits an instance (the distance between the hypothesis and the opposite hypothesis of an instance).
Representative algorithms based on the hypothesis margin: Relief-F, G-flip, Simba, etc. In these algorithms the margin is used for feature weighting or feature selection; our study puts it to a totally different use (instance weighting).
Publications
 Yue Han and Lei Yu. Margin Based Sample Weighting for Stable Feature Selection. In Proceedings of the 11th International Conference on Web-Age Information Management (WAIM 2010), pages 680-691, Jiuzhaigou, China, July 15-17, 2010.
 Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pages 205-215, Sydney, Australia, December 14-17, 2010.
 Lei Yu, Yue Han and Michael E. Berens. Stable Gene Selection from Microarray Data via Sample Weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 1, pages 262-272, 2012.
 Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. Statistical Analysis and Data Mining (SADM), accepted, 2012.
Bias-Variance Decomposition of Feature Selection Error
Let $D$ be a training set drawn from the data space, let $r(D)$ be the feature selection result on $D$ (e.g., a vector of feature weights), and let $r^*$ be the true feature selection result. Then
$$\mathrm{Error}(r) = E_D\big[\|r(D) - r^*\|^2\big], \qquad \mathrm{Bias}(r) = \|E_D[r(D)] - r^*\|^2, \qquad \mathrm{Var}(r) = E_D\big[\|r(D) - E_D[r(D)]\|^2\big].$$
Bias-variance decomposition of feature selection error: $\mathrm{Error} = \mathrm{Bias} + \mathrm{Var}$.
o Reveals the relationship between accuracy (the opposite of loss) and stability (the opposite of variance);
o Suggests a better trade-off between the bias and variance of feature selection.
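As a sanity check on the decomposition, the sketch below (a minimal simulation with an assumed Gaussian noise model, not the dissertation's experiment) treats a feature scorer as a vector-valued estimator $r(D)$, simulates many training sets, and verifies that the empirical error equals bias plus variance.

```python
# Minimal simulation: empirical check that Error = Bias + Variance for a
# vector-valued estimator r(D) of a true target r*. The "feature scorer"
# here is simply a noisy estimate of r* (an assumption for illustration).
import numpy as np

rng = np.random.default_rng(2)
p = 20
r_star = rng.uniform(0, 1, p)            # true feature relevance scores
n_datasets, n_samples = 5000, 30

# r(D): the mean of n_samples noisy observations of r*, one row per dataset D
results = r_star + rng.standard_normal((n_datasets, n_samples, p)).mean(axis=1)

error = np.mean(np.sum((results - r_star) ** 2, axis=1))
bias = np.sum((results.mean(axis=0) - r_star) ** 2)
variance = np.mean(np.sum((results - results.mean(axis=0)) ** 2, axis=1))
print(f"error={error:.4f}  bias+variance={bias + variance:.4f}")  # they match
```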
Bias, Variance and Error of a Monte Carlo Estimator
Feature selection (weighting) can be viewed as a Monte Carlo estimator: the relevance score of a feature is an expectation of a per-instance score $s(x)$ over the instance distribution $f(x)$, estimated by averaging over the $n$ training instances:
$$r = \int s(x)\, f(x)\, dx, \qquad \hat{r} = \frac{1}{n} \sum_{i=1}^{n} s(x_i),$$
with $\mathrm{Error} = E[(\hat{r} - r)^2]$, $\mathrm{Bias} = (E[\hat{r}] - r)^2$, and $\mathrm{Var} = E[(\hat{r} - E[\hat{r}])^2]$; for a plain Monte Carlo average the variance shrinks as $O(1/n)$.
Impact factors: the feature selection algorithm and the sample size.
Increasing the sample size is impractical and costly.
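The dependence on sample size is easy to demonstrate. The sketch below (an illustration with a synthetic stand-in score function, not the dissertation's estimator) shows the variance of the Monte Carlo average dropping roughly as 1/n.

```python
# Illustration: the variance of a Monte Carlo estimate r_hat = mean(s(x_i))
# decreases roughly as 1/n, so small samples give unstable relevance scores.
import numpy as np

rng = np.random.default_rng(3)
s = lambda x: x ** 2                      # a stand-in per-instance score s(x)
for n in [25, 100, 400]:
    estimates = [s(rng.standard_normal(n)).mean() for _ in range(20000)]
    print(f"n={n:>4}: Var(r_hat) = {np.var(estimates):.5f}")  # ~ Var(s)/n
```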
Variance Reduction via Sample Weighting
Importance sampling draws instances from an importance function $h(x)$ instead of the original probability density $f(x)$ and reweights each score by $f(x)/h(x)$. A good importance sampling function $h(x)$:
• draws more instances from important regions;
• draws fewer instances from other regions.
Instance weighting mimics this on a fixed sample:
• increase the weights of instances from important regions;
• decrease the weights of instances from other regions.
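For intuition, here is a minimal importance sampling sketch (a generic textbook example assuming SciPy, not the dissertation's setting): estimating a tail expectation under a standard normal. A proposal shifted toward the region that matters concentrates samples there and cuts the estimator's variance sharply.

```python
# Generic importance sampling illustration: estimate E_f[g(x)] with
# f = N(0,1) and g(x) = 1{x > 3} (a rare event). Sampling from a shifted
# proposal h = N(3,1) and weighting by f(x)/h(x) reduces variance sharply.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, reps = 1000, 2000
plain, weighted = [], []
for _ in range(reps):
    x = rng.standard_normal(n)                 # plain Monte Carlo from f
    plain.append(np.mean(x > 3))
    z = rng.normal(3.0, 1.0, n)                # proposal h centered on the region
    w = norm.pdf(z) / norm.pdf(z, loc=3.0)     # importance weights f(z)/h(z)
    weighted.append(np.mean((z > 3) * w))

true = 1 - norm.cdf(3)                         # ~0.00135
print(f"true={true:.5f}")
print(f"plain MC:   mean={np.mean(plain):.5f}  var={np.var(plain):.2e}")
print(f"importance: mean={np.mean(weighted):.5f}  var={np.var(weighted):.2e}")
```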
Outline
 Introduction and Motivation
 Background and Related Work
 Major Contributions
 Publications
 Theoretical Framework for Stable Feature Selection
 Empirical Framework : Margin Based Instance Weighting
 Empirical Study



General Experimental Setup
Experiments on Synthetic Data
Experiments on Real-World Data
 Conclusion and Future Work
17
Overall Framework
Challenges:
 How to produce weights for instances from the point of view of feature selection stability;
 How to present weighted instances to conventional feature selection algorithms.
Margin-based instance weighting for stable feature selection addresses both challenges.
Margin Vector Feature Space
For each instance $x$ in the original feature space, find its nearest hit $H(x)$ (the nearest neighbor of the same class) and its nearest miss $M(x)$ (the nearest neighbor of the other class). The hypothesis margin along each feature $i$,
$$m_i(x) = |x_i - M(x)_i| - |x_i - H(x)_i|,$$
maps $x$ into the margin vector feature space and captures the local profile of feature relevance for all features at $x$.
 Instances exhibit different profiles of feature relevance;
 Instances influence feature selection results differently.
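A small sketch of this mapping (assuming Euclidean nearest neighbors and the per-feature Relief-style margin given above; the variable names are mine):

```python
# Sketch: map each instance into the margin vector feature space using its
# nearest hit (same class) and nearest miss (other class), Relief-style.
import numpy as np

def margin_vectors(X, y):
    n, p = X.shape
    # pairwise Euclidean distances, with self-distance masked out
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    M = np.empty_like(X)
    for i in range(n):
        same, diff = (y == y[i]), (y != y[i])
        same[i] = False
        hit = X[np.where(same)[0][np.argmin(D[i, same])]]
        miss = X[np.where(diff)[0][np.argmin(D[i, diff])]]
        # per-feature hypothesis margin of instance i
        M[i] = np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return M

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 6)); X[:, 0] += np.repeat([0.0, 2.0], 20)
y = np.repeat([0, 1], 20)
print(margin_vectors(X, y).mean(axis=0))  # relevant feature 0 gets the largest mean margin
```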
An Illustrative Example
(Figure: (a) instances in the original feature space; (b) the same instances in the margin vector feature space.)
Extension of the 1NN Hypothesis Margin
 To reduce the effect of noise or outliers, the 1NN hypothesis margin extends to kNN (k > 1), averaging the margin over the k nearest hits and misses, and further to weighted kNN (k > 1), in which closer hits and misses contribute more to the average.
Margin Based Instance Weighting Algorithm
Instances exhibit different profiles of feature relevance and influence feature selection results differently. Recall variance reduction via importance sampling: more instances are drawn from important regions, fewer from other regions. Instance weighting transfers this idea to a fixed sample through the outlying degree of each instance in the margin vector feature space:
 higher outlying degree → lower weight;
 lower outlying degree → higher weight.
Iterative Margin Based Instance Weighting
Assumption: instances are equally important in the original feature space.
(Iteration: original feature space → margin vector feature space → instance weights → weighted feature space → margin vector feature space → updated instance weights → ... → final instance weights.)
 The iterative procedure always converges fast;
 There is little difference in the learned weights across iterations;
 Overall, it is a stable procedure.
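Combining the two sketches above, the iteration might look like the following skeleton (again a hypothetical reading of the elided algorithm: the weights feed back into the margin computation, and the loop stops once the weights stabilize, which the slide reports happens fast).

```python
# Hypothetical skeleton of the iterative procedure: weights feed back into
# the margin computation, and the loop stops once the weights stabilize.
import numpy as np

def iterative_weights(X, y, margin_fn, weight_fn, max_iter=10, tol=1e-4):
    w = np.ones(len(X))                    # all instances equally important at first
    for _ in range(max_iter):
        margins = margin_fn(X, y, w)       # margin vectors in the weighted space
        w_new = weight_fn(margins)         # re-derive the instance weights
        if np.max(np.abs(w_new - w)) < tol:
            break                          # converged (reported to happen fast)
        w = w_new
    return w
```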
Algorithm Illustration
Time complexity analysis:
o Dominated by instance weighting, which computes pairwise distances between instances (quadratic in the number of instances n, linear in the dimensionality d);
o Efficient for high-dimensional data with small sample size (n << d).
Objective of Empirical Study
 To demonstrate the bias-variance decomposition of the theoretical framework;
 To verify the effectiveness of the proposed instance weighting framework for variance reduction;
 To study the impact of variance reduction on the stability and predictive performance of the selected subsets.
(Diagram, as before: multiple versions of the training data → the same feature selection method → multiple feature subsets; consistent or not?)
Algorithms in Comparison
 SVM-RFE (baseline): recursively eliminates 10 percent of the remaining features each iteration; linear kernel and default C parameter.
 Relief-F (baseline): hypothesis-margin based (produces feature weights); aggregates margins over the k nearest neighbors along each feature dimension.
 IW SVM-RFE (instance weighting SVM-RFE): instance weights affect the error penalty and thus the choice of hyperplane.
 IW Relief-F (instance weighting Relief-F): instance weights affect the aggregated feature weights.
 En SVM-RFE (ensemble SVM-RFE): 20 bootstrapped training sets; aggregates the different rankings into a final consensus ranking.
 En Relief-F (ensemble Relief-F): 20 bootstrapped training sets; aggregates the different rankings into a final consensus ranking.
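As an illustration of how instance weights can be fed to a conventional selector, here is a minimal weighted SVM-RFE loop (a sketch assuming scikit-learn; the 10%-per-iteration schedule matches the slide, while the names and defaults are assumptions). The weights enter through `sample_weight`, which scales each instance's error penalty exactly as described above.

```python
# Sketch of instance-weighted SVM-RFE: instance weights scale the per-instance
# error penalty of the linear SVM, which changes the hyperplane and hence the
# feature ranking; each iteration drops 10% of the remaining features.
import numpy as np
from sklearn.svm import SVC

def iw_svm_rfe(X, y, weights, n_target=10):
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_target:
        clf = SVC(kernel="linear", C=1.0)
        clf.fit(X[:, remaining], y, sample_weight=weights)
        scores = clf.coef_.ravel() ** 2          # standard RFE criterion: w_j^2
        n_drop = min(max(1, int(0.10 * len(remaining))),
                     len(remaining) - n_target)
        remaining = remaining[np.argsort(scores)[n_drop:]]  # drop lowest-scoring
    return remaining
```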
Stability Measures and Predictive Accuracy Measures
Stability measures:
 Feature subsets: Jaccard Index, nPOGR, SIMv, Kuncheva Index;
 Feature rankings: Spearman rank correlation coefficient;
 Feature weightings: Pearson correlation coefficient.
For example, for two subsets $A$ and $B$ of equal size $k$ selected from $n$ features, with $r = |A \cap B|$, the Kuncheva Index corrects the overlap for chance: $\mathrm{KI}(A, B) = \frac{rn - k^2}{k(n - k)}$.
Predictive accuracy measures:
 CV accuracy: prediction accuracy based on cross-validation;
 AUC: the area under the receiver operating characteristic (ROC) curve.
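A short sketch of two of the subset-stability measures (implementing the standard definitions; averaging over all pairs of runs is my assumption about how a single score is reported):

```python
# Subset stability measures: Jaccard Index and Kuncheva Index for pairs of
# selected feature subsets, averaged over all pairs of selection runs.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva(a, b, n):
    # both subsets must have the same size k; corrects the overlap for chance
    k, r = len(a), len(set(a) & set(b))
    return (r * n - k * k) / (k * (n - k))

def avg_pairwise(subsets, measure):
    pairs = list(combinations(subsets, 2))
    return sum(measure(a, b) for a, b in pairs) / len(pairs)

subs = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 3, 4, 6}]
print(avg_pairwise(subs, jaccard))
print(avg_pairwise(subs, lambda a, b: kuncheva(a, b, n=100)))
```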
Experiments on Synthetic Data
Synthetic data generation:
 Feature values: drawn from one of two multivariate normal distributions; 100 feature groups with 10 features each; within each group, the covariance matrix is a 10*10 matrix with 1 along the diagonal and 0.8 off the diagonal;
 Class label: a weighted sum of all feature values under an optimal feature weight vector;
 500 training sets of 100 instances each (50 from each class) and a held-out test set of 5000 instances.
Method in comparison: SVM-RFE, recursively eliminating 10% of the features of the previous iteration until 10 features remain.
Measures: variance, bias, error; subset stability (Kuncheva Index); CV accuracy (SVM).
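A sketch of this generator (the stated 1/0.8 block covariance is from the slide; the specific weight vector, with only the first group relevant, and the lack of exact class balancing are my simplifications):

```python
# Sketch of the synthetic generator: 100 groups of 10 correlated features
# (unit variance, 0.8 within-group correlation); the label is a thresholded
# weighted sum of the feature values.
import numpy as np

def make_data(n, n_groups=100, group_size=10, rho=0.8, seed=0):
    rng = np.random.default_rng(seed)
    p = n_groups * group_size
    block = np.full((group_size, group_size), rho) + (1 - rho) * np.eye(group_size)
    L = np.linalg.cholesky(block)          # within-group correlation structure
    Z = rng.standard_normal((n, n_groups, group_size))
    X = (Z @ L.T).reshape(n, p)
    w = np.zeros(p); w[:group_size] = 1.0  # assumed weight vector: first group relevant
    y = (X @ w > 0).astype(int)            # label = thresholded weighted sum
    return X, y

X_train, y_train = make_data(100)          # one training set of 100 instances
X_test, y_test = make_data(5000, seed=1)   # held-out test set of 5000 instances
```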
Experiments on Synthetic Data
Observations:
 Error equals the sum of bias and variance for both versions of SVM-RFE;
 Error is dominated by bias during early iterations and by variance during later iterations;
 IW SVM-RFE exhibits significantly lower bias, variance, and error than SVM-RFE as the number of remaining features approaches 50.
Experiments on Synthetic Data
Conclusion: variance reduction via margin based instance weighting yields a better bias-variance trade-off, increased subset stability, and improved classification accuracy.
Experiments on Synthetic Data
Observations:
 The performance of SVM-RFE depends on the sample size;
 Instance weighting is effective at alleviating that dependency.
Experiments on Real-world Data
(Table: microarray datasets used in the study.)
Methods in comparison: SVM-RFE, Ensemble SVM-RFE, Instance Weighting SVM-RFE.
Two experimental settings:
 10-time 10-fold cross-validation; measures: subset stability (Kuncheva Index) and CV accuracies (KNN, SVM);
 100 bootstrapped training sets (random repetition: 2/3 training data, 1/3 test data); measures: subset stability (nPOGR) and AUC accuracies (KNN, SVM).
Experiments on Real-world Data
Observations:
 The methods are non-discriminative during early iterations;
 The curve for SVM-RFE increases sharply as the number of features approaches 10;
 IW SVM-RFE shows a significantly slower rate of increase.
Note: 40 iterations, starting from about 1000 features until 10 features remain.
Experiments on Real-world Data
Observations:
 Both the ensemble and instance weighting approaches improve stability consistently;
 The ensemble improvement is not as significant as that of instance weighting;
 As the number of features increases, the stability score decreases because of the larger correction factor.
Consistent results were observed under the random repetition setting (not included here).
Experiments on Real-world Data
Observations:
 Instance weighting enables the selection of more genes with higher frequency;
 Instance weighting produces much larger consensus gene signatures.
Experiments on Real-world Data
Conclusions:
 Instance weighting improves the stability of feature selection without sacrificing prediction accuracy;
 It performs much better than the ensemble approach and is more efficient;
 It leads to significantly increased stability at a slight extra cost in time.
Consistent results were observed under the random repetition setting (also for Relief-F); they can be found in the dissertation but are not included here for conciseness.
Conclusion and Future Work
Conclusion:
 Theoretical Framework for Stable Feature Selection;
 Empirical Weighting Framework for Stable Feature Selection;
 Effective and Efficient Margin Based Instance Weighting Approaches;
 Extensive Study on the Proposed Theoretical and Empirical Frameworks;
 Extensive Study on the Proposed Weighting Approaches;
 Extensive Study on the Sample Size Effect on Feature Selection Stability.
Future Work:
 Explore Other Weighting Approaches;
 Study the Relationship Between Feature Selection and Classification w.r.t. Bias-Variance Properties.
Thank you
and
Questions?