VARIANCE REDUCTION FOR
STABLE FEATURE SELECTION
Presenter: Yue Han
Advisor: Lei Yu
Department of Computer Science
10/27/10
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
INTRODUCTION AND MOTIVATION
FEATURE SELECTION APPLICATIONS
[Figure: example data matrices. A document-term matrix (documents D1 ... DM described by term counts for terms T1 ... TN, with class labels such as Sports vs. Travel), image data where pixels serve as features, and biological data where samples are described by genes or proteins as features.]
INTRODUCTION AND MOTIVATION
FEATURE SELECTION FROM HIGH-DIMENSIONAL DATA
High-dimensional data (p: # of features, n: # of samples; p >> n) suffers from the curse of dimensionality:
• effects on distance functions;
• in optimization and learning;
• in Bayesian statistics.
Pipeline: high-dimensional data → feature selection algorithm (mRMR, SVM-RFE, Relief-F, F-statistics, etc.) → low-dimensional data → learning models (classification, clustering, etc.).
Knowledge discovery on high-dimensional data. Feature selection:
• alleviates the effect of the curse of dimensionality;
• enhances generalization capability;
• speeds up the learning process;
• improves model interpretability.
INTRODUCTION AND MOTIVATION
STABILITY OF FEATURE SELECTION
[Figure: the same feature selection method applied to different training data sets produces different feature subsets. Are they consistent or not? This is the stability issue of feature selection.]
Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.
[Figure: analogously, a learning algorithm trained on different training data sets produces different learning models.]
The stability of learning algorithms was first examined by Turney in 1995; the stability of feature selection was relatively neglected before and has only recently attracted interest from data mining researchers.
INTRODUCTION AND MOTIVATION
MOTIVATION FOR STABLE FEATURE SELECTION
[Figure: two data sets D1 and D2 sampled from D (samples × features).]
Given an unlimited sample size of D: the feature selection results from D1 and D2 are the same.
When the size of D is limited (n << p for high-dimensional data): the feature selection results from D1 and D2 are different.
Challenge: increasing the number of samples can be very costly or impractical.
Experts in biology and biomedicine are interested in:
• not only the prediction accuracy but also the consistency of feature subsets;
• validating stable genes or proteins that are less sensitive to variations in the training data;
• biomarkers that explain the observed phenomena.
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
BACKGROUND AND RELATED WORK
FEATURE SELECTION METHODS
General procedure: original feature set → subset generation → candidate subset → subset evaluation (goodness of the subset) → stopping criterion; if the criterion is not met, generate the next subset, otherwise proceed to result validation.
Search strategies:
• Complete search
• Sequential search
• Random search
Evaluation criteria:
• Filter model
• Wrapper model
• Embedded model
Representative algorithms:
• Relief, SFS, MDLM, etc.
• FSBC, ELSA, LVW, etc.
• BBHFS, Dash-Liu's, etc.
BACKGROUND AND RELATED WORK
STABLE FEATURE SELECTION
Comparison of Feature Selection Algorithms w.r.t. Stability
(Davis et al. Bioinformatics, vol. 22, 2006; Kalousis et al. KAIS, vol. 12, 2007)
Quantify stability in terms of the consistency of selected subsets or feature weights;
Algorithms vary in stability while performing equally well for classification;
Choose the algorithm that is best in both stability and accuracy.
Bagging-based Ensemble Feature Selection
(Saeys et al. ECML07)
Draw different bootstrap samples from the same training set;
Apply a conventional feature selection algorithm to each sample;
Aggregate the feature selection results (a sketch follows this slide).
Group-based Stable Feature Selection
(Yu et al. KDD08; Loscalzo et al. KDD09)
Explore the intrinsic feature correlations;
Identify groups of correlated features;
Select relevant feature groups.
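The bagging-based ensemble idea can be made concrete with a short sketch. The code below is illustrative rather than the exact procedure of Saeys et al.: it assumes a generic per-feature scoring function (`score_features`, e.g. returning F-statistics) and aggregates the per-bag results by average rank; both choices are assumptions.

```python
import numpy as np

def ensemble_feature_selection(X, y, score_features, n_bags=20, k=10, seed=0):
    """Bagging-based ensemble feature selection (illustrative sketch).

    score_features(X, y) -> 1-D array of per-feature relevance scores
    (e.g. F-statistics); higher means more relevant.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rank_sum = np.zeros(p)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)          # bootstrap sample of the training set
        scores = score_features(X[idx], y[idx])   # conventional FS on this bag
        order = np.argsort(-scores)               # best feature first
        ranks = np.empty(p)
        ranks[order] = np.arange(p)               # rank 0 = most relevant
        rank_sum += ranks
    return np.argsort(rank_sum)[:k]               # aggregate: smallest average rank
```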
BACKGROUND AND RELATED WORK
MARGIN BASED FEATURE SELECTION
Sample margin: how much an instance can travel before it hits the decision boundary.
Hypothesis margin: how much the hypothesis can travel before it hits an instance (the distance between the hypothesis and the opposite hypothesis of an instance).
Representative algorithms: Relief, Relief-F, G-flip, Simba, etc.
In these algorithms, the margin is used for feature weighting or feature selection; our study uses margins in a totally different way, namely for instance weighting (a simplified Relief-style sketch follows this slide).
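As a reference point for how these algorithms use margins for feature weighting, here is a simplified Relief-style sketch for two classes. The L1 distance, the number of sampled instances, and the single nearest hit/miss are simplifying assumptions, not the exact Relief-F procedure.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Simplified Relief-style feature weighting for two classes (sketch).

    For each sampled instance, every feature gains weight in proportion to
    its distance to the nearest miss and loses weight in proportion to its
    distance to the nearest hit, so features that enlarge the hypothesis
    margin are rewarded.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(n)
        d = np.abs(X - X[i]).sum(axis=1)                  # L1 distances to x_i
        same = (y == y[i])
        same[i] = False                                   # exclude x_i itself
        diff = (y != y[i])
        hit = np.where(same, d, np.inf).argmin()          # nearest same-class instance
        miss = np.where(diff, d, np.inf).argmin()         # nearest opposite-class instance
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```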
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
PUBLICATIONS
• Yue Han and Lei Yu. An Empirical Study on Stability of Feature Selection Algorithms. Technical Report, Data Mining Research Laboratory, Binghamton University, 2009.
• Yue Han and Lei Yu. Margin Based Sample Weighting for Stable Feature Selection. In Proceedings of the 11th International Conference on Web-Age Information Management (WAIM 2010), pages 680-691, Jiuzhaigou, China, July 15-17, 2010.
• Yue Han and Lei Yu. A Variance Reduction Framework for Stable Feature Selection. In Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia, December 14-17, 2010. To appear.
• Lei Yu, Yue Han, and Michael E. Berens. Stable Gene Selection from Microarray Data via Sample Weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2010. Major revision under review.
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
THEORETICAL FRAMEWORK
BIAS-VARIANCE DECOMPOSITION OF FEATURE SELECTION ERROR
Training data: $D$; data space: $\mathcal{D}$; feature selection result: $r(D)$; true feature selection result: $r^*$.
Expected loss (error): $\mathrm{Err} = E_{D}\big[\lVert r(D) - r^* \rVert^2\big]$
Bias: $B = \lVert E_{D}[r(D)] - r^* \rVert^2$
Variance: $V = E_{D}\big[\lVert r(D) - E_{D}[r(D)] \rVert^2\big]$
Bias-variance decomposition of feature selection error: $\mathrm{Err} = B + V$
o Reveals the relationship between accuracy (the opposite of loss) and stability (the opposite of variance);
o Suggests seeking a better trade-off between the bias and variance of feature selection.
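A small sketch of how this decomposition can be computed when feature selection results are represented as weight vectors and the true weight vector is known (as in the synthetic experiments later). Averaging the squared loss over features is one possible instantiation of the loss above.

```python
import numpy as np

def bias_variance_error(results, r_star):
    """Decompose feature selection error into bias and variance (sketch).

    results : (m, p) array, one feature weight vector r(D_i) per training set D_i
    r_star  : (p,)  array, the true feature weight vector (known for synthetic data)
    """
    results = np.asarray(results, dtype=float)
    r_star = np.asarray(r_star, dtype=float)
    mean_r = results.mean(axis=0)                     # E_D[r(D)]
    error = ((results - r_star) ** 2).mean()          # expected squared loss
    bias = ((mean_r - r_star) ** 2).mean()            # squared bias term
    variance = ((results - mean_r) ** 2).mean()       # variance term
    return bias, variance, error                      # error == bias + variance
```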
THEORETICAL FRAMEWORK
VARIANCE REDUCTION VIA IMPORTANCE SAMPLING
Feature selection (weighting) viewed as a Monte Carlo estimator:
Relevance score of a feature: $r = E_{x \sim p}[f(x)]$, an expectation over the underlying data distribution $p(x)$.
Monte Carlo estimator from $n$ training instances: $\hat{r} = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$, with $x_i \sim p(x)$.
Variance of the Monte Carlo estimator: $\mathrm{Var}(\hat{r}) = \mathrm{Var}(f(x))/n$.
Impact factors: the feature selection algorithm and the sample size.
Increasing the sample size would reduce the variance, but is impractical and costly; importance sampling reduces variance instead by choosing a good importance sampling function $h(x)$.
Intuition behind $h(x)$: more instances are drawn from important regions, fewer instances from other regions.
This motivates instance weighting. Intuition behind instance weights: increase the weights of instances from important regions, decrease the weights of instances from other regions. (A numerical sketch follows this slide.)
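A minimal numerical sketch of the variance-reduction intuition, unrelated to any particular feature selection algorithm: the same expectation is estimated by plain Monte Carlo and by importance sampling from a distribution h concentrated on the important region. The target function and the specific distributions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Quantity to estimate: mu = E[f(X)] with X ~ N(0, 1), where f is
# concentrated in a small "important region" of the data space.
f = lambda x: np.where(x > 2.0, x, 0.0)
normal_pdf = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

# Plain Monte Carlo: few samples land where f matters, so the estimate is noisy.
x = rng.normal(0.0, 1.0, n)
plain_estimate = f(x).mean()

# Importance sampling: draw from h = N(2.5, 1), which covers the important
# region, and reweight every sample by p(z) / h(z).
z = rng.normal(2.5, 1.0, n)
weights = normal_pdf(z, 0.0) / normal_pdf(z, 2.5)
is_estimate = (f(z) * weights).mean()

print(plain_estimate, is_estimate)   # both target mu; the second has lower variance
```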
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
EMPIRICAL FRAMEWORK
OVERALL FRAMEWORK
Challenges:
• How to produce instance weights from the point of view of feature selection stability;
• How to present weighted instances to conventional feature selection algorithms.
Margin-based instance weighting for stable feature selection.
EMPIRICAL FRAMEWORK
MARGIN VECTOR FEATURE SPACE
Original space: for each instance $x$, find its nearest hit $H(x)$ (nearest neighbor of the same class) and nearest miss $M(x)$ (nearest neighbor of the opposite class).
Hypothesis margin of $x$, taken per feature: $|x - M(x)| - |x - H(x)|$. Collecting the per-feature margins of every instance maps the data into the margin vector feature space.
The margin vector captures the local profile of feature relevance for all features at $x$:
• Instances exhibit different profiles of feature relevance;
• Instances influence feature selection results differently.
(A sketch of this mapping follows this slide.)
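A sketch of the transformation into the margin vector feature space, assuming the per-feature hypothesis margin |x - M(x)| - |x - H(x)| with the nearest hit and miss found under an L1 distance; the distance choice is an assumption for illustration.

```python
import numpy as np

def margin_vectors(X, y):
    """Map instances into the margin vector feature space (sketch).

    For each instance x, find the nearest hit H(x) (same class) and the
    nearest miss M(x) (opposite class); the margin vector of x is
    |x - M(x)| - |x - H(x)| computed per feature, so each component is
    that feature's contribution to the local hypothesis margin.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    margins = np.empty_like(X)
    for i in range(X.shape[0]):
        d = np.abs(X - X[i]).sum(axis=1)               # L1 distances to x_i
        d[i] = np.inf                                   # exclude x_i itself
        hit = np.where(y == y[i], d, np.inf).argmin()   # nearest hit
        miss = np.where(y != y[i], d, np.inf).argmin()  # nearest miss
        margins[i] = np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return margins
```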
EMPIRICAL FRAMEWORK
AN ILLUSTRATIVE EXAMPLE
[Figure: hypothesis-margin based feature space transformation: (a) original feature space; (b) margin vector feature space.]
EMPIRICAL FRAMEWORK
MARGIN BASED INSTANCE WEIGHTING ALGORITHM
Each instance exhibits a different profile of feature relevance, so instances influence feature selection results differently.
Review: variance reduction via importance sampling draws more instances from important regions and fewer instances from other regions; instance weighting achieves the same effect by reweighting the instances at hand.
Weighting by outlying degree in the margin vector feature space:
• higher outlying degree → lower weight;
• lower outlying degree → higher weight.
(A sketch of this weighting step follows this slide.)
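A sketch of the weighting step. The slide does not give the exact formulas, so two assumptions are made explicit here: the outlying degree of an instance is taken as the average distance between its margin vector and all other margin vectors, and weights decay exponentially with the outlying degree.

```python
import numpy as np

def instance_weights(margins, gamma=1.0):
    """Margin-based instance weighting (sketch with assumed formulas).

    Outlying degree: average L1 distance between an instance's margin
    vector and every other instance's margin vector (an assumption).
    Weights decay exponentially with the outlying degree (also assumed),
    so outlying instances get low weight and typical instances get high
    weight; weights are normalized to average 1.
    """
    margins = np.asarray(margins, dtype=float)
    n = margins.shape[0]
    pairwise = np.abs(margins[:, None, :] - margins[None, :, :]).sum(axis=2)
    outlying = pairwise.sum(axis=1) / (n - 1)      # average distance to the others
    w = np.exp(-gamma * outlying)                  # low weight for outliers
    return w * n / w.sum()                         # normalize: mean weight = 1
```

These weights would then be presented to a conventional feature selection algorithm, for example as per-sample weights for the linear SVM inside SVM-RFE.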
EMPIRICAL FRAMEWORK
ALGORITHM ILLUSTRATION
Time complexity analysis:
o Dominated by the instance weighting step;
o Efficient for high-dimensional data with small sample size (n << d).
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
EMPIRICAL STUDY
SUBSET STABILITY MEASURES
[Figure: the same feature selection method applied to different training data sets yields different feature subsets. Are they consistent or not?]
Stability measures:
• Feature subsets: average pairwise similarity between the selected subsets, measured by the Jaccard index, nPOGR, or SIMv; Kuncheva index.
• Feature ranking: Spearman rank correlation coefficient.
• Feature weighting: Pearson correlation coefficient.
(A sketch of the Kuncheva index follows this slide.)
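The Kuncheva consistency index used below as the subset stability measure has a standard closed form; the sketch computes it for a pair of equal-size subsets and averages it over all pairs, matching the average pairwise similarity scheme above.

```python
import numpy as np
from itertools import combinations

def kuncheva_index(a, b, n_features):
    """Kuncheva consistency index between two feature subsets of equal size k:
    K = (r * n - k^2) / (k * (n - k)), where r = |a ∩ b|; it corrects the raw
    overlap for the agreement expected by chance."""
    a, b = set(a), set(b)
    k = len(a)
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))

def subset_stability(subsets, n_features):
    """Average Kuncheva index over all pairs of subsets selected from
    different training sets (average pairwise similarity)."""
    pairs = combinations(range(len(subsets)), 2)
    return float(np.mean([kuncheva_index(subsets[i], subsets[j], n_features)
                          for i, j in pairs]))
```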
EMPIRICAL STUDY
EXPERIMENTS ON SYNTHETIC DATA
Synthetic data generation:
• Feature values: drawn from two multivariate normal distributions; 100 feature groups with 10 features each; within each group the covariance matrix is a 10×10 matrix with 1 along the diagonal and 0.8 off the diagonal.
• Class label: a weighted sum of all feature values under an optimal feature weight vector.
• 500 training data sets of 100 instances each, with 50 instances drawn from each of the two distributions; left-out test data: 5,000 instances.
Method in comparison: SVM-RFE, recursively eliminating 10% of the remaining features at each iteration until 10 features remain.
Measures: variance, bias, error; subset stability (Kuncheva index); accuracy (SVM).
(A data-generation sketch follows this slide.)
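A sketch of data generation with the stated block covariance structure (100 groups of 10 features, 1 on the diagonal and 0.8 off it). The class means and the "optimal" weight vector are not specified on the slide, so the values used here are placeholders; generating the 500 training sets would amount to calling this function with 500 different seeds.

```python
import numpy as np

def synthetic_data(n_per_class=50, n_groups=100, group_size=10, seed=0):
    """Block-correlated synthetic data in the spirit of the experiment (sketch).

    Each group of 10 features has covariance 1 on the diagonal and 0.8 off it;
    half the instances are drawn around one mean and half around another, and
    the label is the sign of a weighted sum of all feature values. The class
    means and the 'optimal' weight vector below are placeholder choices.
    """
    rng = np.random.default_rng(seed)
    p = n_groups * group_size
    cov = np.full((group_size, group_size), 0.8)
    np.fill_diagonal(cov, 1.0)

    def draw(shift, n):
        mean = np.full(group_size, shift)
        # draw the feature groups independently and concatenate them feature-wise
        return np.hstack([rng.multivariate_normal(mean, cov, size=n)
                          for _ in range(n_groups)])

    X = np.vstack([draw(+0.5, n_per_class), draw(-0.5, n_per_class)])
    w_star = np.ones(p) / p                 # placeholder optimal weight vector
    y = np.sign(X @ w_star)                 # label = weighted sum of features
    return X, y, w_star
```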
EMPIRICAL STUDY
EXPERIMENTS ON SYNTHETIC DATA
Observations:
• Error equals the sum of bias and variance for both versions of SVM-RFE;
• Error is dominated by bias during early iterations and by variance during later iterations;
• IW SVM-RFE exhibits significantly lower bias, variance, and error than SVM-RFE when the number of remaining features approaches 50.
EMPIRICAL STUDY
EXPERIMENTS ON SYNTHETIC DATA
Conclusion: variance reduction via margin-based instance weighting yields a better bias-variance tradeoff, increased subset stability, and improved classification accuracy.
EMPIRICAL STUDY
EXPERIMENTS ON REAL-WORLD DATA
Microarray data.
Experiment setup: 10-fold cross-validation; feature selection is performed on the training folds and evaluated on the held-out test fold.
Methods in comparison:
• SVM-RFE;
• 20-Ensemble SVM-RFE: SVM-RFE is applied to 20 bootstrapped versions of the training data and the resulting feature subsets are aggregated;
• Instance Weighting (IW) SVM-RFE.
Measures: variance; subset stability; accuracies (KNN, SVM).
(An SVM-RFE sketch follows this slide.)
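A sketch of the SVM-RFE baseline in this setup using scikit-learn, with a linear SVM, 10% of the remaining features eliminated per iteration, and 10-fold cross-validation; it returns the per-fold feature subsets (for the stability measures) and test accuracies. The instance-weighted and ensemble variants would wrap this same routine.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def svm_rfe_cv(X, y, n_select=10, n_splits=10):
    """SVM-RFE baseline under 10-fold cross-validation (sketch).

    In every fold, linear SVM-RFE drops 10% of the remaining features per
    iteration until n_select remain; the selected subsets are kept for the
    stability measures and a linear SVM is evaluated on the held-out fold.
    """
    subsets, accuracies = [], []
    for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
        selector = RFE(SVC(kernel="linear"), n_features_to_select=n_select,
                       step=0.1).fit(X[train], y[train])
        mask = selector.support_
        subsets.append(np.flatnonzero(mask))
        clf = SVC(kernel="linear").fit(X[train][:, mask], y[train])
        accuracies.append(accuracy_score(y[test], clf.predict(X[test][:, mask])))
    return subsets, accuracies
```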
EMPIRICAL STUDY
EXPERIMENTS ON REAL-WORLD DATA
Observations:
• The methods are non-discriminative during early iterations;
• SVM-RFE increases sharply as the number of features approaches 10;
• IW SVM-RFE shows a significantly slower rate of increase.
Note: 40 iterations, starting from about 1,000 features until 10 features remain.
EMPIRICAL STUDY
EXPERIMENTS ON REAL-WORLD DATA
Observations:
• Both the ensemble and instance weighting approaches improve stability consistently;
• The improvement from the ensemble approach is not as significant as that from instance weighting;
• As the number of features increases, the stability score decreases because of the larger correction factor.
EMPIRICAL STUDY
EXPERIMENTS ON REAL-WORLD DATA
Conclusions: instance weighting
• improves the stability of feature selection without sacrificing prediction accuracy;
• performs much better than the ensemble approach and is more efficient;
• leads to significantly increased stability at a slight extra cost in time.
OUTLINE
• Introduction and Motivation
• Background and Related Work
• Preliminaries
• Publications
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
• Planned Tasks
PLANNED TASKS
OVERALL FRAMEWORK
[Diagram: overall framework for the planned work. The theoretical framework of feature selection stability and the empirical instance weighting framework (margin-based instance weighting, state-of-the-art weighting schemes, an iterative approach), applied to representative FS algorithms (SVM-RFE, Relief-F, F-statistics, HHSVM) and to various real-world data sets (gene data, text data), together with the relationship between feature selection stability and classification accuracy.]
PLANNED TASKS
LISTED TASKS
A. Extensive Study of the Instance Weighting Framework
   A1. Extension to various feature selection algorithms
   A2. Study on datasets from different domains
B. Development of Algorithms under the Instance Weighting Framework
   B1. Development of instance weighting schemes
   B2. Iterative approach for margin-based instance weighting
C. Investigation of the Relationship between Stable Feature Selection and Classification Accuracy
   C1. How bias-variance properties of feature selection affect classification accuracy
   C2. Study on various factors for the stability of feature selection
[Timeline: tasks A1, A2, B1, B2, C1, and C2 are scheduled across four periods: Oct-Dec 2010, Jan-Mar 2011, April-June 2011, and July-Aug 2011.]
Thank you
and
Questions?