Feature Selection
311
Goals
– What is Feature Selection for classification?
– Why is feature selection important?
– What is the filter and what is the wrapper
approach to feature selection?
– Examples
312
What is Feature Selection for
classification?
– Given: a set of predictors (“features”) V and a target
variable T
– Find: minimum set F that achieves maximum
classification performance of T (for a given set of
classifiers and classification performance metrics)
313
Why is feature selection important?
– May improve the performance of the classification algorithm
– The classification algorithm may not scale up to the size of the full feature set, either in sample size or running time
– Allows us to better understand the domain
– Cheaper to collect a reduced set of predictors
– Safer to collect a reduced set of predictors
314
Filters vs Wrappers: Wrappers
Say we have predictors A, B, C and classifier M. We want to
predict T given the smallest possible subset of {A,B,C},
while achieving maximal performance (accuracy)
FEATURE SET   CLASSIFIER   PERFORMANCE
{A,B,C}       M            98%
{A,B}         M            98%
{A,C}         M            77%
{B,C}         M            56%
{A}           M            89%
{B}           M            90%
{C}           M            91%
{}            M            85%
315
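The table above can be generated programmatically. Below is a minimal sketch of exhaustive wrapper search, assuming a NumPy feature matrix X, labels y, and scikit-learn's logistic regression as an illustrative stand-in for the classifier M (none of which come from the slides):

# Minimal sketch of exhaustive wrapper search: score every non-empty feature
# subset with cross-validation and keep the smallest best-scoring one.
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def exhaustive_wrapper(X, y, cv=5):
    n = X.shape[1]
    best_subset, best_score = None, float("-inf")
    for k in range(1, n + 1):                      # smallest subsets first
        for subset in combinations(range(n), k):
            clf = LogisticRegression(max_iter=1000)
            score = cross_val_score(clf, X[:, list(subset)], y, cv=cv).mean()
            if score > best_score:                 # strict ">" keeps smaller subsets on ties
                best_subset, best_score = subset, score
    return best_subset, best_score

With |V| predictors this evaluates 2^|V| − 1 subsets, which motivates the heuristic searches on the next slides.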
Filters vs Wrappers: Wrappers
The set of all subsets is the power set, and its size is 2^|V|.
Hence for large V we cannot run this procedure exhaustively;
instead we rely on heuristic search of the space of all possible
feature subsets.
[Figure: the lattice of feature subsets, searched from {} (85%) through the singletons {A} 89%, {B} 90%, {C} 91% and the pairs {A,B} 98%, {A,C} 77%, {B,C} 56%, up to {A,B,C} 98%.]
316
Filters vs Wrappers: Wrappers
A common example of heuristic search is hill climbing: keep
adding features one at a time until no further improvement
can be achieved.
[Figure: the same subset lattice, with hill climbing following the best single-feature additions until no further improvement remains.]
317
Filters vs Wrappers: Wrappers
A common example of heuristic search is hill climbing: keep adding
features one at a time until no further improvement can be achieved
(“forward greedy wrapping”)
Alternatively we can start with the full set of predictors and keep
removing features one at a time until no further improvement can be
achieved (“backward greedy wrapping”)
A third alternative is to interleave the two phases (adding and removing)
either in forward or backward wrapping (“forward-backward wrapping”).
Of course other forms of search can be used; most notably:
-Exhaustive search
-Genetic Algorithms
-Branch-and-Bound (e.g., cost=# of features, goal is to reach
performance th or better)
318
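As an illustration, here is a minimal sketch of forward greedy wrapping under the same assumptions as the earlier sketch (NumPy X, labels y, logistic regression as the stand-in classifier):

# Minimal sketch of forward greedy wrapping (hill climbing): repeatedly add
# the single feature that most improves cross-validated performance; stop
# when no addition helps.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_greedy_wrapper(X, y, cv=5):
    selected, best_score = [], float("-inf")
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score each candidate addition with cross-validation.
        scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:          # no further improvement: stop
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = scores[f_best]
    return selected, best_score

Backward greedy wrapping is the mirror image: start from the full feature set and repeatedly drop the feature whose removal hurts least.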
Filters vs Wrappers: Filters
In the filter approach we do not rely on running a particular
classifier and searching in the space of feature subsets;
instead we select features on the basis of statistical properties.
A classic example is univariate associations:
FEATURE   ASSOCIATION WITH TARGET
{A}       89%
{B}       90%
{C}       91%

[Figure annotation: depending on where the selection threshold is placed in this ranking, the resulting feature set may be optimal or suboptimal.]
319
Example Feature Selection Methods in
Biomedicine: Univariate Association Filtering
– Order all predictors according to strength of association
with target
– Choose the first k predictors and feed them to the
classifier
– Various measures of association may be used: χ², G²,
Pearson r, Fisher criterion scoring, etc.
– How to choose k?
– What if we have too many variables?
320
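A minimal sketch of this filter with scikit-learn; SelectKBest with the ANOVA F statistic (f_classif) stands in for the association measures named above, and k = 100 is an illustrative value:

# Minimal sketch of univariate association filtering: rank predictors by a
# univariate association score and keep the top k.
from sklearn.feature_selection import SelectKBest, f_classif

def top_k_features(X, y, k=100):
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    return selector.get_support(indices=True)   # indices of the k strongest predictors

One common answer to “how to choose k?” is to treat k as a tuning parameter and pick the value that maximizes the cross-validated performance of the downstream classifier.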
Example Feature Selection Methods in
Biomedicine: Recursive Feature Elimination
– Filter algorithm where feature selection is done as follows:
1. build linear Support Vector Machine classifiers using V
features
2. compute weights of all features and choose the best V/2
3. repeat until 1 feature is left
4. choose the feature subset that gives the best performance
(using cross-validation)
321
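A minimal sketch of this procedure using scikit-learn's RFECV and a linear SVM (an assumption about the setup, not the authors' exact code); step=0.5 discards the lower-weighted half of the remaining features each round, and cross-validation picks the best subset size:

# Minimal sketch of recursive feature elimination with a linear SVM:
# halve the feature set by weight each round; choose the subset size
# that cross-validates best.
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

def rfe_select(X, y):
    svm = SVC(kernel="linear", C=1.0)
    selector = RFECV(svm, step=0.5, min_features_to_select=1, cv=5).fit(X, y)
    return selector.support_        # boolean mask over the original features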
Example Feature Selection Methods in
Bioinformatics: GA/KNN
Wrapper approach whereby:
• heuristic search = Genetic Algorithm, and
• classifier = KNN
322
How do we approach the feature
selection problem in our research?
– Find the Markov Blanket
– Why?
323
A fundamental property of the Markov Blanket
– MB(T) is the minimal set of predictor variables needed for
classification (diagnosis, prognosis, etc.) of the target variable T
(given a powerful enough classifier and calibrated classification)
[Figure: a Bayesian network over many variables V around the target T; only the nodes adjacent to T and its spouses, i.e., the members of MB(T), are needed for classification.]
324
HITON: An algorithm for feature selection
that combines MB induction with wrapping
C.F. Aliferis M.D., Ph.D., I. Tsamardinos Ph.D., A.
Statnikov M.S.
Department of Biomedical Informatics, Vanderbilt
University
AMIA Fall Conference, November 2003
325
HITON: An algorithm for feature selection that combines MB
induction with wrapping
ALGORITHM                  SOUND   SCALABLE   SAMPLE EXPONENTIAL TO |MB|   COMMENTS
Cheng and Greiner          YES     NO         NO                           Post-processing on learned BN
Cooper et al.              NO      NO         NO                           Uses full BN learning
Margaritis and Thrun       YES     YES        YES                          Intended to facilitate BN learning
Koller and Sahami          NO      NO         NO                           Most widely-cited MB induction algorithm
Tsamardinos and Aliferis   YES     YES        YES                          Some use BN learning as a subroutine
HITON                      YES     YES        NO
326
HITON: An algorithm for feature selection
that combines MB induction with wrapping
Step #1: Find the parents and children of T; call this set PC(T).
Step #2: Find the PC(.) set of each member of PC(T); take the union of all these sets to be PCunion.
Step #3: Run a special test to filter out from PCunion the non-members of MB(T) that can be identified as such (not all can); call the resultant set TMB (tentative MB).
Step #4: Apply heuristic search with a desired classifier/loss function and cross-validation to identify variables that can be dropped from TMB without loss of accuracy.
327
HITON (Data D; Target T; Classifier A)
“returns a minimal set of variables required for optimal classification of T using algorithm A”
MB(T) = HITON-MB(D, T) // Identify Markov Blanket
Vars = Wrapper(MB(T), T, A) // Use wrapping to remove unnecessary variables
Return Vars
HITON-MB(Data D, Target T)
“returns the Markov Blanket of T”
PC = parents and children of T returned by HITON-PC(D, T)
PCPC = parents and children of the parents and children of T
CurrentMB = PC ∪ PCPC
// Retain only parents of common children and remove false positives
∀ potential spouse X in CurrentMB and ∀ Y in PC:
if not ∃ S in {Y} ∪ V − {T, X} so that (T ⊥ X | S)
then retain X in CurrentMB
else remove it
Return CurrentMB
HITON-PC(Data D, Target T)
“returns parents and children of T”
Wrapper(Vars, T, A)
“returns a minimal set among variables Vars for predicting T using algorithm A and a wrapping approach”
Select and remove a variable.
If the internally cross-validated performance of A remains the same, permanently remove the variable.
Continue until all variables are considered.
328
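A minimal sketch of the Wrapper() routine above, assuming a NumPy matrix X, labels y, and any scikit-learn classifier; the “performance remains the same” test is approximated by a small tolerance tol, which is an assumption since the slide does not specify one:

# Minimal sketch of Wrapper(Vars, T, A): consider each variable once; drop it
# permanently if cross-validated performance does not degrade.
from sklearn.model_selection import cross_val_score

def wrapper(X, y, vars_, clf, cv=5, tol=1e-6):
    current = list(vars_)
    base = cross_val_score(clf, X[:, current], y, cv=cv).mean()
    for v in list(current):                       # each variable is considered once
        trial = [u for u in current if u != v]
        if not trial:
            continue                              # keep at least one variable
        score = cross_val_score(clf, X[:, trial], y, cv=cv).mean()
        if score >= base - tol:                   # performance held up
            current, base = trial, score          # remove v permanently
    return current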
HITON-PC(Data D, Target T)
“returns parents and children of T”
CurrentPC = {}
Repeat
Find variable Vi not in CurrentPC that maximizes association(Vi, T) and admit Vi into CurrentPC
If there is a variable X in CurrentPC and a subset S of CurrentPC s.t. (X ⊥ T | S)
remove X from CurrentPC;
mark X and do not consider it again
Until no more variables are left to consider
Return CurrentPC
329
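A minimal sketch of HITON-PC; here assoc() and indep() are hypothetical stand-ins for the association measure and the conditional-independence test (e.g., G² tests), and max_cond caps the conditioning-set size for tractability, which is an implementation choice rather than part of the slide:

# Minimal sketch of HITON-PC: admit variables by strength of association,
# then eliminate any admitted variable rendered conditionally independent
# of T by some subset of the other admitted variables.
from itertools import combinations

def hiton_pc(variables, T, assoc, indep, max_cond=3):
    current_pc, dismissed = [], set()
    while True:
        open_vars = [v for v in variables
                     if v != T and v not in current_pc and v not in dismissed]
        if not open_vars:
            return current_pc
        vi = max(open_vars, key=lambda v: assoc(v, T))   # strongest association first
        current_pc.append(vi)
        for x in list(current_pc):
            others = [u for u in current_pc if u != x]
            for size in range(min(len(others), max_cond) + 1):
                if any(indep(x, T, S) for S in combinations(others, size)):
                    current_pc.remove(x)
                    dismissed.add(x)                     # never consider x again
                    break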
Figure 2: Dataset Characteristics

Dataset           Problem Type                Variable #   Variable Types               Target    Sample   Vars-to-Sample   Evaluation Metric   Design
Thrombin          Drug Discovery              139,351      binary                       binary    2,543    54.8             ROC AUC             1-fold c.v.
Arrhythmia        Clinical Diagnosis          279          nominal/ordinal/continuous   nominal   417      0.67             Accuracy            10-fold c.v.
Ohsumed           Text Categorization         14,373       binary and continuous        binary    2,000    7.2              ROC AUC             1-fold c.v.
Lung Cancer       Gene Expression Diagnosis   12,600       continuous                   binary    160      60               ROC AUC             5-fold c.v.
Prostate Cancer   Mass-Spec Diagnosis         779          continuous                   binary    326      2.4              ROC AUC             10-fold c.v.
330
Filters vs Wrappers: Which Is Best?
– None over all possible classification tasks!
– We can only prove that a specific filter (or wrapper) algorithm for a specific classifier (or class of classifiers) and a specific class of distributions yields optimal or sub-optimal solutions. Unless we provide such proofs, we are operating on faith and hope…
333
A final note: What is the biological
significance of selected features?
– In MB-based feature selection with CPN-faithful distributions: the causal neighborhood of the target (i.e., direct causes, direct effects, and direct causes of the direct effects of the target).
– In other methods: ???
334
Case Studies
335
Case Study: Categorizing Text Into
Content Categories
Automatic Identification of Purpose and Quality of
Articles In Journals Of Internal Medicine
Yin Aphinyanaphongs M.S.,
Constantin Aliferis M.D., Ph.D.
(presented in AMIA 2003)
336
Case Study: Categorizing Text Into
Content Categories
The problem: classify PubMed articles as
[high quality & treatment specific]
or not.
– Same function as the current Clinical Quality Filters of PubMed (in the treatment category).
337
Case Study: Categorizing Text Into
Content Categories
Overview:
– Select Gold Standard
– Corpus Construction
– Document Representation
– Cross-validation Design
– Train Classifiers
– Evaluate the Classifiers
339
Case Study: Categorizing Text Into
Content Categories
Select Gold Standard:
– ACP Journal Club: expert reviewers strictly evaluate and categorize, in each medical area, articles from the top journals in internal medicine.
– Their mission is “to select from the biomedical literature those articles reporting original studies and systematic reviews that warrant immediate attention by physicians.”
The treatment criteria (ACP Journal Club):
– “Random allocation of participants to comparison groups.”
– “80% follow up of those entering study.”
– “Outcome of known or probable clinical importance.”
If an article is cited by the ACP Journal Club, it is considered a high-quality article.
340
Case Study: Categorizing Text Into
Content Categories
Corpus construction:
– Get all articles from the 49 journals in the study period (8/1998 to 12/2000).
– Review ACP Journal Club from 8/1998 to 12/2000 for articles that are cited by the ACP.
– Result: 15,803 total articles, 396 positives (high-quality, treatment-related).
341
Case Study: Categorizing Text Into
Content Categories
Document representation:
– “Bag of words”
– Title, abstract, MeSH terms, publication type
– Term extraction and processing, e.g., “The clinical significance of cerebrospinal.”
1. Term extraction: “The”, “clinical”, “significance”, “of”, “cerebrospinal”
2. Stop-word removal: “Clinical”, “Significance”, “Cerebrospinal”
3. Porter stemming (i.e., reducing words to their roots): “Clinic*”, “Signific*”, “Cerebrospin*”
4. Term weighting: log frequency with redundancy
342
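A minimal sketch of this pipeline, assuming NLTK's Porter stemmer and a deliberately abbreviated stop-word list; the corpus-level “redundancy” factor in the slides' weighting is omitted for brevity:

# Minimal sketch: tokenize, drop stop words, Porter-stem, weight by log frequency.
import math, re
from collections import Counter
from nltk.stem import PorterStemmer   # pip install nltk

STOP_WORDS = {"the", "of", "a", "an", "and", "or", "in", "to", "is"}  # abbreviated list
stemmer = PorterStemmer()

def represent(text):
    terms = re.findall(r"[a-z]+", text.lower())                # 1. term extraction
    terms = [t for t in terms if t not in STOP_WORDS]          # 2. stop-word removal
    stems = [stemmer.stem(t) for t in terms]                   # 3. Porter stemming
    tf = Counter(stems)
    return {s: 1 + math.log(c) for s, c in tf.items()}         # 4. log-frequency weight

print(represent("The clinical significance of cerebrospinal."))
# expected: {'clinic': 1.0, 'signific': 1.0, 'cerebrospin': 1.0}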
Case Study: Categorizing Text Into
Content Categories
Cross-validation design:
[Figure: the 15,803 articles are split into 80% for training/validation, on which 10-fold cross-validation measures error, and a 20% reserved test set.]
343
Case Study: Categorizing Text Into
Content Categories
Classifier families:
– Naïve Bayes (no parameter optimization)
– Decision Trees with Boosting (# of iterations = # of simple rules)
– Linear & Polynomial Support Vector Machines (cost from {0.1, 0.2, 0.4, 0.7, 0.9, 1, 5, 10, 20, 100, 1000}, degree from {1, 2, 3, 5, 8})
344
Case Study: Categorizing Text Into
Content Categories
Evaluation metrics (averaged over 10 cross-validation folds):
– Sensitivity for fixed specificity
– Specificity for fixed sensitivity
– Area under ROC curve
– Area under 11-point precision-recall curve
– “Ranked retrieval”
345
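A minimal sketch of two of these metrics with scikit-learn, assuming binary labels y_true and continuous classifier scores; the 0.95 specificity floor is an illustrative value:

# Minimal sketch: sensitivity at a fixed specificity, read off the ROC curve,
# plus the area under the ROC curve.
from sklearn.metrics import roc_curve, roc_auc_score

def sensitivity_at_specificity(y_true, scores, target_spec=0.95):
    fpr, tpr, _ = roc_curve(y_true, scores)
    feasible = fpr <= (1.0 - target_spec)       # points meeting the specificity floor
    return tpr[feasible].max() if feasible.any() else 0.0

# auc = roc_auc_score(y_true, scores)           # area under the ROC curve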
Case Study: Categorizing Text Into
Content Categories
Classifiers    Average AUC over 10 folds   Range over 10 folds   p-value compared to largest
LinSVM         0.965                       0.948 – 0.978         0.01
PolySVM        0.976                       0.970 – 0.983         N/A
Naïve Bayes    0.948                       0.932 – 0.963         0.001
Boost Raw      0.957                       0.928 – 0.969         0.001
Boost Wght     0.941                       0.900 – 0.958         0.001
346
Case Study: Categorizing Text Into
Content Categories
Comparison at fixed sensitivity:
           Sensitivity            Specificity            Precision              Number Needed to Read (Average)
CQF        0.96                   0.75                   0.071                  14
Poly SVM   0.9673 (0.830–0.99)    0.8995 (0.884–0.914)   0.1744 (0.120–0.240)   6

Comparison at fixed specificity:
           Sensitivity            Specificity            Precision              Number Needed to Read (Average)
CQF        0.367                  0.959                  0.149                  6.7
Poly SVM   0.8181 (0.641–0.93)    0.959 (0.948–0.97)     0.2816 (0.191–0.388)   3.55
347
Case Study: Categorizing Text Into
Content Categories
[Figure: “Clinical Query Filter Performance” chart comparing the classifiers with the Clinical Query Filter.]
348
Case Study: Categorizing Text Into
Content Categories
[Figure: further ranked-retrieval comparison with the Clinical Query Filter.]
349
Case Study: Categorizing Text Into
Content Categories
Alternative/additional approaches?
– Negation detection
– Citation analysis
– Sequence of words
– Variable selection to produce user-understandable models
– Analysis of ACPJ potential bias
– Others???
350
Case Study: Diagnostic Model From
Array Gene Expression Data
Computational Models of Lung Cancer:
Connecting Classification, Gene Selection, and Molecular
Sub-typing
C. Aliferis M.D., Ph.D., Pierre Massion M.D.,
I. Tsamardinos Ph.D., D. Hardin Ph.D.
351
Case Study: Diagnostic Model From
Array Gene Expression Data
– Specific Aim 1: “Construct computational models that distinguish between important cellular states related to lung cancer, e.g., (i) Cancerous vs Normal Cells; (ii) Metastatic vs Non-Metastatic cells; (iii) Adenocarcinomas vs Squamous carcinomas.”
– Specific Aim 2: “Reduce the number of gene markers by application of biomarker (gene) selection algorithms such that small sets of genes can distinguish among the different states (and ideally reveal important genes in the pathophysiology of lung cancer).”
352
Case Study: Diagnostic Model From
Array Gene Expression Data
Bhattacharjee et al., PNAS, 2001:
– 12,600 gene expression measurements obtained using Affymetrix oligonucleotide arrays
– 203 patients and normal subjects, 5 disease types (plus staging and survival information)
353
Case Study: Diagnostic Model From
Array Gene Expression Data
– Linear and polynomial-kernel Support Vector Machines (LSVM and PSVM, respectively); C optimized via C.V. from {10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 0.1, 1, 10, 100, 1000} and degree from the set {1, 2, 3, 4}.
– K-Nearest Neighbors (KNN); k optimized via C.V.
– Feed-forward Neural Networks (NNs): 1 hidden layer, number of units chosen (heuristically) from the set {2, 3, 5, 8, 10, 30, 50}, variable-learning-rate back-propagation, custom-coded early stopping with (limiting) performance goal = 10^-8 (i.e., an arbitrary value very close to zero), number of epochs in the range [100, …, 10000], and a fixed momentum of 0.001.
– Stratified nested n-fold cross-validation (n = 5 or 7 depending on task).
354
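A minimal sketch of stratified nested cross-validation in scikit-learn terms, with abbreviated versions of the grids listed above (the grids and fold counts come from the slide; the rest is illustrative):

# Minimal sketch of nested cross-validation: an inner grid search tunes the
# polynomial SVM's C and degree; the outer loop estimates generalization AUC.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

param_grid = {"kernel": ["poly"],
              "C": [1e-8, 1e-4, 1e-2, 0.1, 1, 10, 100, 1000],   # subset of the slide's grid
              "degree": [1, 2, 3, 4]}
inner = GridSearchCV(SVC(), param_grid, cv=StratifiedKFold(n_splits=5))
# outer_auc = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5),
#                             scoring="roc_auc")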
Case Study: Diagnostic Model From
Array Gene Expression Data
– Area under the Receiver Operating Characteristic (ROC) curve (AUC), computed with the trapezoidal rule (DeLong et al. 1998).
– Statistical comparisons among AUCs were performed using a paired Wilcoxon rank sum test (Pagano et al. 2000).
– Gene values scaled linearly to [0, 1].
– Feature selection:
  – RFE (parameters as the ones used in Guyon et al. 2002)
  – UAF (Fisher criterion scoring; k optimized via C.V.)
355
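A minimal sketch of these evaluation pieces, assuming NumPy arrays of labels, scores, and per-fold AUCs:

# Minimal sketch: trapezoidal-rule AUC over the empirical ROC points, a paired
# Wilcoxon test on per-fold AUCs, and linear scaling of gene values to [0, 1].
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import roc_curve
from sklearn.preprocessing import MinMaxScaler

def trapezoidal_auc(y_true, scores):
    fpr, tpr, _ = roc_curve(y_true, scores)
    return np.trapz(tpr, fpr)                     # trapezoidal rule

# stat, p = wilcoxon(aucs_method_a, aucs_method_b)   # paired comparison across folds
# X01 = MinMaxScaler().fit_transform(X)              # per-gene linear scaling to [0, 1]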
Case Study: Diagnostic Model From
Array Gene Expression Data
Classification Performance (AUC; All = all features)

                            Cancer vs normal            Adenocarcinomas vs squamous carcinomas   Metastatic vs non-metastatic adenocarcinomas
Classifiers                 RFE      UAF      All       RFE      UAF      All                    RFE      UAF      All
LSVM                        97.03%   99.26%   99.64%    98.57%   99.32%   98.98%                 96.43%   95.63%   96.83%
PSVM                        97.48%   99.26%   99.64%    98.57%   98.70%   99.07%                 97.62%   96.43%   96.33%
KNN                         87.83%   97.33%   98.11%    91.49%   95.57%   97.59%                 92.46%   89.29%   92.56%
NN                          97.57%   99.80%   N/A       98.70%   99.63%   N/A                    96.83%   86.90%   N/A
Averages over classifiers   94.97%   98.91%   99.13%    96.83%   98.30%   98.55%                 95.84%   92.06%   95.24%
356
Case Study: Diagnostic Model From
Array Gene Expression Data
Gene selection: number of features discovered

Feature Selection Method   Cancer vs normal   Adenocarcinomas vs squamous carcinomas   Metastatic vs non-metastatic adenocarcinomas
RFE                        6                  12                                       6
UAF                        100                500                                      500
357
Case Study: Diagnostic Model From
Array Gene Expression Data
Novelty: features contributed by the method on the left compared with the method on the right

       Cancer vs normal   Adenocarcinomas vs squamous   Metastatic vs non-metastatic
       RFE     UAF        RFE     UAF                   RFE     UAF
RFE    0       2          0       5                     0       2
UAF    96      0          493     0                     496     0
358
Case Study: Diagnostic Model From
Array Gene Expression Data
A more detailed look:
– Specific Aim 3: “Study how aspects of experimental design (including data set, measured genes, sample size, cross-validation methodology) determine the performance and stability of several machine learning (classifier and feature selection) methods used in the experiments.”
359
Case Study: Diagnostic Model From
Array Gene Expression Data
– Overfitting: we replace actual gene measurements by random values in the same range (while retaining the outcome variable values).
– Target class rarity: we contrast performance in tasks with rare vs non-rare categories.
– Sample size: we use samples from the set {40, 80, 120, 160, 203} (as applicable in each task).
– Predictor info redundancy: we replace the full set of predictors by random subsets with sizes in the set {500, 1000, 5000, 12600}.
360
Case Study: Diagnostic Model From
Array Gene Expression Data
– Train-test split ratio: we use train-test ratios from the set {80/20, 60/40, 40/60} (for tasks II and III; for task I modified ratios were used due to the small number of positives, see Figure 1).
– Cross-validated fold construction: we construct n-fold cross-validation samples retaining the proportion of the rarer target category to the more frequent one in folds with smaller samples, or, alternatively, we ensure that all rare instances are included in the union of test sets (to maximize use of rare-case instances).
– Classifier type: kernel vs non-kernel and linear vs non-linear classifiers are contrasted. Specifically, we compare linear and non-linear SVMs (a prototypical kernel method) to each other and to KNN (a robust and well-studied non-kernel classifier and density estimator).
361
Case Study: Diagnostic Model From Array Gene Expression Data – Random gene values

Random gene values (AUC):
Classifier   Task I. Metastatic (7) – Non-metastatic (132)   Task II. Cancer (186) – Normal (17)   Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs         0.584 (139 cases)                               0.583 (203 cases)                     0.572 (160 cases)
KNN          0.581 (139 cases)                               0.522 (203 cases)                     0.559 (160 cases)

Actual gene values (AUC):
SVMs         0.968 (139 cases)                               0.996 (203 cases)                     0.990 (160 cases)
KNN          0.926 (139 cases)                               0.981 (203 cases)                     0.976 (160 cases)
362
Case Study: Diagnostic Model From Array Gene Expression Data – Varying sample size

Varying sample size (AUC):
Classifier   Task I. Metastatic (7) – Non-metastatic (132)   Task II. Cancer (186) – Normal (17)        Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs         0.982 (40 cases), 0.982 (80), 0.969 (120)       1 (40), 1 (80), 1 (120), 0.995 (160)       0.981 (40), 0.988 (80), 0.980 (120)
KNN          0.893 (40 cases), 0.832 (80), 0.925 (120)       1 (40), 1 (80), 0.993 (120), 0.970 (160)   0.916 (40), 0.960 (80), 0.965 (120)

Full sample (AUC):
SVMs         0.968 (139 cases)                               0.996 (203 cases)                          0.990 (160 cases)
KNN          0.926 (139 cases)                               0.981 (203 cases)                          0.976 (160 cases)
363
Case Study: Diagnostic Model From Array Gene Expression Data – Random gene selection

Random gene subsets (AUC):
Classifier   Task I. Metastatic (7) – Non-metastatic (132)    Task II. Cancer (186) – Normal (17)       Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs         0.944 (500 genes), 0.948 (1000), 0.956 (5000)    0.991 (500), 0.989 (1000), 0.995 (5000)   0.982 (500), 0.987 (1000), 0.990 (5000)
KNN          0.893 (500 genes), 0.893 (1000), 0.941 (5000)    0.959 (500), 0.961 (1000), 0.984 (5000)   0.928 (500), 0.955 (1000), 0.965 (5000)

Full gene set (AUC):
SVMs         0.968 (139 cases)                                0.996 (203 cases)                         0.990 (160 cases)
KNN          0.926 (139 cases)                                0.981 (203 cases)                         0.976 (160 cases)
364
Case Study: Diagnostic Model From Array Gene Expression Data – Split ratio

Train-test split ratio (AUC):
Classifier   Task I. Metastatic (7) – Non-metastatic (132)                               Task II. Cancer (186) – Normal (17)           Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs         0.915 (30/70), 0.938 (43/57), 0.954 (57/43), 0.962 (70/30), 0.968 (85/15)   0.997 (40/60), 0.996 (60/40), 0.996 (80/20)   0.960 (40/60), 0.962 (60/40), 0.976 (80/20)
KNN          0.782 (30/70), 0.833 (43/57), 0.866 (57/43), 0.901 (70/30), 0.990 (85/15)   0.989 (40/60), 0.990 (60/40), 0.990 (80/20)   0.960 (40/60), 0.962 (60/40), 0.976 (80/20)

Baseline (full data, AUC):
SVMs         0.968 (139 cases)                                                           0.996 (203 cases)                             0.990 (160 cases)
KNN          0.926 (139 cases)                                                           0.981 (203 cases)                             0.976 (160 cases)
365
Case Study: Diagnostic Model From Array Gene Expression Data – Use of rare categories

Rare-preserving fold construction (all rare instances in the union of test sets; AUC):
Classifier   Task I. Metastatic (7) – Non-metastatic (132)   Task II. Cancer (186) – Normal (17)           Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs         0.890 (40 cases), 0.993 (80), 0.965 (120)       1 (40), 1 (80), 1 (120), 0.995 (160)          1 (40), 1 (80), 0.985 (120)
KNN          0.918 (40 cases), 0.849 (80), 0.80 (120)        1 (40), 0.96 (80), 0.972 (120), 0.982 (160)   0.992 (40), 0.960 (80), 0.990 (120)

Proportional fold construction (AUC):
SVMs         0.982 (40 cases), 0.982 (80), 0.969 (120)       1 (40), 1 (80), 1 (120), 0.995 (160)          0.981 (40), 0.988 (80), 0.980 (120)
KNN          0.893 (40 cases), 0.832 (80), 0.925 (120)       1 (40), 1 (80), 0.993 (120), 0.970 (160)      0.916 (40), 0.960 (80), 0.965 (120)
366
Case Study: Diagnostic Model From
Array Gene Expression Data
– We have recently completed an extensive analysis of all multi-category gene expression-based cancer datasets in the public domain. The analysis spans >75 cancer types and >1,000 patients in 12 datasets.
– On the basis of this study we have created a tool that automatically analyzes data to create diagnostic systems and identify biomarker candidates using a variety of techniques.
– The present incarnation of the tool is oriented toward the computer-savvy researcher; a more biologist-friendly, web-accessible version is under development.
367
Case Study: Diagnostic Model From
Array Gene Expression Data
Questions:
– What would you do differently?
– How should we interpret the biological significance of the selected genes?
– What is wrong with having so many robust, good classification models?
– Why do we have so many good models?
368
Case Study: Diagnostic Model From
Array Gene Expression Data
MC-SVM (alpha) Demonstration
369
Case Study: Imputation for Machine
Learning Models For Lung Cancer
Classification Using Array Comparative
Genomic Hybridization
C.F. Aliferis M.D., Ph.D., D. Hardin Ph.D., P. P.
Massion M.D.
AMIA 2002
370
Case Study: A Protocol to Address the
Missing Values Problem
Context:
– Array comparative genomic hybridization (array CGH): a recently introduced technology that measures gene copy number changes of hundreds of genes in a single experiment. Gene copy number changes (deletion, amplification) are often characteristic of disease, and of cancer in particular.
– Studies published in recent years have shown that aCGH enables the development of powerful classification models, facilitates the selection of genes for array design, and supports the identification of likely oncogenes in a variety of cancers (e.g., esophageal, renal, head/neck, lymphomas, breast, and glioblastomas).
– Interestingly, a recent study (Fritz et al., June 2002) has shown that aCGH enables better classification of liposarcoma differentiation than gene expression information.
371
Case Study: A Protocol to Address the
Missing Values Problem
Context:
– While significant experience has been gathered so far in applying various machine learning/data mining approaches to the development of diagnostic/classification models from gene expression microarray data for lung cancer, and from aCGH in a variety of cancers, little was known about the feasibility of using machine learning methods with aCGH data to create such models.
– In this study we conducted such an experiment for the classification of non-small-cell lung cancers (NSCLCs) as squamous carcinomas (SqCa) or adenocarcinomas (AdCa). A related goal was to compare several machine learning methods on this learning task.
372
Case Study: A Protocol to Address the
Missing Values Problem
Context:
– DNA from tumors of 37 patients (21 squamous carcinomas (SqCa) and 16 adenocarcinomas (AdCa)) was extracted after microdissection and hybridized onto a 452 BAC clone array (printed in quadruplicate) carrying genes of potential importance in cancer.
– aCGH is a technology in the formative stages of development. As a result, a high percentage of missing values was observed in most gene measurements.
373
Case Study: A Protocol to Address the
Missing Values Problem
We decided to create a protocol for gene inclusion/exclusion for analysis on the basis of three criteria:
– (a) the percentage of missing values,
– (b) the a priori importance of a gene (based on a known functional role in pathways implicated in carcinogenesis, such as the PI3-kinase pathway), and
– (c) whether the existence of missing values was statistically significantly associated with the class to be predicted (at the 0.05 level, as determined by a G² test).
374
Case Study: A Protocol to Address the
Missing Values Problem
1. For each gene Gi compute an indicator variable MGi s.t. MGi = 1 in cases where Gi is missing, and 0 in cases where Gi is observed
2. Compute the association of MGi with the class variable C, assoc(MGi, C), for every i (C takes values in {SqCa, AdCa})
3. Accept a set of important genes I
4. if assoc(MGi, C) is statistically significant then
       reject gene Gi
   else if Gi ∈ I then
       accept Gi
   else if the fraction of missing values of Gi is > 15% then
       reject Gi
   else
       accept Gi

388 variables were selected according to this protocol and were imputed before analysis.
375
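A minimal sketch of this protocol, assuming a pandas DataFrame of gene measurements (NaN = missing), a class vector, and an analyst-supplied set I of a priori important genes; scipy's log-likelihood chi-square option serves as the G² test:

# Minimal sketch of the gene inclusion/exclusion protocol described above.
import pandas as pd
from scipy.stats import chi2_contingency

def select_genes(df, cls, important_genes, max_missing=0.15, alpha=0.05):
    keep = []
    for gene in df.columns:
        m = df[gene].isna()                         # indicator variable MGi
        table = pd.crosstab(m, cls)
        if table.shape == (2, 2):
            _, p, _, _ = chi2_contingency(table, lambda_="log-likelihood")  # G^2 test
            if p < alpha:
                continue                            # missingness predicts class: reject
        if gene in important_genes:
            keep.append(gene)                       # a priori important: accept
        elif m.mean() <= max_missing:
            keep.append(gene)                       # few missing values: accept
    return keep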
Case Study: A Protocol to Address the
Missing Values Problem
K-Nearest Neighbors (KNN) method for imputation:
– For each instance of a gene that had a missing value, the case closest to the case containing that missing value (i.e., the closest neighbor) that did have an observed value for the gene was found using Euclidean distance (ED). That value was substituted for the missing one.
– To compute distances between cases with missing values: if one of the two corresponding gene measurements was missing, the mean of the observed values for that gene across all cases was used to compute the ED component.
– When both values were missing, the mean observed difference was used for the ED component.
The above procedure was chosen because it is non-parametric and multivariate. More naïve approaches (such as imputing with the mean or with a random value from the observed distribution for the gene) typically produced uniformly worse models.
A variant of the above method iterates the KNN imputation until convergence is attained.
A sketch of this imputation scheme follows below.
376
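A minimal sketch of this imputation scheme, assuming a NumPy cases-by-genes matrix with NaN for missing values and at least one observed value per gene; the per-gene mean observed pairwise difference is precomputed for the both-missing case:

# Minimal sketch of the 1-nearest-neighbor imputation described above,
# including the special handling of missing entries inside the distance.
import numpy as np

def impute_1nn(X):
    """X: cases x genes array, np.nan marks missing values."""
    X = X.astype(float)
    col_mean = np.nanmean(X, axis=0)               # observed mean per gene

    def mean_pair_diff(col):                       # mean observed pairwise difference
        v = col[~np.isnan(col)]
        return np.abs(v[:, None] - v[None, :]).mean() if v.size > 1 else 0.0

    pair_diff = np.array([mean_pair_diff(X[:, j]) for j in range(X.shape[1])])

    def distance(a, b):
        d = np.empty(a.size)
        ma, mb = np.isnan(a), np.isnan(b)
        obs = ~ma & ~mb
        d[obs] = a[obs] - b[obs]                         # both observed: usual component
        d[ma & ~mb] = col_mean[ma & ~mb] - b[ma & ~mb]   # one missing: use the gene mean
        d[~ma & mb] = a[~ma & mb] - col_mean[~ma & mb]
        d[ma & mb] = pair_diff[ma & mb]                  # both missing: mean observed diff
        return np.sqrt((d ** 2).sum())

    filled = X.copy()
    for i in range(X.shape[0]):
        for j in np.where(np.isnan(X[i]))[0]:
            donors = np.where(~np.isnan(X[:, j]))[0]     # cases with gene j observed
            donors = donors[donors != i]
            nearest = min(donors, key=lambda k: distance(X[i], X[k]))
            filled[i, j] = X[nearest, j]
    return filled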
Case Study: A Protocol to Address the
Missing Values Problem
Important note: a clear understanding of what types of missing values we have, and of the mechanisms that generate them, is required, and sometimes translates into the ability to effectively replace missing values as well as avoid serious biases.
Example types of missing values:
– Value is not produced by device or respondent, etc.
– Value was produced but not entered in the study database
– Value was produced and entered but is deemed invalid on substantive grounds
– Value was produced and entered but was subsequently corrupted by transmission/storage or data conversion operations, etc.
Example where knowing the process that generates missing values may be sufficient to fill them in: a physician does not obtain a measurement (say, an x-ray) because an equivalent or more informative test has been conducted (e.g., MRI), or because it is physiologically impossible for a change to have occurred since the last measurement, etc.
377