Case Studies
339
Case Study: Diagnostic Model From
Array Gene Expression Data
Computational Models of Lung Cancer:
Connecting Classification, Gene Selection, and Molecular
Sub-typing
C. Aliferis M.D., Ph.D., Pierre Massion M.D.,
I. Tsamardinos Ph.D., D. Hardin Ph.D.
340
Case Study: Diagnostic Model From
Array Gene Expression Data


• Specific Aim 1: “Construct computational models that distinguish between important cellular states related to lung cancer, e.g., (i) Cancerous vs Normal Cells; (ii) Metastatic vs Non-Metastatic Cells; (iii) Adenocarcinomas vs Squamous Carcinomas.”
• Specific Aim 2: “Reduce the number of gene markers by application of biomarker (gene) selection algorithms such that small sets of genes can distinguish among the different states (and ideally reveal important genes in the pathophysiology of lung cancer).”
341
Case Study: Diagnostic Model From
Array Gene Expression Data
Bhattacharjee et al., PNAS, 2001
• 12,600 gene expression measurements obtained using Affymetrix oligonucleotide arrays
• 203 patients and normal subjects, 5 disease types (plus staging and survival information)
342
Case Study: Diagnostic Model From
Array Gene Expression Data




• Linear and polynomial-kernel Support Vector Machines (LSVM and PSVM, respectively): C optimized via cross-validation from {10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 0.1, 1, 10, 100, 1000} and degree from the set {1, 2, 3, 4}.
• K-Nearest Neighbors (KNN): k optimized via cross-validation.
• Feed-forward Neural Networks (NNs): 1 hidden layer, number of units chosen (heuristically) from the set {2, 3, 5, 8, 10, 30, 50}, variable-learning-rate back-propagation, custom-coded early stopping with (limiting) performance goal = 10^-8 (i.e., an arbitrary value very close to zero), number of epochs in the range [100, …, 10000], and a fixed momentum of 0.001.
• Stratified nested n-fold cross-validation (n = 5 or 7 depending on task).
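This tuning scheme can be sketched with scikit-learn's nested cross-validation. This is an illustrative assumption, not the study's own implementation: the data below is synthetic, but the parameter grid is the one quoted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the gene-expression matrix.
X, y = make_classification(n_samples=120, n_features=50, random_state=0)

param_grid = {
    "C": [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100, 1000],
    "degree": [1, 2, 3, 4],  # degree 1 behaves like the linear SVM
}
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop selects (C, degree); outer loop estimates AUC of the tuned model.
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=inner, scoring="roc_auc")
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(scores.mean())
```

Keeping the parameter search inside each outer training fold is what makes the outer estimate honest: parameters never see the outer test fold.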
343
Case Study: Diagnostic Model From
Array Gene Expression Data

• Area under the Receiver Operating Characteristic (ROC) curve (AUC), computed with the trapezoidal rule (DeLong et al. 1998).
• Statistical comparisons among AUCs were performed using a paired Wilcoxon rank sum test (Pagano et al. 2000).
• Gene values were scaled linearly to [0, 1].
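The trapezoidal AUC and the linear scaling to [0, 1] can be sketched in a few lines (illustrative code, not the study's):

```python
import numpy as np

def trapezoidal_auc(y_true, y_score):
    """Area under the ROC curve via the trapezoidal rule."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    tpr = np.r_[0.0, np.cumsum(y) / y.sum()]
    fpr = np.r_[0.0, np.cumsum(1 - y) / (len(y) - y.sum())]
    # Sum of trapezoid areas under the (fpr, tpr) curve.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def scale_unit_interval(X):
    """Linearly rescale each column (gene) to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(trapezoidal_auc(y, s))  # 0.75: three of the four pos/neg pairs are ranked correctly
```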

• Feature selection:
– RFE (parameters as in Guyon et al. 2002)
– UAF (Fisher criterion scoring; k optimized via cross-validation)
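The two strategies can be sketched on synthetic data: linear-SVM recursive feature elimination (in the spirit of Guyon et al. 2002) and a hand-rolled Fisher criterion for UAF. The value k = 10 is an arbitrary placeholder, not the study's setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=200, n_informative=10,
                           random_state=0)

# RFE: repeatedly drop the genes with the smallest linear-SVM weights.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10, step=0.5)
rfe_idx = np.where(rfe.fit(X, y).support_)[0]

# UAF: rank genes univariately by the Fisher criterion (between-class mean
# separation over within-class variance) and keep the top k.
def fisher_scores(X, y):
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    v0, v1 = X[y == 0].var(0), X[y == 1].var(0)
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)

uaf_idx = np.argsort(-fisher_scores(X, y))[:10]
print(len(rfe_idx), len(uaf_idx))
```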
344
Case Study: Diagnostic Model From
Array Gene Expression Data

Classification Performance (AUC, by task and feature selection method)

Classifier | Adenocarcinomas vs squamous: RFE / UAF / All Features | Cancer vs normal: RFE / UAF / All Features | Metastatic vs non-metastatic: RFE / UAF / All Features
LSVM | 97.03% / 99.26% / 99.64% | 98.57% / 99.32% / 98.98% | 96.43% / 95.63% / 96.83%
PSVM | 97.48% / 99.26% / 99.64% | 98.57% / 98.70% / 99.07% | 97.62% / 96.43% / 96.33%
KNN | 87.83% / 97.33% / 98.11% | 91.49% / 95.57% / 97.59% | 92.46% / 89.29% / 92.56%
NN | 97.57% / 99.80% / N/A | 98.70% / 99.63% / N/A | 96.83% / 86.90% / N/A
Averages over classifiers | 94.97% / 98.91% / 99.13% | 96.83% / 98.30% / 98.55% | 95.84% / 92.06% / 95.24%
345
Case Study: Diagnostic Model From
Array Gene Expression Data

Gene selection: number of features discovered

Feature Selection Method | Cancer vs normal | Adenocarcinomas vs squamous carcinomas | Metastatic vs non-metastatic adenocarcinomas
RFE | 6 | 12 | 6
UAF | 100 | 500 | 500
346
Case Study: Diagnostic Model From
Array Gene Expression Data

Novelty: number of genes contributed by the method on the left compared with the method on the right

Method | Cancer vs normal: vs RFE / vs UAF | Adenocarcinomas vs squamous: vs RFE / vs UAF | Metastatic vs non-metastatic: vs RFE / vs UAF
RFE | 0 / 2 | 0 / 5 | 0 / 2
UAF | 96 / 0 | 493 / 0 | 496 / 0
347
Case Study: Diagnostic Model From
Array Gene Expression Data

• A more detailed look:
– Specific Aim 3: “Study how aspects of experimental design (including data set, measured genes, sample size, cross-validation methodology) determine the performance and stability of several machine learning (classifier and feature selection) methods used in the experiments.”
348
Case Study: Diagnostic Model From
Array Gene Expression Data




• Overfitting: we replace actual gene measurements by random values in the same range (while retaining the outcome variable values).
• Target class rarity: we contrast performance in tasks with rare vs non-rare categories.
• Sample size: we use samples from the set {40, 80, 120, 160, 203} (as applicable in each task).
• Predictor info redundancy: we replace the full set of predictors by random subsets with sizes in the set {500, 1000, 5000, 12600}.
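The overfitting check in the first bullet can be sketched as follows (synthetic data; the classifier and fold count are placeholder choices, not the study's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=60, random_state=0)

# Random surrogate: uniform noise spanning each gene's observed range,
# with the outcome labels kept intact.
X_rand = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

auc_real = cross_val_score(SVC(kernel="linear"), X, y, cv=5,
                           scoring="roc_auc").mean()
auc_rand = cross_val_score(SVC(kernel="linear"), X_rand, y, cv=5,
                           scoring="roc_auc").mean()
print(auc_real, auc_rand)  # noise AUC should hover near 0.5
```

If the pipeline were overfitting (e.g., feature selection leaking into the test folds), the noise AUC would stay well above chance.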
349
Case Study: Diagnostic Model From
Array Gene Expression Data



• Train-test split ratio: we use train-test ratios from the set {80/20, 60/40, 40/60} (for tasks II and III; for task I modified ratios were used due to the small number of positives, see Figure 1).
• Cross-validated fold construction: we construct n-fold cross-validation samples retaining the proportion of the rarer target category to the more frequent one in folds with smaller samples, or, alternatively, we ensure that all rare instances are included in the union of test sets (to maximize use of rare-case instances).
• Classifier type: kernel vs non-kernel and linear vs non-linear classifiers are contrasted. Specifically, we compare linear and non-linear SVMs (a prototypical kernel method) to each other and to KNN (a robust and well-studied non-kernel classifier and density estimator).
350
Case Study: Diagnostic Model From Array Gene Expression Data: Random gene values

AUC with gene values replaced by random values:

Classifier | Task I. Metastatic (7) – Nonmetastatic (132) | Task II. Cancer (186) – Normal (17) | Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs | 0.584 (139 cases) | 0.583 (203 cases) | 0.572 (160 cases)
KNN | 0.581 (139 cases) | 0.522 (203 cases) | 0.559 (160 cases)

AUC with the actual gene values (for comparison):

Classifier | Task I | Task II | Task III
SVMs | 0.968 (139 cases) | 0.996 (203 cases) | 0.990 (160 cases)
KNN | 0.926 (139 cases) | 0.981 (203 cases) | 0.976 (160 cases)
351
Case Study: Diagnostic Model From Array Gene Expression Data: Varying sample size

AUC at varying sample sizes:

Classifier | Task I. Metastatic (7) – Nonmetastatic (132) | Task II. Cancer (186) – Normal (17) | Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs | 0.982 (40 cases), 0.982 (80 cases), 0.969 (120 cases) | 1 (40 cases), 1 (80 cases), 1 (120 cases), 0.995 (160 cases) | 0.981 (40 cases), 0.988 (80 cases), 0.980 (120 cases)
KNN | 0.893 (40 cases), 0.832 (80 cases), 0.925 (120 cases) | 1 (40 cases), 1 (80 cases), 0.993 (120 cases), 0.970 (160 cases) | 0.916 (40 cases), 0.960 (80 cases), 0.965 (120 cases)

AUC with the full sample (for comparison):

Classifier | Task I | Task II | Task III
SVMs | 0.968 (139 cases) | 0.996 (203 cases) | 0.990 (160 cases)
KNN | 0.926 (139 cases) | 0.981 (203 cases) | 0.976 (160 cases)
352
Case Study: Diagnostic Model From Array Gene Expression Data: Random gene selection

AUC with random subsets of genes:

Classifier | Task I. Metastatic (7) – Nonmetastatic (132) | Task II. Cancer (186) – Normal (17) | Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs | 0.944 (500 genes), 0.948 (1000 genes), 0.956 (5000 genes) | 0.991 (500 genes), 0.989 (1000 genes), 0.995 (5000 genes) | 0.982 (500 genes), 0.987 (1000 genes), 0.990 (5000 genes)
KNN | 0.893 (500 genes), 0.893 (1000 genes), 0.941 (5000 genes) | 0.959 (500 genes), 0.961 (1000 genes), 0.984 (5000 genes) | 0.928 (500 genes), 0.955 (1000 genes), 0.965 (5000 genes)

AUC with all 12,600 genes (for comparison):

Classifier | Task I | Task II | Task III
SVMs | 0.968 (139 cases) | 0.996 (203 cases) | 0.990 (160 cases)
KNN | 0.926 (139 cases) | 0.981 (203 cases) | 0.976 (160 cases)
353
Case Study: Diagnostic Model From Array Gene Expression Data: Split ratio

AUC at varying train/test split ratios:

Classifier | Task I. Metastatic (7) – Nonmetastatic (132) | Task II. Cancer (186) – Normal (17) | Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs | 0.915 (30/70), 0.938 (43/57), 0.954 (57/43), 0.962 (70/30), 0.968 (85/15) | 0.997 (40/60), 0.996 (60/40), 0.996 (80/20) | 0.960 (40/60), 0.962 (60/40), 0.976 (80/20)
KNN | 0.782 (30/70), 0.833 (43/57), 0.866 (57/43), 0.901 (70/30), 0.990 (85/15) | 0.989 (40/60), 0.990 (60/40), 0.990 (80/20) | 0.960 (40/60), 0.962 (60/40), 0.976 (80/20)

AUC from the main analysis (for comparison):

Classifier | Task I | Task II | Task III
SVMs | 0.968 (139 cases) | 0.996 (203 cases) | 0.990 (160 cases)
KNN | 0.926 (139 cases) | 0.981 (203 cases) | 0.976 (160 cases)
354
Case Study: Diagnostic Model From Array Gene Expression Data: Use of rare categories

AUC with fold construction maximizing use of rare-case instances:

Classifier | Task I. Metastatic (7) – Nonmetastatic (132) | Task II. Cancer (186) – Normal (17) | Task III. Adenocarcinomas (139) – Squamous carcinomas (21)
SVMs | 0.890 (40 cases), 0.993 (80 cases), 0.965 (120 cases) | 1 (40 cases), 1 (80 cases), 1 (120 cases), 0.995 (160 cases) | 1 (40 cases), 1 (80 cases), 0.985 (120 cases)
KNN | 0.918 (40 cases), 0.849 (80 cases), 0.80 (120 cases) | 1 (40 cases), 0.96 (80 cases), 0.972 (120 cases), 0.982 (160 cases) | 0.992 (40 cases), 0.960 (80 cases), 0.990 (120 cases)

AUC with proportion-preserving fold construction (for comparison):

Classifier | Task I | Task II | Task III
SVMs | 0.982 (40 cases), 0.982 (80 cases), 0.969 (120 cases) | 1 (40 cases), 1 (80 cases), 1 (120 cases), 0.995 (160 cases) | 0.981 (40 cases), 0.988 (80 cases), 0.980 (120 cases)
KNN | 0.893 (40 cases), 0.832 (80 cases), 0.925 (120 cases) | 1 (40 cases), 1 (80 cases), 0.993 (120 cases), 0.970 (160 cases) | 0.916 (40 cases), 0.960 (80 cases), 0.965 (120 cases)
355
Case Study: Diagnostic Model From
Array Gene Expression Data

• Questions:
– What would you do differently?
– How should the biological significance of the selected genes be interpreted?
– What is wrong with having so many robust, good classification models?
– Why do we have so many good models?
356
Case Study: Diagnostic Model From
Array Gene Expression Data



• We have recently completed an extensive analysis of all multi-category gene expression-based cancer datasets in the public domain. The analysis spans >75 cancer types and >1,000 patients in 12 datasets.
• On the basis of this study we have created a tool (GEMS) that automatically analyzes data to create diagnostic systems and identify biomarker candidates using a variety of techniques.
• The present incarnation of the tool is oriented toward the computer-savvy researcher; a more biologist-friendly web-accessible version is under development.
357
Case Study: Diagnostic Model From
Array Gene Expression Data
GEMS System
“Methods for Multi-Category Cancer Diagnosis from
Gene Expression Data: A Comprehensive
Evaluation to Inform Decision Support System
Development”
A. Statnikov, C.F. Aliferis, I. Tsamardinos
AMIA/MEDINFO 2004
358
Case Study: Diagnostic Modeling From
Mass Spectrometry Data
359
Creating a Tool (FAST-AIMS) for Cancer
Diagnostic Decision Support Using Mass
Spectrometry Data
Nafeh Fananapazir
Department of Biomedical Informatics
Vanderbilt University
Academic Committee: Constantin Aliferis (Primary Advisor), Dean Billheimer, Douglas
Hardin, Shawn Levy, Daniel Liebler, Ioannis Tsamardinos
360
Introduction
Problem:
• In the last two years, we have seen the emergence of mass spectrometry in the domain of cancer diagnosis.
• Mass spectrometry on biological samples produces data with a size and complexity that defies simple analysis.
• Clinicians without machine-learning expertise need access to intelligent software that permits at least a first-pass analysis of the diagnostic capabilities of data obtained from mass spectrometry.
361
MS Studies in Cancer Research: Types

Cancer Types: Ovarian, Prostate, Renal, Breast, Head & Neck, Lung, Pancreatic
Specimen Types: Blood Serum, Tissue Biopsy, Nipple Aspirate Fluid, Pancreatic Juice
362
MS Studies in Cancer Research: Problems

1. Lack of disclosure of key methods components
2. Overfitting
3. One-time partitioning of data
4. Lack of randomization when allocating to test/train sets
5. Lack of an appropriate performance metric
363
Data Source: Blood Serum

• Advantages
– Relatively non-invasive
– Easily obtained
– Access to most tissues in the body
– Screening possibilities
• Composition/Derivation
– Blood Plasma
• Protein Constituents
– Albumins
– Globulins
– Fibrinogen
– Low Molecular Weight (LMW) Proteins
364
Data Representation: Mass Spectrometry

• MALDI-TOF/SELDI-TOF¹
– Relatively little sample purification is required
– Direct measurement of proteins from serum, tissue, and other biological samples
– Relatively rapid analysis time
– Production of intact molecular ions with little fragmentation
– Detection of proteins with m/z ranging from 2,000 to 100,000 daltons
– Collection of useful spectra from complex mixtures
– Accuracies approaching 1 part in 10,000
• Data Characteristics
– Parameters: Mass/Charge (M/Z), Intensity
– Format: Continuous, Peak Detection

¹ Billheimer D., A Functional Data Approach to MALDI-TOF MS Protein Analysis
365
Data Analysis: Paradigm
366
Data Analysis: Preparations
a. Get Mass Spectra
b. Data Pre-Processing
• Baseline subtraction
• Peak detection [Coombes 2003]
• Feature Selection
• Normalization of intensities
• Peak alignment
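A minimal sketch of baseline subtraction, normalization, and peak detection on a synthetic spectrum. The rolling-minimum baseline estimate and SciPy's peak picker are stand-ins, not the study's actual algorithms.

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
mz = np.linspace(2000, 20000, 4000)
baseline = 50 * np.exp(-mz / 8000)                   # slow-varying baseline
true_peak = 200 * np.exp(-((mz - 9000) / 40) ** 2)   # one protein peak
spectrum = baseline + true_peak + rng.normal(0, 2, mz.size)

# (a) Baseline subtraction: crude rolling-minimum estimate.
window = 200
base_est = np.array([spectrum[max(0, i - window):i + window].min()
                     for i in range(spectrum.size)])
signal = spectrum - base_est

# (b) Normalize intensities to total ion current.
signal /= signal.sum()

# (c) Peak detection on the cleaned spectrum.
peak_idx, _ = find_peaks(signal, height=signal.max() * 0.5)
print(mz[peak_idx])  # should land near m/z 9000
```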
367
Data Analysis: Experimental Design
c. Classification: Parameter Optimization
368
Data Analysis: Classifiers
c. Classification: Classifiers
• KNN: optimize K
• SVM: optimize cost, kernel, gamma
– LSVM
– PSVM
– RBF-SVM
369
Preliminary Studies
1. Datasets
• Petricoin Ovarian
• Petricoin Prostate
• Adam Prostate
2. Feature Selection
• RFE
3. Experimental Design
• 10-fold nested cross-validation
4. Performance Metric
• ROC (rationale for selecting)
370
Preliminary Studies: Results

A. Average ROC values, by feature selection method

Dataset | Classifier | All Features | LSVM-RFE | PSVM-RFE
Adam_Prostate_070102 | KNN | 0.84425 | 0.96863 | 0.93739
Adam_Prostate_070102 | LSVM | 0.99460 | 0.99595 | 0.99796
Adam_Prostate_070102 | PSVM | 0.99659 | 0.99316 | 0.99747
Petricoin_Ovarian_021402 | KNN | 0.88150 | 0.91382 | 0.79238
Petricoin_Ovarian_021402 | LSVM | 0.95455 | 0.91898 | 0.85350
Petricoin_Ovarian_021402 | PSVM | 0.94409 | 0.91564 | 0.84032
Petricoin_Prostate_070302 | KNN | 0.85498 | 0.92219 | 0.83498
Petricoin_Prostate_070302 | LSVM | 0.92981 | 0.93026 | 0.85102
Petricoin_Prostate_070302 | PSVM | 0.93121 | 0.92788 | 0.83679

B. ROC range, by feature selection method

Dataset | Classifier | All Features | LSVM-RFE | PSVM-RFE
Adam_Prostate_070102 | KNN | 0.58847 - 0.97263 | 0.93157 - 1.00000 | 0.87928 - 0.99413
Adam_Prostate_070102 | LSVM | 0.98534 - 1.00000 | 0.99022 - 1.00000 | 0.99413 - 1.00000
Adam_Prostate_070102 | PSVM | 0.99120 - 0.99965 | 0.98240 - 1.00000 | 0.98925 - 1.00000
Petricoin_Ovarian_021402 | KNN | 0.50588 - 0.99545 | 0.62273 - 1.00000 | 0.40455 - 0.96471
Petricoin_Ovarian_021402 | LSVM | 0.80000 - 1.00000 | 0.49091 - 1.00000 | 0.52727 - 0.99412
Petricoin_Ovarian_021402 | PSVM | 0.75455 - 1.00000 | 0.50909 - 1.00000 | 0.43636 - 0.99412
Petricoin_Prostate_070302 | KNN | 0.59643 - 0.99667 | 0.66190 - 1.00000 | 0.28333 - 1.00000
Petricoin_Prostate_070302 | LSVM | 0.57143 - 1.00000 | 0.57262 - 1.00000 | 0.36667 - 1.00000
Petricoin_Prostate_070302 | PSVM | 0.59881 - 1.00000 | 0.59881 - 1.00000 | 0.31667 - 1.00000

C. Number of features selected, by feature selection method, range (avg.)

Dataset | All Features | LSVM-RFE | PSVM-RFE
Adam_Prostate_070102 | 779 - 779 (779) | 24 - 97 (65.2) | 97 - 389 (194.1)
Petricoin_Ovarian_021402 | 15154 - 15154 (15154) | 14 - 118 (49.9) | 14 - 1894 (1421.9)
Petricoin_Prostate_070302 | 15154 - 15154 (15154) | 14 - 236 (70.5) | 3 - 59 (16.8)

371
Case Study: Categorizing Text Into
Content Categories
Automatic Identification of Purpose and Quality of
Articles In Journals Of Internal Medicine
Yin Aphinyanaphongs M.S. ,
Constantin Aliferis M.D., Ph.D.
(presented in AMIA 2003)
372
Case Study: Categorizing Text Into
Content Categories
• The problem: classify PubMed articles as [high quality & treatment specific] or not
• Same function as the current Clinical Quality Filters of PubMed (in the treatment category)
373
374
Case Study: Categorizing Text Into
Content Categories

• Overview:
– Select Gold Standard
– Corpus Construction
– Document representation
– Cross-validation Design
– Train classifiers
– Evaluate the classifiers
375
Case Study: Categorizing Text Into
Content Categories

• Select Gold Standard:
– ACP Journal Club: expert reviewers strictly evaluate and categorize, in each medical area, articles from the top journals in internal medicine.
– Their mission is “to select from the biomedical literature those articles reporting original studies and systematic reviews that warrant immediate attention by physicians.”
• The treatment criteria (ACP Journal Club):
– “Random allocation of participants to comparison groups.”
– “80% follow up of those entering study.”
– “Outcome of known or probable clinical importance.”
• If an article is cited by the ACP, it is a high-quality article.
376
Case Study: Categorizing Text Into
Content Categories

• Corpus construction:
– Get all articles from the 49 journals in the study period (8/1998 to 12/2000).
– Review ACP Journal Club from 8/1998 to 12/2000 for articles that are cited by the ACP.
– 15,803 total articles, 396 positives (high-quality treatment-related)
377
Case Study: Categorizing Text Into
Content Categories

• Document representation:
– “Bag of words”
– Title, abstract, MeSH terms, publication type
– Term extraction and processing, e.g. “The clinical significance of cerebrospinal”:
1. Term extraction: “The”, “clinical”, “significance”, “of”, “cerebrospinal”
2. Stop word removal: “Clinical”, “Significance”, “Cerebrospinal”
3. Porter stemming (i.e. getting the roots of words): “Clinic*”, “Signific*”, “Cerebrospin*”
4. Term weighting: log frequency with redundancy
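The four processing steps can be sketched end-to-end. The tiny stop list and suffix stripper below are simplified stand-ins for a real stop-word list and the full Porter stemmer:

```python
import math
import re

STOP_WORDS = {"the", "of", "a", "an", "and", "in"}

def porter_like_stem(word):
    """Crude suffix stripping, standing in for full Porter stemming."""
    for suffix in ("ance", "ence", "al", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def represent(text):
    terms = re.findall(r"[a-z]+", text.lower())          # 1. term extraction
    terms = [t for t in terms if t not in STOP_WORDS]    # 2. stop word removal
    stems = [porter_like_stem(t) for t in terms]         # 3. stemming
    counts = {}
    for s in stems:
        counts[s] = counts.get(s, 0) + 1
    return {s: 1 + math.log(c) for s, c in counts.items()}  # 4. log-frequency weights

print(represent("The clinical significance of cerebrospinal"))
# {'clinic': 1.0, 'signific': 1.0, 'cerebrospin': 1.0}
```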
378
Case Study: Categorizing Text Into
Content Categories

• Cross-validation design: the 15,803 articles are split 80%/20%; the 20% is reserved as a test set, and the 80% is used for 10-fold cross-validation (train/validation folds) to measure error.
379
Case Study: Categorizing Text Into
Content Categories

• Classifier families
– Naïve Bayes (no parameter optimization)
– Decision Trees with Boosting (# of iterations = # of simple rules)
– Linear & Polynomial Support Vector Machines (cost from {0.1, 0.2, 0.4, 0.7, 0.9, 1, 5, 10, 20, 100, 1000}, degree from {1, 2, 3, 5, 8})
380
Case Study: Categorizing Text Into
Content Categories

• Evaluation metrics (averaged over 10 cross-validation folds):
– Sensitivity for fixed specificity
– Specificity for fixed sensitivity
– Area under ROC curve
– Area under 11-point precision-recall curve
– “Ranked retrieval”
381
Case Study: Categorizing Text Into
Content Categories
Classifier | Average AUC over 10 folds | Range over 10 folds | p-value compared to largest
LinSVM | 0.965 | 0.948 - 0.978 | 0.01
PolySVM | 0.976 | 0.970 - 0.983 | N/A
Naïve Bayes | 0.948 | 0.932 - 0.963 | 0.001
Boost Raw | 0.957 | 0.928 - 0.969 | 0.001
Boost Wght | 0.941 | 0.900 - 0.958 | 0.001
382
Case Study: Categorizing Text Into
Content Categories
Method | Sensitivity | Specificity | Precision | Number Needed to Read (avg.)
CQF | 0.96 | 0.75 | 0.071 | 14
PolySVM | 0.9673 (0.830-0.99) | 0.8995 (0.884-0.914) | 0.1744 (0.120-0.240) | 6

Method | Sensitivity | Specificity | Precision | Number Needed to Read (avg.)
CQF | 0.367 | 0.959 | 0.149 | 6.7
PolySVM | 0.8181 (0.641-0.93) | 0.959 (0.948-0.97) | 0.2816 (0.191-0.388) | 3.55
383
Case Study: Categorizing Text Into
Content Categories
[Figure: Clinical Query Filter performance; axis values not recoverable from the transcript]
384
Case Study: Categorizing Text Into
Content Categories
[Figure: Clinical Query Filter performance]
385
Case Study: Categorizing Text Into
Content Categories

• Alternative/additional approaches?
– Negation detection
– Citation analysis
– Sequence of words
– Variable selection to produce user-understandable models
– Analysis of ACPJ potential bias
– Others?
386
Supplementary: Case Study: Imputation
for Machine Learning Models For Lung
Cancer Classification Using Array
Comparative Genomic Hybridization
C.F. Aliferis M.D., Ph.D., D. Hardin Ph.D., P. P.
Massion M.D.
AMIA 2002
387
Case Study: A Protocol to Address the
Missing Values Problem

• Context:
– Array comparative genomic hybridization (array CGH): recently introduced technology that measures gene copy number changes of hundreds of genes in a single experiment. Gene copy number changes (deletion, amplification) are often characteristic of disease, and of cancer in particular.
– aCGH has been shown, in studies published during the last years, to enable development of powerful classification models, facilitate selection of genes for array design, and support identification of likely oncogenes in a variety of cancers (e.g., esophageal, renal, head/neck, lymphomas, breast, and glioblastomas).
– Interestingly, a recent study (Fritz et al., June 2002) has shown that aCGH enables better classification of liposarcoma differentiation than gene expression information.

388
Case Study: A Protocol to Address the
Missing Values Problem

• Context:
– While significant experience has been gathered in applying various machine learning/data mining approaches to develop diagnostic/classification models from gene expression microarray data for lung cancer, and from aCGH in a variety of cancers, little was known about the feasibility of using machine learning methods with aCGH data to create such models.
– In this study we conducted such an experiment for the classification of non-small-cell lung cancers (NSCLCs) as squamous carcinomas (SqCa) or adenocarcinomas (AdCa). A related goal was to compare several machine learning methods on this learning task.
389
Case Study: A Protocol to Address the
Missing Values Problem

• Context:
– DNA from tumors of 37 patients (21 squamous carcinomas (SqCa) and 16 adenocarcinomas (AdCa)) was extracted after microdissection and hybridized onto a 452 BAC clone array (printed in quadruplicate) carrying genes of potential importance in cancer.
– aCGH is a technology in formative stages of development. As a result, a high percentage of missing values was observed in most gene measurements.
390
Case Study: A Protocol to Address the
Missing Values Problem

• We decided to create a protocol for gene inclusion/exclusion for analysis on the basis of three criteria:
– (a) percentage of missing values,
– (b) a priori importance of a gene (based on known functional role in pathways that are implicated in carcinogenesis, such as the PI3-kinase pathway), and
– (c) whether the existence of missing values was statistically significantly associated with the class to be predicted (at the 0.05 level, determined by a G2 test).
391
Case Study: A Protocol to Address the
Missing Values Problem
1. For each gene Gi compute an indicator variable MGi s.t. MGi is 1 in cases where Gi is missing, and 0 in cases where Gi was observed.
2. Compute the association of MGi with the class variable C, assoc(MGi, C), for every i. (C takes values in {SqCa, AdCa}.)
3. Accept a set of important genes I.
4. if assoc(MGi, C) is statistically significant then
       reject gene Gi
   else if Gi ∈ I then
       accept Gi
   else if fraction of missing values of Gi > 15% then
       reject Gi
   else
       accept Gi

• 388 variables were selected according to this protocol and were imputed before analysis.
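The protocol translates directly into code. In this sketch the G2 test is approximated by a chi-square test of independence, and the data is a hypothetical toy matrix:

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_genes(X, y, important, max_missing=0.15, alpha=0.05):
    """X: cases x genes, with np.nan marking missing values; y: class labels;
    important: indices of a-priori important genes (step 3 of the protocol)."""
    accepted = []
    for g in range(X.shape[1]):
        m = np.isnan(X[:, g]).astype(int)          # indicator M_Gi (step 1)
        if m.sum() in (0, m.size):                 # never/always missing:
            p = 1.0                                # no association testable
        else:
            table = np.array([[np.sum((m == v) & (y == c))
                               for c in np.unique(y)] for v in (0, 1)])
            p = chi2_contingency(table)[1]         # assoc(M_Gi, C) (step 2)
        if p < alpha:
            continue                               # missingness predicts class: reject
        if g in important or m.mean() <= max_missing:
            accepted.append(g)                     # accept branches of step 4
    return accepted

# Toy example (hypothetical data, 20 cases x 3 genes):
X = np.ones((20, 3)); y = np.array([0] * 10 + [1] * 10)
X[y == 1, 0] = np.nan            # gene 0: missing only in class 1
X[[0, 1, 10, 11], 2] = np.nan    # gene 2: 20% missing, class-balanced
print(select_genes(X, y, important=set()))   # [1]
print(select_genes(X, y, important={2}))     # [1, 2]
```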
392
Case Study: A Protocol to Address the
Missing Values Problem

• K-Nearest Neighbors (KNN) method for imputation:
– For each instance of a gene that had a missing value, the case closest to the case containing that missing value (i.e., the closest neighbor) that did have an observed value for the gene was found using Euclidean Distance (ED). That value was substituted for the missing one.
– To compute distances between cases with missing values, if one of the two corresponding gene measurements was missing, the mean of the observed values for this gene across all cases was used to compute the ED component.
– When both values were missing, the mean observed difference was used for the ED component.
• The above procedure was chosen because it is non-parametric and multivariate. More naïve approaches (such as imputing with the mean or with a random value from the observed distribution for the gene) typically produced uniformly worse models.
• A variant of the above method iterates the KNN imputation until convergence is attained.
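A sketch of the 1-NN variant of this imputation. For simplicity the both-missing case contributes 0 to the distance here, where the study used the mean observed difference:

```python
import numpy as np

def knn_impute(X):
    """1-NN imputation: fill each missing value from the closest case
    (modified Euclidean distance) that observed that gene."""
    X = np.asarray(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    miss = np.isnan(X)

    def dist(a, b):
        d = a - b
        # One of the two values missing: substitute that gene's observed mean.
        d = np.where(np.isnan(a) & ~np.isnan(b), col_mean - b, d)
        d = np.where(~np.isnan(a) & np.isnan(b), a - col_mean, d)
        # Both missing: the remaining NaNs count as 0 (simplification).
        return np.sqrt(np.nansum(d * d))

    filled = X.copy()
    for i, j in zip(*np.where(miss)):
        donors = np.flatnonzero(~miss[:, j])   # cases that observed gene j
        nearest = donors[np.argmin([dist(X[i], X[k]) for k in donors])]
        filled[i, j] = X[nearest, j]
    return filled

# Toy example: the missing value in case 1 is copied from its nearest
# neighbor, case 0.
print(knn_impute(np.array([[1.0, 2.0], [1.0, np.nan], [10.0, 20.0]])))
```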
393
Case Study: A Protocol to Address the
Missing Values Problem


• Important note: a clear understanding of what types of missing values we have, and of the mechanisms that generate them, is required; it sometimes translates into the ability to effectively replace missing values as well as avoid serious biases.
• Example types of missing values:
– Value is not produced by device or respondent, etc.
– Value was produced but not entered in the study database.
– Value was produced and entered but is deemed invalid on substantive grounds.
– Value was produced and entered but was subsequently corrupted due to transmission/storage or data conversion operations, etc.
• Example where knowing the process that generates missing values may be sufficient to fill them in: a physician does not order a test (say, an x-ray) because an equivalent or more informative test has been conducted (e.g., MRI), or because it is physiologically impossible for a change to have occurred since the last measurement, etc.
394