A comparison of PLS-based and other
dimension reduction methods for tumour
classification using microarray data
Cameron Hurst
Institute of Health and Biomedical Innovation,
Queensland University of Technology
Janet Chaseling - Griffith University
Michael Steele - Bond University
CRICOS No. 000213J
Queensland University of Technology
Developments in ‘omics
• Developments in genomics, proteomics and
metabolomics have generated huge
amounts of data
• Of particular interest in this study are large scale
microarray studies
a university for the real world®
Microarray data
• Microarray datasets involve the expression
levels of many genes (circa 1-50K)
• Gene expression data can be thought of as
representing how much a gene is turned
on or off
• To date, microarray studies generally involve
a small number of patients (<100)
Microarray studies
• There are a number of reasons why microarray
studies are conducted
• I will focus on the area of cancer classification
studies
• However, the techniques I am evaluating readily
translate to other types of ‘omic classification
studies
Objectives in cancer studies based on
microarrays
• There are three main types of objectives in
microarray experiments involving different cancers:
1. Identification of genes that are differentially expressed
among cancer classes
2. Class discovery
3. Class prediction (e.g. Tumour classification)
• It is this last objective that is the focus of my study
– However, the first objective can also be examined in this
type of study
Tumour class classification
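[Figure: data layout for tumour classification. X is an n samples x p genes expression matrix (rows 1..n, columns 1..p); Y is an n samples x 1 vector giving the tumour class of each sample (Class 1, ..., Class k)]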
Analytical problems in tumour
classification
• The large numbers of genes and small number of
subjects in microarray datasets present problems
for many traditional statistical methods
• This ‘small n, large p’ problem has led to two main
approaches to the analysis of this data:
– Machine learning methods
– Dimension reduction methods
• Which approach is used has usually depended on
the discipline of the analyst (informatics or
statistics)
Many methods but limited knowledge
• A large number of methods have been proposed for
tumour classification
• Many previous studies comparing tumour classification
methods have not been systematic about comparing
‘classes’ of methods
– i.e. previous comparisons have differed in a number of ways, so it is unclear
which properties of the methods account for their relative strengths and
weaknesses
• So there is a lack of knowledge about the reasons for
differences in performance
Techniques considered…..
• The present study compares three
‘classes’ of methods for tumour classification.
1. Indirect methods
a) Principal Component Analysis (PCA) → DFA
b) Spectral Map Analysis (SMA) → DFA
2. Canonical (direct) ordination methods
a) Redundancy Analysis (RDA)
b) Canonical Correspondence Analysis (CCoA)
c) Canonical Analysis of Principal Coordinates (CAP)
Techniques considered…..
• The within-class differences among these
techniques are really about the analysis space
employed.
• For example, the main difference between the
canonical ordination methods Redundancy Analysis
and Canonical Correspondence Analysis is that the
former eigen-decomposes a Euclidean dissimilarity
matrix and the latter a χ² dissimilarity matrix
– i.e. differences in their performance reflect implicit data
standardizations
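The 'implicit standardization' point can be illustrated with a minimal sketch that compares the two dissimilarities for a pair of samples. The toy matrix and function names below are assumptions for illustration only, not material from the study. The χ² distance is written in its usual correspondence-analysis form, which compares row profiles (each row divided by its total) and weights each gene by the inverse of its column mass.

```python
import math

def euclidean_distance(a, b):
    """Plain Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chi_square_distance(matrix, i, j):
    """Chi-square distance between rows i and j of a non-negative matrix.

    Rows are converted to profiles (row / row total) and each squared
    difference is divided by the column mass (column total / grand total).
    """
    n_cols = len(matrix[0])
    grand_total = sum(sum(row) for row in matrix)
    col_mass = [sum(row[k] for row in matrix) / grand_total for k in range(n_cols)]
    row_i, row_j = matrix[i], matrix[j]
    total_i, total_j = sum(row_i), sum(row_j)
    return math.sqrt(sum(
        (row_i[k] / total_i - row_j[k] / total_j) ** 2 / col_mass[k]
        for k in range(n_cols)
    ))

# Hypothetical toy data: sample 1 is sample 0 with every gene doubled,
# sample 2 has a different expression pattern.
expr = [
    [10.0, 20.0, 30.0],
    [20.0, 40.0, 60.0],
    [30.0, 20.0, 10.0],
]

d_euclid_01 = euclidean_distance(expr[0], expr[1])
d_chi_01 = chi_square_distance(expr, 0, 1)
d_chi_02 = chi_square_distance(expr, 0, 2)
```

In this toy case the Euclidean distance treats samples 0 and 1 as far apart because of their overall intensity, while the χ² distance treats them as identical because their profiles match. That is the kind of implicit standardization the slide refers to.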
Techniques considered…..
3. PLS-based methods
a) PLS data reduction followed by Linear Discriminant Analysis
(PLS)
b) PLS data reduction followed by a ridge-penalized logistic regression:
ridge PLS (rPLS)
c) PLS data reduction followed by iteratively reweighted least squares
regression: generalized PLS (gPLS)
• All of these methods involve a two-step process in which PLS
components are derived (step 1) and subsequently used
to train a classification rule (step 2)
• In this respect, these methods differ only in the classification
rule used in step 2
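The two-step idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it handles only two classes (so Y is a single 0/1 column), extracts only the first PLS component, and uses a nearest-class-mean rule on the component score as a simple stand-in for the discriminant step. All function names and the toy data are hypothetical.

```python
import math

def first_pls_component(X, y):
    """Step 1: derive the first PLS component for a 0/1 class indicator y.

    The weight vector is proportional to the covariance between each
    centred gene and the centred class indicator.
    Returns (column means, unit-length weights, training scores).
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    scores = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    return col_means, w, scores

def train_rule(scores, y):
    """Step 2: store the mean component score of each class."""
    mean0 = sum(s for s, lab in zip(scores, y) if lab == 0) / y.count(0)
    mean1 = sum(s for s, lab in zip(scores, y) if lab == 1) / y.count(1)
    return mean0, mean1

def predict(x_new, col_means, w, rule):
    """Project a new sample onto the component and pick the nearer class mean."""
    score = sum((x_new[j] - col_means[j]) * w[j] for j in range(len(w)))
    mean0, mean1 = rule
    return 0 if abs(score - mean0) <= abs(score - mean1) else 1

# Hypothetical toy data: 6 samples x 4 genes; genes 0 and 1 differ by class.
X = [
    [1.0, 5.0, 3.0, 2.0],
    [1.2, 4.8, 2.9, 2.1],
    [0.9, 5.1, 3.1, 1.9],
    [3.0, 2.0, 3.0, 2.0],
    [3.1, 2.2, 2.8, 2.2],
    [2.9, 1.9, 3.2, 1.8],
]
y = [0, 0, 0, 1, 1, 1]

col_means, w, scores = first_pls_component(X, y)
rule = train_rule(scores, y)
predicted = [predict(row, col_means, w, rule) for row in X]
```

Swapping `train_rule` and `predict` for a different classifier while leaving `first_pls_component` untouched mirrors the point above: the PLS-based methods share step 1 and differ in step 2.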
[Figure: the two-step process.
STEP 1: PLS dimension reduction. X (n samples x p genes) and Y (n samples x (k-1), tumour class) are reduced to T (n samples x m components), with m << p.
STEP 2: Discriminant Analysis. T (n samples x m components) and Y (n samples x (k-1), tumour class, Class 1 ... Class k) are used to train the classification rule.]
Comparison of methods…..
• All eight methods were run on three benchmark
datasets
– Colon cancer
• p = 2000 genes; n = 62 [40 cases + 22 controls]
– Small Round Blue Cell Tumours [SRBCT]
• p = 2308 genes; n = 83 [distributed among 4 classes]
– Brain tumours
• p = 5597 genes; n = 42 [distributed among 5 classes]
Comparison of methods……
• Effectiveness of classification rules was gauged
using misclassification rates based on leave-one-out cross-validation
• As the number of components retained represents a
meta-parameter for both the indirect and PLS-based
methods, misclassification rates were evaluated using 2,
4 and 6 components.
• Methods were also run on datasets both with and
without reduction by a priori feature
selection (Significance Analysis of Microarrays)
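A leave-one-out misclassification rate can be computed along these lines. This is a sketch only: the nearest-centroid classifier is a hypothetical stand-in for any of the methods compared, and the function names and toy data are assumptions rather than anything taken from the study.

```python
def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean profile per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def sq_dist(centre):
        return sum((a - b) ** 2 for a, b in zip(x, centre))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def loocv_misclassification_rate(X, y, fit, predict):
    """Hold out each sample in turn, refit on the rest, count wrong predictions."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)

# Hypothetical toy data: the last sample is labelled "A" but sits among the "B"s.
X = [
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [3.0, 3.0], [3.1, 2.9], [2.9, 3.2],
    [2.9, 3.0],
]
y = ["A", "A", "A", "B", "B", "B", "A"]

rate = loocv_misclassification_rate(X, y, fit_centroids, predict_centroid)
```

Here the whole model is refit in every fold. The slides do not state whether dimension reduction and feature selection were also repeated inside each fold, so that detail is left open in this sketch.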
Example: Results for SRBCT dataset
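[Figure: misclassification rate (vertical axis, 0 to 0.8) for PLS, rPLS, gPLS, PCA, SMA, RDA, CCoA and CAP on the SRBCT dataset, using 2, 4 and 6 components (2D, 4D, 6D), each with the full gene set and with genes selected at p<0.01]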
• Here, PLS methods outperformed all other methods.
• Indirect methods only performed comparably where either a priori
feature selection was employed, or where a larger number of
components were retained
• PLS misclass. < Canonical ordination misclass. < Indirect misclass.
Results…
• Indirect methods performed comparably only
when non-differentially expressed genes were
removed prior to classification rule training
• Removal of non-differentially expressed genes made
little difference to PLS-based methods
Results….
• Canonical methods were highly inconsistent in
their performance across datasets, varying from
quite effective in the colon and SRBCT datasets
to very poor (worse than indirect methods) for
the brain tumour dataset.
Results…..
• PLS-based methods generally performed better
than both indirect and canonical ordination
methods
• In most cases, rPLS and gPLS were superior to
PLS using linear discriminant functions
Conclusions
• Indirect methods should be restricted to a class
discovery role (e.g. examining gene-gene and
gene-patient interactions)
• Relative performance of techniques tended to align
to the ‘class’ of the method (Indirect, Canonical
ordination or PLS-based methods).
– That is, within a class (i.e. among the indirect
methods, or among the canonical ordination methods), the
analysis space of the method tended to make little
difference to misclassification rates
Further work….
• It is not clear which properties of the various
datasets lead to inconsistencies in the
performance of the classification methods
• Protocols to systematically evaluate microarray
classification methods need further development
– The most promising avenue is likely to involve simulated
microarray datasets, allowing aspects of the data to be
systematically varied so that strengths/limitations of the
methods can be directly linked to data properties.
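As a rough idea of what such a simulation might look like, the sketch below generates a two-class 'small n, large p' dataset in which the number of differentially expressed genes and their effect size can be varied. The Gaussian noise model, the parameter names and the default values are all assumptions made for illustration. Real microarray data have correlation structure and noise properties that this sketch ignores.

```python
import random

def simulate_microarray(n_per_class, p, n_informative, effect_size, seed=0):
    """Simulate a two-class expression matrix with independent N(0, 1) noise.

    The first n_informative genes are shifted by effect_size in class 1;
    all other genes carry no class signal.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += effect_size
            X.append(row)
            y.append(label)
    return X, y

def class_mean_difference(X, y, genes):
    """Average (class 1 minus class 0) mean expression over the given genes."""
    diffs = []
    for j in genes:
        mean0 = sum(row[j] for row, lab in zip(X, y) if lab == 0) / y.count(0)
        mean1 = sum(row[j] for row, lab in zip(X, y) if lab == 1) / y.count(1)
        diffs.append(mean1 - mean0)
    return sum(diffs) / len(diffs)

# Hypothetical settings: 40 samples, 500 genes, 10 of them informative.
X, y = simulate_microarray(n_per_class=20, p=500, n_informative=10, effect_size=2.0)
signal = class_mean_difference(X, y, range(10))
noise = class_mean_difference(X, y, range(10, 500))
```

Varying `n_per_class`, `p`, `n_informative` and `effect_size` one at a time, and recording the misclassification rate of each method, would be one way to link method performance to specific data properties.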
Further work….
• The focus in this study has been solely on
dimension reduction methods
• Any comprehensive comparative study should
include promising machine learning methods
Thank you
Questions???