A comparison of PLS-based and other dimension reduction methods for tumour classification using microarray data

Cameron Hurst - Institute of Health and Biomedical Innovation, Queensland University of Technology
Janet Chaseling - Griffith University
Michael Steele - Bond University

CRICOS No. 000213J
Queensland University of Technology

Developments in ‘omics
• Developments in genomics, proteomics and metabolomics have generated huge amounts of data
• Of particular interest in this study are large-scale microarray studies

a university for the real world® CRICOS No. 000213J

Microarray data
• Microarray datasets involve the expression levels of many genes (circa 1-50K)
• Gene expression data can be thought of as representing how much a gene is turned off/on
• To date, microarray studies have generally involved a small number of patients (<100)

Microarray studies
• There are a number of reasons why microarray studies are conducted
• I will focus on the area of cancer classification studies
• However, the techniques I am evaluating readily translate to other types of ‘omic classification studies

Objectives in cancer studies based on microarrays
• There are three main types of objectives in microarray experiments involving different cancers:
  1. Identification of genes that are differentially expressed among cancer classes
  2. Class discovery
  3. Class prediction (e.g. tumour classification)
• It is this last objective that is the focus of my study
  – However, the first objective can also be examined in this type of study

Tumour class classification
[Diagram: data matrix X (n samples x p genes), with rows 1..n and columns 1..p, alongside response Y (n samples x 1) giving each sample's tumour class, from Class 1 to Class k]
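
In symbols, the data layout on the slide above could be written roughly as follows. The names y (for the label vector) and f (for the classification rule) are notation assumed here for illustration and do not appear on the slides.

```latex
X \in \mathbb{R}^{n \times p}, \qquad
y \in \{1, \dots, k\}^{n}, \qquad
f : \mathbb{R}^{p} \to \{1, \dots, k\}
```

Row i of X holds the p gene-expression values for sample i, y_i is that sample's tumour class, and class prediction amounts to learning a rule f from (X, y) when n is under 100 and p runs to thousands of genes.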

Analytical problems in tumour classification
• The large number of genes and small number of subjects in microarray datasets present problems for many traditional statistical methods
• This ‘small n, large p’ problem has led to two main approaches to the analysis of these data:
  – Machine learning methods
  – Dimension reduction methods
• Which approach is used has usually depended on the discipline of the analyst (informatics or statistics)

Many methods but limited knowledge
• A large number of methods have been proposed for tumour classification
• Many previous studies comparing tumour classification methods have not been systematic about comparing ‘classes’ of methods
  – i.e. previous comparisons have differed in a number of ways, so it is unclear which properties of the methods account for their relative strengths and weaknesses
• So there is a lack of knowledge about the reasons for differences in performance

Techniques considered
• The present study compares three ‘classes’ of methods for tumour classification.
1. Indirect methods
  a) Principal Component Analysis (PCA) DFA
  b) Spectral Map Analysis (SMA) DFA
2. Canonical (direct) ordination methods
  a) Redundancy Analysis (RDA)
  b) Canonical Correspondence Analysis (CCoA)
  c) Canonical Analysis of Principal Coordinates (CAP)

Techniques considered (continued)
• The within-class differences among these techniques are really about the analysis space employed.
• For example, the main difference between the canonical ordination methods Redundancy Analysis and Canonical Correspondence Analysis is that the former eigen-decomposes a Euclidean dissimilarity matrix and the latter a χ² dissimilarity matrix
  – i.e. differences in their performance reflect implicit data standardizations

Techniques considered (continued)
3. PLS-based methods
  a) PLS data reduction followed by Linear Discriminant Analysis (PLS)
  b) PLS data reduction followed by a ridge-penalized logistic regression: ridge PLS (rPLS)
  c) PLS data reduction followed by iteratively reweighted least squares regression: generalized PLS (gPLS)
• All of these methods involve a two-step process in which PLS components are derived (step 1) and subsequently used to train a classification rule (step 2)
• In this respect, these methods differ only in the classification rule used in step 2

[Diagram: STEP 1: PLS dimension reduction, shown with X (n samples x p genes) and Y (n samples x (k-1), tumour class). STEP 2: Discriminant Analysis, shown with T (n samples x m components) and Y (n samples x (k-1), tumour class). m << p]

Comparison of methods
• All eight methods were run on three benchmark datasets
  – Colon cancer
    • p = 2000 genes; n = 62 [n_cases = 40 + n_controls = 22]
  – Small Round Blue Cell Tumours [SRBCT]
    • p = 2308 genes; n = 83 [distributed among 4 classes]
  – Brain tumours
    • p = 5597 genes; n = 42 [distributed among 5 classes]

Comparison of methods (continued)
• Effectiveness of classification rules was gauged using misclassification rates based on Leave-One-Out Cross-Validation
• As the number of components retained represents a meta-parameter for both the indirect and PLS-based methods, misclassification rates were evaluated using 2, 4 and 6 components.
• Methods were also run on datasets that either had or had not been reduced using a priori feature selection (Significance Analysis of Microarrays)

Example: Results for SRBCT dataset
[Figure: results for the eight methods (PLS, rPLS, gPLS, PCA, SMA, RDA, CCoA, CAP), vertical axis 0 to 0.8, grouped by gene set (p<0.01, full) and number of components (2D, 4D, 6D)]
• Here, PLS methods outperformed all other methods.
• Indirect methods performed comparably only where either a priori feature selection was employed or a larger number of components was retained
• PLS misclass. < Canonical ordination misclass. < Indirect misclass.

Results
• Indirect methods performed comparably only when non-differentially expressed genes were removed prior to classification rule training
• Removal of non-differentially expressed genes made little difference to PLS-based methods

Results (continued)
• Canonical methods were highly inconsistent in their performance across datasets, varying from quite effective for the colon and SRBCT datasets to very poor (worse than indirect methods) for the brain tumour dataset.

Results (continued)
• PLS-based methods generally performed better than both indirect and canonical ordination methods
• In most cases, rPLS and gPLS were superior to PLS using linear discriminant functions

Conclusions
• Indirect methods should be restricted to a class discovery role (e.g. examining gene-gene and gene-patient interactions)
• Relative performance of techniques tended to align with the ‘class’ of the method (indirect, canonical ordination or PLS-based).
  – That is, the analysis space of the method (i.e. within the indirect methods, or within the canonical ordination methods) tended to make little difference to misclassification rates

Further work
• It is not clear which properties of the various datasets lead to inconsistencies in the performance of the classification methods
• Protocols to systematically evaluate microarray classification methods need further development
  – The most promising avenue is likely to involve simulated microarray datasets, allowing aspects of the data to be systematically varied so that the strengths and limitations of the methods can be directly linked to data properties.

Further work (continued)
• The focus in this study has been solely on dimension reduction methods
• Any comprehensive comparative study should include promising machine learning methods

Thank you
Questions?
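
The two-step PLS procedure and the leave-one-out evaluation described in the comparison slides could be sketched roughly as below. This is a minimal pure-Python illustration under several assumptions, not the authors' implementation: it handles two classes only, keeps a single PLS component (the slides use 2, 4 and 6), and uses a nearest-class-mean rule on the PLS score as a simple stand-in for the discriminant step. The function names and the toy data are invented for the example.

```python
import math


def first_pls_weights(X, y):
    """Step 1: column means of X and the unit-norm first PLS weight vector.

    For a single response, the first PLS weight vector is proportional to
    Xc^T yc, where Xc and yc are the mean-centred data and response.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [
        sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
        for j in range(p)
    ]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0  # guard against a zero vector
    return col_means, [v / norm for v in w]


def pls_score(row, col_means, w):
    """Project one sample onto the first PLS component."""
    return sum((x - m) * wj for x, m, wj in zip(row, col_means, w))


def loocv_misclassification(X, y):
    """Leave-one-out misclassification rate for the two-step rule.

    Both steps are refitted on each training fold, so the held-out sample
    never influences the PLS weights or the class means.
    """
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]

        # Step 1: derive the PLS component from the training samples only.
        col_means, w = first_pls_weights(X_train, y_train)
        t_train = [pls_score(row, col_means, w) for row in X_train]

        # Step 2 (stand-in classifier): mean PLS score of each class.
        class_means = {}
        for label in set(y_train):
            scores = [t for t, lab in zip(t_train, y_train) if lab == label]
            class_means[label] = sum(scores) / len(scores)

        # Predict the held-out sample as the class with the nearest mean score.
        t_new = pls_score(X[i], col_means, w)
        predicted = min(class_means, key=lambda c: abs(t_new - class_means[c]))
        if predicted != y[i]:
            errors += 1
    return errors / len(X)


# Toy data (invented): 6 samples x 4 "genes"; the first gene separates the classes.
X = [
    [5.0, 1.0, 2.0, 0.0],
    [6.0, 2.0, 1.0, 1.0],
    [5.5, 1.5, 2.5, 0.5],
    [1.0, 1.0, 2.0, 1.0],
    [0.5, 2.0, 1.5, 0.0],
    [1.5, 1.5, 1.0, 0.5],
]
y = [1, 1, 1, 0, 0, 0]

print(loocv_misclassification(X, y))
```

Refitting step 1 inside every fold is the design point worth noting: because PLS uses the class labels to build its components, deriving them once on the full dataset would leak information about the held-out sample into the reported misclassification rate.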