Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
From Nature Medicine 7(6) 2001, by Javed Khan et al.
(Summarized by Marcílio Souto – ICMC/USP São Carlos)

Abstract
• Small, round blue-cell tumors (SRBCTs)
• Four distinct categories that are hard to discriminate
• cDNA microarrays and Artificial Neural Networks (ANNs)
• Tumor diagnosis and the identification of candidate targets for therapy

The Problem
• SRBCTs of childhood
  – Neuroblastoma (NB)
  – Rhabdomyosarcoma (RMS)
  – Non-Hodgkin lymphoma (NHL)
  – The Ewing family of tumors (EWS)
• All four categories look similar in routine histology
• Accurate diagnosis is essential
• In clinical practice
  – Immunohistochemistry: the detection of protein expression
  – Reverse transcription-PCR: tumor-specific translocations
  – EWS-FLI1 in EWS and PAX3-FKHR in alveolar RMS (ARMS)

The Approach
• Gene-expression profiling using cDNA microarrays
  – A simultaneous analysis of multiple markers
  – Multiple categorical distinctions
• Artificial neural networks (ANNs), with earlier clinical applications such as
  – Diagnosing myocardial infarcts
  – Diagnosing arrhythmias from electrocardiograms
  – Interpreting radiographs
  – Interpreting magnetic resonance images

The Experiment
• cDNA microarray with 6,567 genes
• 63 training examples
  – Tumor biopsy material
  – Cell lines
• Filtering for a minimal level of expression
  – 2,308 genes remain
• PCA further reduced the dimensionality
  – The 10 dominant PCA components were used (63% of the variance in the data matrix)
• Three-fold cross-validation
• 3,750 ANNs were constructed (average vote)
• No overfitting and zero classification error on the training samples

Data Sets
Number of samples per cancer type (BL: Burkitt lymphoma, a non-Hodgkin lymphoma)

Cancer type              Train and validation    Test
I   (EWS)                         23               6
II  (RMS)                         20               5
III (NB)                          12               6
IV  (BL)                           8               3
Unlabeled (non-SRBCT)              0               5
Total                             63              25

The Schematic View of the Analysis Process
[Figure: schematic view of the analysis process]

Data Analysis
• Initial Cuts
• Principal Components Analysis
• Artificial Neural Network Prediction
• Extraction of Relevant Genes

Data Analysis: Initial Cuts and PCA
• Initial Cuts
  – A gene is omitted if, for any of the samples, its red intensity (ri) is less than 20
  – From 6,567 to 2,308 genes
• Principal Components Analysis (PCA)
  – Reduce the dimensionality of the data to 10 components: 2,308 genes to 10 inputs
  – This number (10) was found by means of preliminary experiments

Data Analysis: Artificial Neural Network (1/3)
• Architecture and Parameters
  – Linear Perceptron (LP)
  – 10 inputs representing the PCA components
  – 4 output nodes, one for each class of tumor (EWS, BL, NB and RMS)
  – 44 free parameters (10 × 4 weights plus four threshold units)
• Calibration (training) was performed using JETNET
  – Learning rate = 0.7; momentum = 0.3
  – Learning rate decreased after each epoch (factor 0.99)
  – Initial weights randomly chosen from [-r, r], with r = 0.1/F
  – Weights updated after every 10 epochs
  – At most 100 epochs

Data Analysis: Artificial Neural Network (2/3)
• Calibration and Validation
  – 3-fold cross-validation
  – The 63 labeled samples are randomly shuffled and split into 3 equally sized groups
  – The network is trained on two of these groups and the third is used for validation
  – This procedure is repeated 3 times, so each group is used for validation once
  – The random shuffling is redone 1,250 times
  – 3,750 networks in total (1,250 × 3)
• For validation, the output for a sample is the average over the 1,250 networks that had it in their validation group (a committee)
• For test samples, the committee is formed with all 3,750 networks
  – 25 samples in the test set

Data Analysis: Artificial Neural Network (3/3)
• Assessing the quality of classifications
  – Each sample is classified as belonging to the cancer type with the largest average committee vote
  – The system must be able to reject the second-largest vote, as well as samples that do not belong to any of the classes
  – A distance is defined from a sample to the ideal vote for each cancer type
  – Based on the validation set, an empirical distribution of this distance is generated for each type of cancer
  – For a given test sample, the system can reject a possible classification based on these probability distributions
• Note: the classification, as well as the extraction of important genes, converges using fewer than 100 networks
  – The only reason 3,750 networks were used is to have sufficient statistics for these empirical probability distributions

Relevant Gene Extraction
• To select relevant genes, the authors proposed a sensitivity measure (S) of the outputs (o) with respect to any of the 2,308 input variables, summed over the samples and outputs
  – All 3,750 networks are involved
• They also proposed a related measure for a single output
• Thus, the genes can be ranked according to their importance for the total classification, and also according to their importance for each disease separately
• They explored using the top 6, 12, 24, 48, 96, 192, 384, 768 and 1,536 genes
  – For each choice, training (calibration) was redone

Summed Square Error Graph
[Figure: summed square error graph]

Optimization of Genes Utilized for Classification
• Using the 3,750 trained models, rank all genes according to their significance for the classification
• Determine the classification error rate using an increasing number of these ranked genes

Recalibrating the ANNs
• Using only 96 genes, the analysis process was repeated
• Zero classification error

Diagnostic Classification
• 25 test examples (5 non-SRBCTs)
• If a sample falls outside the 95th percentile of the probability distribution of distances between samples and their ideal output, its diagnosis is rejected

Multi-Dimensional Scaling (MDS)
• Using 96 genes

Hierarchical Clustering of 96 Genes
• 93 unique genes (IGF2 appears 3 times and MYC twice)
• 13 ESTs
• 41 genes have not been reported as associated with these diseases
• Perfect clustering of the four categories

Expression of FGFR4 on SRBCT Tissue Array
• At the protein level: immunohistochemistry on SRBCT tissue arrays for the expression of fibroblast growth factor receptor 4 (FGFR4)
• FGFR4
  – Expressed during myogenesis (not in adult muscle)
  – Potential role in tumor growth
  – Prevention of terminal differentiation in muscle
• Strong cytoplasmic immunostaining for FGFR4 was seen in all 26 RMSs tested

Discussion
• Current diagnoses of tumors rely on histology (morphology) and immunohistochemistry (protein expression)
• Using cDNA microarrays
  – Multiple markers (robust)
  – Reveal the underlying genetic aberrations or biological processes
• Tumors and cell lines
  – Cell lines for ANN calibration

Reference
• J. Khan et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks”, Nature Medicine, Vol. 7, Number 6, June 2001, and the references therein.
• Analysis Methods Supplement for Nature Medicine, Vol. 7, Number 6, June 2001.
  – http://medicine.nature.com
• M. Ringner, C. Peterson and J. Khan, “Analyzing array data using supervised methods”, Pharmacogenomics, Vol. 3, Number 3, 2003.
• NIH News Release: “Gene Chips Accurately Diagnose Four Complex Childhood Cancers: Artificial Intelligence Used With Gene Expression Microarrays for the First Time”.
  – http://www.nih.gov/news/pr/may2001/nhgri-30.htm
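
The committee procedure from the Data Analysis slides (PCA to 10 inputs, a linear perceptron with 4 outputs, repeated 3-fold cross-validation, averaged votes, and rejection beyond the 95th percentile of distances) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the slides: the data are synthetic, a closed-form least-squares fit stands in for JETNET's iterative calibration, only 50 shuffles are used instead of 1,250, PCA is fitted on the training samples only, and the distance to the ideal vote is taken to be Euclidean. The names `pca_fit`, `fit_linear`, `predict` and `committee_diagnose` are hypothetical helpers, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 63 samples, 200 "genes", 4 classes.
n_classes, n_genes, n_pcs = 4, 200, 10
class_sizes = [23, 20, 12, 8]            # training counts listed in the Data Sets slide
y = np.repeat(np.arange(n_classes), class_sizes)
centers = rng.normal(0.0, 2.0, size=(n_classes, n_genes))
X = centers[y] + rng.normal(0.0, 1.0, size=(y.size, n_genes))

def pca_fit(X, k):
    """Return the data mean and the top-k principal directions (k x n_genes)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def fit_linear(Z, T):
    """Least-squares linear map with a bias column (stand-in for JETNET training)."""
    A = np.hstack([Z, np.ones((Z.shape[0], 1))])
    W, *_ = np.linalg.lstsq(A, T, rcond=None)
    return W                              # shape (k + 1, n_classes): 44 parameters here

def predict(Z, W):
    return np.hstack([Z, np.ones((Z.shape[0], 1))]) @ W

mean, comps = pca_fit(X, n_pcs)
Z = (X - mean) @ comps.T                  # 63 x 10 network inputs
T = np.eye(n_classes)[y]                  # one-hot "ideal votes"

n_shuffles, n_folds = 50, 3               # the slides use 1,250 shuffles
val_sum = np.zeros((y.size, n_classes))
models = []
for _ in range(n_shuffles):
    folds = np.array_split(rng.permutation(y.size), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        W = fit_linear(Z[train_idx], T[train_idx])
        val_sum[val_idx] += predict(Z[val_idx], W)
        models.append(W)

# Each sample lands in a validation group exactly once per shuffle.
val_vote = val_sum / n_shuffles
val_pred = val_vote.argmax(axis=1)
val_dist = np.linalg.norm(val_vote - T, axis=1)   # distance to the ideal vote

# Per-class 95th percentile of validation distances, used as a rejection cutoff.
cutoff = np.array([np.percentile(val_dist[y == c], 95) for c in range(n_classes)])

def committee_diagnose(x_new):
    """Average the outputs of all models; return (predicted class, accepted?)."""
    z = (x_new - mean) @ comps.T
    vote = np.mean([predict(z[None, :], W)[0] for W in models], axis=0)
    c = int(vote.argmax())
    dist = np.linalg.norm(vote - np.eye(n_classes)[c])
    return c, bool(dist <= cutoff[c])
```

Calling `committee_diagnose(x)` on a new expression vector `x` of length `n_genes` returns the class with the largest average vote and a flag saying whether the vote lies within that class's cutoff. With 10 inputs and 4 outputs, each weight matrix holds 11 × 4 = 44 values, which matches the parameter count given for the linear perceptron.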
