Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
From Nature Medicine 7(6) 2001
By Javed Khan et al.
(Summarized by Marcílio Souto, ICMC/USP São Carlos)

Abstract
• Small, round blue-cell tumors (SRBCTs)
• Four distinct categories that are hard to discriminate
• cDNA microarrays and Artificial Neural Networks (ANNs)
• Tumor diagnosis and the identification of candidate targets for therapy

The Problem
• SRBCTs of childhood
  • Neuroblastoma (NB)
  • Rhabdomyosarcoma (RMS)
  • Non-Hodgkin lymphoma (NHL)
  • The Ewing family of tumors (EWS)
• All four categories have a similar appearance in routine histology
• Accurate diagnosis is essential
• In clinical practice
  • Immunohistochemistry: the detection of protein expression
  • Reverse transcription-PCR: detection of tumor-specific translocations
    • EWS-FLI1 in EWS and PAX3-FKHR in alveolar RMS (ARMS)

The Approach
• Gene-expression profiling using cDNA microarrays
  • A simultaneous analysis of multiple markers
  • Multiple categorical distinctions
• Artificial neural networks (ANNs), already used for
  • Diagnosing myocardial infarcts
  • Diagnosing arrhythmias from electrocardiograms
  • Interpreting radiographs
  • Interpreting magnetic resonance images

The Experiment
• cDNA microarray with 6,567 genes
• 63 training examples
  • Tumor biopsy material
  • Cell lines
• Filtering for a minimal level of expression
  • 2,308 genes remain
• PCA further reduced the dimensionality
  • The 10 dominant PCA components were used (63% of the variance in the data matrix)
• Three-fold cross-validation
• 3,750 ANNs were constructed (average vote)
• No overfitting and zero classification error in the training sample

Data Sets
Cancer type     Training and validation samples     Test samples
I (EWS)         23                                  6
II (RMS)        20                                  5
III (NB)        12                                  6
IV (BL)         8                                   3
Unlabeled       0                                   5 (non-SRBCT)
Total           63                                  25

The Schematic View of the Analysis Process
[Figure]

Data Analysis
• Initial Cuts
• Principal Components Analysis
• Artificial Neural Network Prediction
• Extraction of Relevant Genes

Data Analysis: Initial Cuts and PCA
• Initial Cuts
  • A gene is omitted if, for any of the samples, its red intensity (ri) is less than 20
  • From 6,567 to 2,308 genes
• Principal Components Analysis (PCA)
  • Reduces the dimensionality of the data to 10 components: 2,308 genes become 10 inputs
  • This number (10) was found by means of preliminary experiments

Data Analysis: Artificial Neural Network (1/3)
• Architecture and Parameters
  • Linear Perceptron (LP)
  • 10 inputs representing the PCA components
  • 4 output nodes, one for each class of tumor (EWS, BL, NB and RMS)
  • 44 free parameters (10 × 4 weights plus four threshold units)
• Calibration (training) was performed using JETNET
  • Learning rate η = 0.7; momentum = 0.3
  • Learning rate decreased after each epoch (factor 0.99)
  • Initial weights randomly chosen from [-r, r], with r = 0.1/F
  • Weights updated after every 10 epochs
  • At most 100 epochs

Data Analysis: Artificial Neural Network (2/3)
• Calibration and Validation
  • 3-fold cross-validation
    • The 63 labeled samples are randomly shuffled and split into 3 equally sized groups
    • The network is trained with two of these groups and the third is used for validation
    • This procedure is repeated 3 times, so that each group is used for validation once
  • The random shuffling is redone 1,250 times
    • 1,250 × 3 = 3,750 networks
  • For validation, the output for a sample is the average over the 1,250 networks that did not train on it (a committee)
  • For test samples, the committee is formed with all 3,750 networks
  • 25 samples in the test set

Data Analysis: Artificial Neural Network (3/3)
• Assessing the quality of classifications
  • Each sample is classified as belonging to the cancer type with the largest average committee vote
  • Rejection of the second-largest class, or of samples that do not belong to any of the classes
  • A distance from a sample to the ideal vote is defined for each cancer type
  • Based on the validation set, an empirical distribution of this distance is generated for each type of cancer
  • For a given test sample, the system can reject a possible classification based on these probability distributions
  • Note: the classification, as well as the extraction of important genes, converges using fewer than 100 networks
  • The only reason 3,750 networks were used is to have sufficient statistics for these empirical probability distributions

Relevant Gene Extraction
• To select relevant genes, the authors proposed a sensitivity measure (S) of the outputs (o) with respect to any of the 2,308 input variables, summed over the samples and outputs
• All 3,750 networks are involved
• They also proposed a related measure for a single output
• Genes can thus be ranked by their importance for the total classification, and also by their importance for each disease separately
• They explored using the top 6, 12, 24, 48, 96, 192, 384, 768 and 1,536 ranked genes
• For each choice, training (calibration) was redone

Summed Square Error Graph
[Figure]

Optimizations of Genes Utilized for Classification
• Using the 3,750 trained models, rank all genes according to their significance for the classification
• Determine the classification error rate using an increasing number of these ranked genes

Recalibrating the ANNs
• Using only 96 genes, the analysis process was repeated
• Zero classification error

Diagnostic Classification
• 25 test examples (5 non-SRBCTs)
• If a sample falls outside the 95th percentile of the probability distribution of distances between samples and their ideal output, its diagnosis is rejected

Multi-Dimensional Scaling (MDS)
• Using 96 genes

Hierarchical Clustering of 96 Genes
- 93 unique genes (IGF2 appears 3 times and MYC twice)
- 13 ESTs
- 41 genes have not been reported as associated with these diseases
- Perfect clustering of the four categories

Expression of FGFR4 on SRBCT Tissue Array
• At the protein level: immunohistochemistry on SRBCT tissue arrays for the expression of fibroblast growth factor receptor 4 (FGFR4)
• FGFR4
  • Expressed during myogenesis (not in adult muscle)
  • Potential role in tumor growth
  • Prevention of terminal differentiation in muscle
• Strong cytoplasmic immunostaining for FGFR4 was seen in all 26 RMSs tested

Discussion
• Current diagnoses of tumors rely on histology (morphology) and immunohistochemistry (protein expression)
• Using cDNA microarrays
  • Multiple markers (robust)
  • Reveal the underlying genetic aberrations or biological processes
• Tumors and cell lines
  • Cell lines for ANN calibration

Reference
• J. Khan et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks", Nature Medicine, Vol. 7, Number 6, June 2001, and the references therein.
• Analysis Methods Supplement for Nature Medicine, Vol. 7, Number 6, June 2001.
• http://medicine.nature.com
• M. Ringner, C. Peterson and J. Khan, "Analyzing array data using supervised methods", Pharmacogenomics, Vol. 3, Number 3, 2003.
• NIH News Release: "Gene Chips Accurately Diagnose Four Complex Childhood Cancers: Artificial Intelligence Used With Gene Expression Microarrays for the First Time".
• http://www.nih.gov/news/pr/may2001/nhgri-30.htm
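
Illustrative sketch
The cross-validation scheme and the committee vote with rejection described in the Data Analysis slides can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' JETNET code. All function names are hypothetical. The Euclidean distance to a one-hot "ideal vote" and the nearest-rank percentile are assumptions made here, since the slides do not spell out either definition.

```python
import math
import random

CLASSES = ["EWS", "BL", "NB", "RMS"]  # output node order as listed on the slide


def three_fold_splits(n_samples, n_shuffles, seed=0):
    """Yield (train, validation) index lists: 3 splits per random shuffle."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_shuffles):
        rng.shuffle(indices)
        folds = [indices[i::3] for i in range(3)]  # three equally sized groups
        for k in range(3):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, list(folds[k])


def committee_vote(outputs):
    """Average the per-network output vectors (one value per class)."""
    n = len(outputs)
    return [sum(o[k] for o in outputs) / n for k in range(len(outputs[0]))]


def distance_to_ideal(vote, class_index):
    """Distance from a vote to the ideal vote (1 for the class, 0 elsewhere).

    Assumption: plain Euclidean distance.
    """
    return math.sqrt(sum((v - (1.0 if k == class_index else 0.0)) ** 2
                         for k, v in enumerate(vote)))


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def diagnose(outputs, validation_distances, q=95):
    """Return (label or None, distance); None means the diagnosis is rejected."""
    vote = committee_vote(outputs)
    best = max(range(len(vote)), key=vote.__getitem__)
    d = distance_to_ideal(vote, best)
    cutoff = percentile(validation_distances[CLASSES[best]], q)
    return (CLASSES[best] if d <= cutoff else None), d
```

With 63 samples and 1,250 shuffles, `three_fold_splits(63, 1250)` produces 3,750 splits of 42 training and 21 validation samples, matching the number of networks on the slide. In `diagnose`, `validation_distances` is a hypothetical dictionary mapping each class name to the distances observed for validation samples of that class.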