Classification and diagnostic prediction
of cancers using gene expression
profiling and artificial neural networks
From Nature Medicine 7(6) 2001
By Javed Khan et al.
(Summarized by Marcílio Souto, ICMC/USP São Carlos)
Abstract
• Small, round blue-cell tumors (SRBCTs)
• Four distinct categories hard to
discriminate
• cDNA microarray and Artificial Neural
Networks (ANNs)
• Tumor diagnosis and the identification of
candidate targets for therapy
The Problem
• SRBCTs of childhood
• Neuroblastoma (NB)
• Rhabdomyosarcoma (RMS)
• Non-Hodgkin lymphoma (NHL)
• The Ewing family of tumors (EWS)
• All four categories look similar in routine histology
• Accurate diagnosis is essential
• In clinical practice
• Immunohistochemistry: the detection of protein
expression
• Reverse transcription-PCR: tumor-specific
translocation
• EWS-FLI1 in EWS and PAX3-FKHR in alveolar RMS (ARMS)
The Approach
• Gene-expression profiling using cDNA
microarrays
• A simultaneous analysis of multiple markers
• Multiple categorical distinctions
• Artificial neural networks (ANNs)
• Diagnosing myocardial infarcts
• Diagnosing arrhythmias from
electrocardiograms
• Interpreting radiographs
• Interpreting magnetic resonance images
The Experiment
• cDNA microarray with 6,567 genes
• 63 training examples
• Tumor biopsy material
• Cell lines
• Filtering for a minimal level of expression
• 2,308 genes
• PCA further reduced the dimensionality.
• 10 dominant PCA components were used (63% of the variance in the data matrix)
• Three-fold cross-validation
• 3,750 ANNs were constructed (average vote)
• No overfitting and zero classification error in the
training sample
Data Sets
Cancer type              Train and validation samples   Test samples
I (EWS)                  23                             6
II (RMS)                 20                             5
III (NB)                 12                             6
IV (BL)                  8                              3
Unlabeled (non-SRBCT)    0                              5
Total                    63                             25
The Schematic View of the
Analysis Process
Data Analysis
• Initial Cuts
• Principal Components Analysis
• Artificial Neural Network Prediction
• Extraction of Relevant Genes
Data Analysis: Initial Cuts and
PCA
• Initial Cuts
• A gene is omitted if, for any sample, its red intensity (ri) is less than 20
• From 6,567 to 2,308 genes
• Principal Components Analysis (PCA)
• Reduce the dimensionality of the data to 10 components: 2,308 genes to 10 inputs
• This number (10) was found by means of preliminary experiments
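The two preprocessing steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the function names `filter_genes` and `pca_project` are hypothetical, the data are random stand-ins rather than the study's microarray measurements, and PCA is computed here through a singular value decomposition, which may differ from the tooling the authors used.

```python
import numpy as np

def filter_genes(red_intensity, min_intensity=20.0):
    """Boolean mask of genes whose red intensity is >= min_intensity in every sample."""
    # red_intensity has shape (n_samples, n_genes); a gene is dropped if ANY sample is too dim.
    return (red_intensity >= min_intensity).all(axis=0)

def pca_project(expression, n_components=10):
    """Project samples onto the leading principal components (computed via SVD)."""
    centered = expression - expression.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of the total variance carried by the kept components.
    explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()
    return centered @ components.T, components, explained

# Synthetic stand-in data: 63 samples x 200 genes (the study started from 6,567 genes).
rng = np.random.default_rng(0)
red = rng.uniform(10.0, 2000.0, size=(63, 200))
expr = rng.normal(size=(63, 200))

mask = filter_genes(red)
inputs, components, explained = pca_project(expr[:, mask], n_components=10)
```

Here `inputs` holds the 10 values per sample that would feed the network. The explained-variance fraction on random data will not match the 63% reported for the real data matrix.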
Data Analysis: Artificial Neural
Network (1/3)
• Architecture and Parameters
• Linear Perceptron (LP)
• 10 inputs representing the PCA components
• 4 output nodes – one for each class of tumor (EWS, BL, NB
and RMS)
• 44 free parameters, including four threshold units
• Calibration (training) was performed using JETNET
• η = 0.7 (learning rate); momentum = 0.3
• Learning rate decreased after each epoch (factor 0.99)
• Initial weights randomly chosen from [-r, r], with r = 0.1/F
• Weights updated after every 10 epochs
• At most 100 epochs
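A minimal sketch of such a network, assuming NumPy, may help make the parameter count and the training settings concrete. Several details are assumptions of this sketch and not statements about JETNET or the authors' setup: sigmoid outputs, a squared-error loss, one full-batch weight update per epoch, and reading F as the number of inputs feeding each output node.

```python
import numpy as np

N_INPUTS, N_CLASSES = 10, 4  # 10 PCA components in, one output per tumor class
N_PARAMS = N_CLASSES * N_INPUTS + N_CLASSES  # 40 weights + 4 thresholds = 44

def init_params(rng, fan_in=N_INPUTS):
    """Draw weights and thresholds uniformly from [-r, r] with r = 0.1 / fan_in."""
    r = 0.1 / fan_in  # assumption: F is the fan-in of the output nodes
    weights = rng.uniform(-r, r, size=(N_CLASSES, N_INPUTS))
    thresholds = rng.uniform(-r, r, size=N_CLASSES)
    return weights, thresholds

def forward(weights, thresholds, x):
    """Network outputs for inputs x of shape (n_samples, N_INPUTS)."""
    # Sigmoid outputs are an assumption of this sketch.
    return 1.0 / (1.0 + np.exp(-(x @ weights.T + thresholds)))

def train(x, targets, rng, lr=0.7, momentum=0.3, decay=0.99, epochs=100):
    """Gradient descent with momentum and a learning rate that shrinks each epoch."""
    weights, thresholds = init_params(rng)
    vel_w = np.zeros_like(weights)
    vel_b = np.zeros_like(thresholds)
    for _ in range(epochs):
        out = forward(weights, thresholds, x)
        # Gradient of the mean squared error, propagated through the sigmoid.
        delta = (out - targets) * out * (1.0 - out) / len(x)
        vel_w = momentum * vel_w - lr * (delta.T @ x)
        vel_b = momentum * vel_b - lr * delta.sum(axis=0)
        weights += vel_w
        thresholds += vel_b
        lr *= decay  # learning rate decreased after each epoch
    return weights, thresholds
```

With one-hot targets for the four classes, `train` returns the 44 fitted parameters, and `forward` gives the four output values for any sample.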
Data Analysis: Artificial Neural
Network (2/3)
• Calibration and Validation
• 3-fold cross-validation
• 63 labeled samples are randomly shuffled and split into
3 equally sized groups
• The network is trained with two of these groups and the
third is used for validation
• This procedure is repeated 3 times
• The random shuffling is redone 1250 times
• 3750 networks
• For validation, the committee output for a sample is the average over
the 1250 networks for which that sample was held out
• For test samples, the committee is formed with all 3750
networks
• 25 samples in the test set
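The bookkeeping behind the 3750 networks can be illustrated with the standard library alone. The helper names `three_fold_splits` and `committee_vote` are hypothetical, and the sketch only generates index splits and averages output vectors; it does not train any network.

```python
import random
from collections import Counter

def three_fold_splits(n_samples, n_shuffles, seed=0):
    """Yield (train_indices, validation_indices) pairs, three per random shuffle."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_shuffles):
        rng.shuffle(indices)
        # Three equally sized groups when n_samples is divisible by 3 (63 -> 21 each).
        folds = [indices[i::3] for i in range(3)]
        for held_out in range(3):
            train = [i for f in range(3) if f != held_out for i in folds[f]]
            yield train, list(folds[held_out])

def committee_vote(outputs):
    """Average a list of per-network output vectors into one committee vote."""
    n = len(outputs)
    return [sum(column) / n for column in zip(*outputs)]

# 1250 shuffles x 3 folds = 3750 train/validation splits, one network per split.
splits = list(three_fold_splits(n_samples=63, n_shuffles=1250))

# Each sample is held out once per shuffle, so 1250 networks validate it.
held_out_counts = Counter(i for _, validation in splits for i in validation)
```

For a labeled sample, `committee_vote` would be applied to the outputs of the 1250 networks that did not see it during training; for a test sample, to the outputs of all 3750 networks.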
Data Analysis: Artificial Neural
Network (3/3)
• Assessing the quality of classifications
• Each sample is classified as belonging to the cancer type
corresponding to the largest average committee vote
• Rejection of the second-largest class, or of samples that do not
belong to any of the classes
• Definition of a distance from a sample to the ideal vote for
each cancer type
• Based on the validation set, for each type of cancer an
empirical distribution of its distance is generated
• For a given test sample, the system can reject possible
classification based on these probability distributions
• Note: the classification, as well as the extraction of
important genes, converges using fewer than 100 networks
• The only reason 3750 networks were used is to have
sufficient statistics for these empirical probability
distributions
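The rejection rule can be sketched as follows. The exact distance used by the authors is not given in these slides, so the form below (half the squared difference to a one-hot ideal vote, which equals 1 between two different ideal votes) is an assumption, as are the helper names and the linear-interpolation percentile. The 95th percentile matches the cutoff quoted later in the deck.

```python
import math

def distance_to_ideal(vote, class_index):
    """Half the squared distance between a committee vote and the one-hot ideal vote."""
    return 0.5 * sum(
        (v - (1.0 if i == class_index else 0.0)) ** 2 for i, v in enumerate(vote)
    )

def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def class_cutoffs(validation_votes, labels, n_classes, q=95.0):
    """Per-class distance cutoff from the empirical distribution on validation samples."""
    cutoffs = []
    for c in range(n_classes):
        distances = [
            distance_to_ideal(v, c) for v, y in zip(validation_votes, labels) if y == c
        ]
        cutoffs.append(percentile(distances, q))
    return cutoffs

def diagnose(vote, cutoffs):
    """Winning class index, or None when the vote is too far from that class's ideal."""
    winner = max(range(len(vote)), key=lambda i: vote[i])
    return winner if distance_to_ideal(vote, winner) <= cutoffs[winner] else None
```

A test sample whose committee vote is spread evenly over the four outputs lies far from every ideal vote, so `diagnose` returns `None` for it, which is the behavior wanted for the non-SRBCT samples.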
Relevant Gene Extraction
• In order to select relevant genes, the authors proposed a
sensitivity measure (S) of the outputs (o) with respect to any
of the 2308 input variables, summed over the number of
samples and outputs
• All 3750 networks are involved
• They also proposed a related measure for a single output
• Thus, they can rank the genes according to their importance
for the total classification, and also according to their
importance for each disease separately
• They explored for 6, 12, 24, 48, 96, 192, 384, 768 and 1536
genes
• For each choice training (calibration) was redone
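One way to picture the sensitivity measure is as the absolute derivative of each output with respect to each gene, averaged over samples and outputs. The sketch below assumes NumPy, a single network with sigmoid outputs, and genes that reach the network only through the PCA projection; the function names are hypothetical, and averaging (not summing) is used, which changes the scale but not the ranking. In the study all 3750 networks contribute, which would correspond to averaging this quantity over networks.

```python
import numpy as np

def gene_sensitivity(weights, thresholds, components, expression):
    """Mean absolute derivative of the outputs with respect to each gene.

    weights: (n_outputs, n_components); components: (n_components, n_genes);
    expression: (n_samples, n_genes), already centered.
    Returns (total, per_output) with shapes (n_genes,) and (n_outputs, n_genes).
    """
    pcs = expression @ components.T
    out = 1.0 / (1.0 + np.exp(-(pcs @ weights.T + thresholds)))  # sigmoid: an assumption
    slope = out * (1.0 - out)                     # d(output)/d(pre-activation), per sample
    gene_weights = np.abs(weights @ components)   # |d(pre-activation)/d(gene)|
    per_output = slope.mean(axis=0)[:, None] * gene_weights
    return per_output.mean(axis=0), per_output

def top_genes(sensitivity, k):
    """Indices of the k most sensitive genes, most important first."""
    return np.argsort(sensitivity)[::-1][:k]

# Gene-set sizes explored in the study: 6, 12, 24, ..., 1536 (doubling each time).
GENE_COUNTS = [6 * 2 ** i for i in range(9)]
```

Ranking by `total` corresponds to importance for the overall classification, while each row of `per_output` gives a ranking for one disease.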
Summed Square Error Graph
Optimizations of Genes Utilized for
Classification
• Using 3,750 trained models, rank all genes according to their
significance for the classification
• Determine the classification error rate using an increasing number of
these ranked genes
Recalibrating the ANNs
• Using only 96 genes, the analysis process was repeated
• Zero classification error
Diagnostic Classification
• 25 test examples (5 non-SRBCTs)
• If a sample falls outside the 95th percentile of the probability
distribution of distances between samples and their ideal output,
its diagnosis is rejected
Multi-Dimensional Scaling (MDS)
• Using 96 genes
Hierarchical Clustering of 96 Genes
- 93 unique genes (IGF2 appears 3 times and MYC twice)
- 13 ESTs
- 41 genes have not been reported as
associated with these diseases.
- Perfect clustering of four categories
Expression of FGFR4 on SRBCT
Tissue Array
• Immunohistochemistry on SRBCT tissue arrays was used to
assess expression of fibroblast growth factor receptor 4
(FGFR4) at the protein level
• FGFR4
• Expressed during myogenesis (not in adult muscle)
• Potential role in tumor growth
• Prevention of terminal differentiation in muscle
• Strong cytoplasmic immunostaining for FGFR4
was seen in all 26 RMSs tested.
Discussion
• Current diagnoses of tumors rely on histology
(morphology) and immunohistochemistry
(protein expression)
• Using cDNA microarrays
• Multiple markers (robust)
• Reveal the underlying genetic aberrations or
biological processes
• Tumors and cell lines
• Cell lines for ANN calibration
Reference
• J. Khan et al., "Classification and diagnostic prediction of
cancers using gene expression profiling and artificial
neural networks", Nature Medicine, Vol. 7, Number 6, June
2001, and the references therein.
• Analysis Methods Supplement for Nature Medicine, Vol. 7,
Number 6, June 2001.
• http://medicine.nature.com
• M. Ringnér, C. Peterson and J. Khan, "Analyzing array
data using supervised methods", Pharmacogenomics,
Vol. 3, Number 3, 2002.
• NIH News Release: "Gene Chips Accurately Diagnose Four
Complex Childhood Cancers: Artificial Intelligence Used
With Gene Expression Microarrays for the First Time".
• http://www.nih.gov/news/pr/may2001/nhgri-30.htm