Recursive Partitioning for Tumor Classification with Gene Expression Microarray Data
Heping Zhang, Chang-Yung Yu, Burton Singer, Momiao Xiong
Presented by Weihua Huang
Data Used in the Article
Expression profiles of 2,000 genes, measured with an Affymetrix
oligonucleotide array in 22 normal and 40 colon cancer
tissues.
The response is binary (normal or cancer tissue), and the
predictor variables are the expression levels of the 2,000 genes.
Classification Tree Using Recursive Partitioning
Goal:
To partition the feature space into disjoint regions by growing a
tree so that the tissues within each region are homogeneous in
terms of response.
Algorithm:
Start with a root node containing the study sample and split it
into smaller and smaller nodes according to whether a selected
predictor is above a chosen cutoff value. At each splitting step,
the predictor and its cutoff value are chosen to maximize the
reduction in node impurity
ΔI = P(A) I(A) - P(A_L) I(A_L) - P(A_R) I(A_R),
where A is the node being split, A_L and A_R are its left and right
daughter nodes, I(·) is the node impurity, and P(·) is the
probability that a tissue falls into the node.
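The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes entropy as the impurity, estimates P(·) by the fraction of tissues in a node, and tries midpoints between consecutive observed values as candidate cutoffs. The function names and the toy data are hypothetical.

```python
import numpy as np

def entropy(p):
    """Entropy impurity of a node; p is the fraction of normal tissues in it."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # pure node, using the convention 0 * log(0) = 0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def impurity_reduction(y_node, go_left, n_total):
    """Delta I = P(A)I(A) - P(A_L)I(A_L) - P(A_R)I(A_R) for one candidate split.

    y_node:  labels in node A (1 = normal, 0 = cancer)
    go_left: boolean mask sending tissues to the left daughter node
    n_total: size of the whole study sample, used to estimate P(.)
    """
    def weighted(labels):
        if len(labels) == 0:
            return 0.0
        return len(labels) / n_total * entropy(labels.mean())

    return weighted(y_node) - weighted(y_node[go_left]) - weighted(y_node[~go_left])

def best_split(x, y, n_total):
    """Scan the cutoffs of a single predictor x; return (cutoff, delta_I)."""
    values = np.unique(x)
    best_cutoff, best_gain = None, -np.inf
    # Assumption: candidate cutoffs are midpoints between consecutive values.
    for cutoff in (values[:-1] + values[1:]) / 2:
        gain = impurity_reduction(y, x <= cutoff, n_total)
        if gain > best_gain:
            best_cutoff, best_gain = cutoff, gain
    return best_cutoff, best_gain

# Toy example (made-up numbers): one gene that separates the classes perfectly.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
cutoff, gain = best_split(x, y, n_total=len(y))
print(cutoff, gain)
```

To choose among many genes, the same search would be repeated for each column of the expression matrix and the gene with the largest ΔI kept.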
Classification Tree Using Recursive Partitioning
Node impurity:
One measure of node impurity is the entropy function
I(P) = - P log(P) - (1-P) log(1-P),
where P is the probability of a tissue being normal within the
node.
• Minimum impurity (= 0):
when all tissues within the node are of the same type (P = 0 or 1)
• Maximum impurity (= log 2):
when half of the tissues within the node are normal and half are
cancer (P = 0.5)
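The two extremes can be verified directly from the entropy function. The short derivation below assumes natural logarithms and the usual convention 0 log 0 = 0, neither of which the slide states explicitly.

```latex
\begin{align*}
I(P) &= -P\log P - (1-P)\log(1-P), \\
I(0) &= I(1) = 0 \quad \text{(using } 0\log 0 = 0\text{)}, \\
\frac{dI}{dP} &= -\log P - 1 + \log(1-P) + 1 = \log\frac{1-P}{P} = 0
  \;\Longrightarrow\; P = \tfrac{1}{2}, \\
\frac{d^{2}I}{dP^{2}} &= -\frac{1}{P} - \frac{1}{1-P} < 0
  \quad \text{(so } P = \tfrac{1}{2} \text{ is a maximum)}, \\
I\!\left(\tfrac{1}{2}\right) &= -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2 .
\end{align*}
```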
Results From Classification Tree on the Data
Fig. 1. Classification tree for tissue types using expression data from three
genes (M26383, R15447, M28214)
Another Way to Visualize the Recursive Partitioning
Fig. 3. A scatterplot of expression data from R15447 and M28214 for a
subset of tissues (node 3 in Fig. 1).
Results from Recursive Partitioning
Quality of the tree-based classification:
Using the localized 5-fold cross-validation error rate:
• The same genes are kept at the same nodes.
• Randomly divide the 40 cancer tissues into 5 subsamples
of 8, and the 22 normal tissues into 5 subsamples of
4, 4, 4, 5, and 5. Four subsamples each from the cancer and
normal tissues are used to choose the cutoff values for
the three splits; the remaining samples are used to
count the tissues misclassified under the new cutoff
values.
• The error rate is between 6% and 8% in two runs of
cross-validation, which is much lower than that obtained by
existing analyses.
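The localized cross-validation procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the analysis from the article: it uses a single split on one fixed gene instead of the three-split tree, picks the cutoff by the number of training misclassifications, and runs on synthetic data. All function names are hypothetical.

```python
import numpy as np

def stratified_folds(n_cancer=40, n_normal=22, k=5, seed=0):
    """Randomly split each class into k subsamples; return k held-out index arrays.

    Assumed layout: indices 0..n_cancer-1 are cancer, the rest are normal.
    """
    rng = np.random.default_rng(seed)
    cancer = np.array_split(rng.permutation(n_cancer), k)             # 5 x 8
    normal = np.array_split(n_cancer + rng.permutation(n_normal), k)  # 5, 5, 4, 4, 4
    return [np.concatenate([c, n]) for c, n in zip(cancer, normal)]

def fit_stump(x, y):
    """Choose the cutoff on a fixed gene x with the fewest training errors.

    y: 1 = normal, 0 = cancer. Each daughter node predicts its majority class.
    """
    values = np.unique(x)
    best = None
    for cutoff in (values[:-1] + values[1:]) / 2:
        left = x <= cutoff
        pred_left = int(y[left].mean() >= 0.5)
        pred_right = int(y[~left].mean() >= 0.5)
        errors = np.sum(np.where(left, pred_left, pred_right) != y)
        if best is None or errors < best[0]:
            best = (errors, cutoff, pred_left, pred_right)
    return best[1:]

def localized_cv_error(x, y, folds):
    """Keep the gene fixed, re-choose only the cutoff on each training part."""
    wrong = 0
    for held_out in folds:
        train = np.ones(len(y), dtype=bool)
        train[held_out] = False
        cutoff, pred_left, pred_right = fit_stump(x[train], y[train])
        pred = np.where(x[held_out] <= cutoff, pred_left, pred_right)
        wrong += np.sum(pred != y[held_out])
    return wrong / len(y)

# Synthetic expression values for one gene (made-up, for illustration only).
rng = np.random.default_rng(1)
y = np.array([0] * 40 + [1] * 22)
x = rng.normal(loc=np.where(y == 1, 2.0, 0.0), scale=1.0)
folds = stratified_folds()
print(localized_cv_error(x, y, folds))
```

Because the gene at each node is fixed and only the cutoffs are re-estimated, this procedure measures how sensitive the classification is to the cutoff values rather than to the choice of genes.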
Correlation Analysis on Genes
Functional expressions from various genes are
correlated.
Examine the correlation patterns of the three
selected genes in Fig. 1.
Correlation Between the Three Selected Genes and the
Remaining Expression Data
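A correlation analysis of this kind might be computed as in the sketch below. It assumes Pearson correlation (the slides do not name the coefficient) and uses a synthetic expression matrix; the function name and column indices are hypothetical.

```python
import numpy as np

def correlation_with_rest(X, selected):
    """Pearson correlation of each selected gene with every remaining gene.

    X:        (n_tissues, n_genes) expression matrix
    selected: column indices of the selected genes
    Returns (corr, rest), where corr has shape (len(selected), len(rest))
    and rest holds the column indices of the remaining genes.
    """
    corr = np.corrcoef(X, rowvar=False)  # genes are columns
    rest = np.setdiff1d(np.arange(X.shape[1]), selected)
    return corr[np.ix_(selected, rest)], rest

# Synthetic data: 62 tissues, 50 genes; gene 10 is a noisy copy of gene 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(62, 50))
X[:, 10] = X[:, 0] + 0.1 * rng.normal(size=62)

corr, rest = correlation_with_rest(X, selected=[0, 1, 2])
# For each selected gene, report the remaining gene it correlates with most.
strongest = rest[np.argmax(np.abs(corr), axis=1)]
print(strongest)
```

Genes that correlate strongly with a selected gene are natural candidates to replace it in an alternative tree.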
Another Tree Based on a Different Set of Three Genes
Fig. 6. Classification tree for tissue types using expression data from three
genes (R87126, T62947, X15183)
Correlation Matrix Among Genes in Fig. 1 and Fig. 6
Advantages of the Classification Tree
1. Efficient with a large number of genes
2. Automatically selects valuable and user-friendly
genes as predictors
3. More precise than some other classification
methods, such as support vector machines and linear
discriminant analysis
Conclusions:
1. It is likely that the information contained in a
large number of genes can be captured by a
small optimal set of genes without significant
loss of information.
2. The precision of classification of recursive
partitioning is important for clinical application.