
Recursive Partitioning for Tumor Classification with Gene Expression Microarray Data

Heping Zhang, Chang-Yung Yu, Burton Singer, Momiao Xiong
Presented by Weihua Huang

Data used in the article

Expression profiles of 2,000 genes, measured with an Affymetrix oligonucleotide array in 22 normal and 40 colon cancer tissues.
The response is binary, indicating normal or cancer tissue, and the predictor variables are the expression levels of the 2,000 genes.

Classification Tree Using Recursive Partitioning

Goal: To partition the feature space into disjoint regions by growing a tree so that the tissues in the same region are homogeneous in terms of response.

Algorithm: Start with a root node containing the study sample and split it into smaller and smaller nodes according to whether a particular selected predictor is above a chosen cutoff value. At each splitting step, the selected predictor and its cutoff value are chosen to maximize the reduction in node impurity when node A is split into a left daughter node A_L and a right daughter node A_R:

ΔI = P(A) I(A) - P(A_L) I(A_L) - P(A_R) I(A_R)

Here P(·) is the probability that a tissue falls into the given node and I(·) is the impurity of that node.

Node impurity: One example of a node impurity measure is the entropy function

I = - P log(P) - (1 - P) log(1 - P),

where P now denotes the probability of a tissue being normal within the node.
• Minimum impurity (= 0): all tissues within the node are of the same type (P = 0 or 1).
• Maximum impurity (= log 2): half of the tissues within the node are normal and half are cancer (P = 0.5).

Results From Classification Tree on the Data

Fig. 1. Classification tree for tissue types using expression data from three genes (M26383, R15447, M28214).

Another Way to Visualize the Recursive Partitioning

Fig. 3. A scatterplot of expression data from R15447 and M28214 for a subset of tissues (node 3 in Fig. 1).
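The impurity-reduction criterion from the algorithm slide can be made concrete with a small sketch. This is an illustration only, not the authors' implementation: the function names are hypothetical, the natural logarithm is assumed (consistent with a maximum impurity of log 2), node probabilities are estimated as sample fractions, candidate cutoffs are taken as midpoints between adjacent observed expression values, and the left daughter node is assumed to hold the tissues at or below the cutoff.

```python
import math


def entropy(p):
    """Entropy impurity -p log(p) - (1-p) log(1-p); defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def node_impurity(labels):
    """Impurity of a node; labels are 1 for normal tissue and 0 for cancer."""
    if not labels:
        return 0.0
    return entropy(sum(labels) / len(labels))


def impurity_reduction(values, labels, cutoff, n_total):
    """Delta I = P(A) I(A) - P(A_L) I(A_L) - P(A_R) I(A_R) for one candidate split.

    Assumption: each P(.) is estimated as (tissues in node) / n_total.
    """
    left = [y for x, y in zip(values, labels) if x <= cutoff]
    right = [y for x, y in zip(values, labels) if x > cutoff]
    p_a = len(labels) / n_total
    p_left = len(left) / n_total
    p_right = len(right) / n_total
    return (p_a * node_impurity(labels)
            - p_left * node_impurity(left)
            - p_right * node_impurity(right))


def best_split(columns, labels, n_total=None):
    """Search every gene and every midpoint cutoff for the largest Delta I.

    `columns` maps a gene name to its expression values for the tissues in the
    node; returns (gene, cutoff, reduction).
    """
    n_total = n_total or len(labels)
    best = (None, None, -1.0)
    for gene, values in columns.items():
        levels = sorted(set(values))
        for lo, hi in zip(levels, levels[1:]):
            cutoff = (lo + hi) / 2.0
            gain = impurity_reduction(values, labels, cutoff, n_total)
            if gain > best[2]:
                best = (gene, cutoff, gain)
    return best
```

Growing a full tree would repeat `best_split` on each daughter node, passing the size of the root sample as `n_total`, until a stopping rule is met; the slides do not state which stopping rule the article uses.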
Results from Recursive Partitioning

Quality of the tree-based classification, measured by a localized 5-fold cross-validation error rate:
• The same genes are kept at the same nodes.
• The 40 cancer tissues are randomly divided into 5 subsamples of 8, and the 22 normal tissues into 5 subsamples of 4, 4, 4, 5, and 5; four subsamples each from the cancer and normal tissues are used to choose the cutoff values for the three splits.
• The remaining samples are used to count the tissues misclassified under the new cutoff values.

The error rate is between 6% and 8% in two runs of cross-validation, which is much better than that obtained by existing analyses.

Correlation Analysis on Genes

Expression levels of different genes are correlated. Examine the correlation patterns of the three selected genes in Fig. 1.

Correlation Between the Three Selected Genes and the Remaining Expression Data

Another Tree Based on a Different Set of Three Genes

Fig. 6. Classification tree for tissue types using expression data from three genes (R87126, T62947, X15183).

Correlation Matrix Among Genes in Fig. 1 and Fig. 6

Advantages of the Classification Tree

1. Efficient with a large number of genes
2. Automatically selects valuable and user-friendly genes as predictors
3. More precise than some other classification methods, such as support vector machines and linear discriminant analysis

Conclusions

1. It is likely that the information contained in a large number of genes can be captured by a small optimal set of genes without significant loss of information.
2. The classification precision of recursive partitioning is important for clinical application.
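The localized 5-fold cross-validation described in the results can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the article: all function names are hypothetical, the fixed three-gene tree is reduced to a single split on one fixed gene, and the cutoff is re-selected by minimizing training misclassifications as a stand-in for the impurity criterion. Labels are assumed to be 0 for cancer and 1 for normal, with the cancer tissues listed first.

```python
import random


def stratified_folds(n_cancer=40, n_normal=22, k=5, seed=0):
    """Randomly split tissue indices into k folds, stratified by tissue type.

    Cancer tissues are indexed 0..n_cancer-1 and normal tissues follow them.
    With the defaults, every fold gets 8 cancer tissues, and the normal
    tissues are dealt out in subsamples of sizes 5, 5, 4, 4, 4.
    """
    rng = random.Random(seed)
    cancer = list(range(n_cancer))
    normal = list(range(n_cancer, n_cancer + n_normal))
    rng.shuffle(cancer)
    rng.shuffle(normal)
    folds = [[] for _ in range(k)]
    for group in (cancer, normal):
        for position, index in enumerate(group):
            folds[position % k].append(index)
    return folds


def choose_cutoff(values, labels):
    """Pick the midpoint cutoff and direction with the fewest training errors.

    Returns (cutoff, above), where `above` is the class predicted for tissues
    whose expression exceeds the cutoff. Assumes at least two distinct values.
    """
    levels = sorted(set(values))
    if len(levels) < 2:
        raise ValueError("need at least two distinct expression values")
    best = None
    for lo, hi in zip(levels, levels[1:]):
        cutoff = (lo + hi) / 2.0
        for above in (0, 1):
            errors = sum((above if x > cutoff else 1 - above) != y
                         for x, y in zip(values, labels))
            if best is None or errors < best[0]:
                best = (errors, cutoff, above)
    return best[1], best[2]


def localized_cv_error(values, labels, folds):
    """Keep the gene fixed, re-select its cutoff on the training folds,
    and count misclassified tissues in the held-out fold."""
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        cutoff, above = choose_cutoff([values[i] for i in train],
                                      [labels[i] for i in train])
        for i in held_out:
            predicted = above if values[i] > cutoff else 1 - above
            errors += predicted != labels[i]
    return errors / len(labels)
```

Extending the sketch to the three-split tree would mean re-selecting one cutoff per node while leaving the gene assigned to each node unchanged, which is what makes the cross-validation "localized".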
