Microarray Expression Cytogenetic application in classification and prediction of survival times

Kate, Jie Hu

Introduction and Motivation

Many research groups are currently trying to differentiate tumor cells from normal cells based on microarray expression profiles, an approach expected to have wide application in future clinical diagnosis and treatment. However, few of them take into account the cytogenetic aberrations of cancer cells. Chromosome instability is well documented as a common feature of many tumor types [1]. More recent findings show that epigenetic alterations may not only occur locally but can also span large chromosomal regions [2,3]. Together, these results suggest a new tumor classification approach that recognizes differences in expression at the cytogenetic level. We propose to first group neighboring genes into blocks and to treat these blocks as the functional units of our analysis. The resulting cytogenetic expression profile may capture many causes of cancer, such as translocation, deletion, or the up-regulation and down-regulation of chromosome regions.

Another potential application of the cytogenetic expression profile is predicting survival times. Survival time measures the period from a subject's entry into the study to the recurrence of cancer or death. Because survival times vary widely, predicting them may be more informative than predicting a binary outcome. To date, little research has linked survival times to expression profiles, because of the high dimensionality of microarray data. Experiments usually generate expression values for thousands of genes, many of them highly collinear, while sample sizes are small relative to the number of genes. A predictive model therefore has more parameters than samples, which makes the parameters almost impossible to estimate. Many purely statistical methods, such as Principal Component Analysis (PCA), have been developed to address this problem.
Here we combine our knowledge of gene organization, as provided by cytogenetic expression profiles, with statistical methods to simplify the model. By comparing its performance with that of purely statistical methods, we can find out whether expression cytogenetics offers another approach to predicting survival outcomes.

Aim 1 Classification

Methods:

Datasets: The Gene Expression Omnibus (GEO) website hosts many microarray data sets for different types of tumor cells and the corresponding normal cells. Some of them include survival outcomes as well. One example is GSE5900/GSE2658, which contains bone marrow plasma cells from healthy donors (N=22) and from patients with multiple myeloma (N=559). The multiple myeloma patients in this dataset also have censored survival time information [4].

Building chromosome blocks: To group genes by location, we first need to map each gene onto its chromosome. We can then merge the normalized expression levels of neighboring genes using the arithmetic mean. The grouping principle can vary: one block for every few million base pairs, one block for every few thousand genes, or smoothing over nearby genes.

Selection of blocks: The previous step reduces the dimension of the data dramatically, but many blocks may have similar expression levels in normal and cancer cells and therefore carry no information for classification. We want to remove these blocks. In a training set containing both normal and tumor cells, we conduct an independent two-sample t test, assuming equal variances, on each block. Blocks with smaller p-values show stronger evidence of a difference between normal and tumor cells and are therefore selected.

Construction of templates and comparison of samples to templates: In the training set, we build two templates from the selected blocks, one for normal cells and one for tumor cells. The two templates are expected to show different color patterns. We can then compare other samples, from the test set or from another dataset, to the templates we created.
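The block-building, selection, and template steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposal's actual implementation: the function names, the 5 Mb block width, and the toy numbers are all made up. Blocks are ranked by the absolute pooled-variance t statistic rather than by p-value, which gives the same ordering when every block has the same group sizes. The sum of absolute block differences is used as one possible distance for comparing a sample to the templates.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def build_blocks(genes, block_size_bp=5_000_000):
    """Group genes into fixed-width chromosome blocks (width is an assumption).

    genes: list of (chromosome, position_bp, [normalized expression per sample]).
    Returns {(chromosome, block_index): [block mean per sample]}.
    """
    members = defaultdict(list)
    for chrom, pos, values in genes:
        members[(chrom, pos // block_size_bp)].append(values)
    # Arithmetic mean over the genes in each block, sample by sample.
    return {key: [mean(col) for col in zip(*rows)] for key, rows in members.items()}


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance must be > 0)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def select_blocks(blocks, is_tumor, top_k):
    """Keep the top_k blocks with the largest |t| between tumor and normal samples."""
    scores = {}
    for key, values in blocks.items():
        tumor = [v for v, t in zip(values, is_tumor) if t]
        normal = [v for v, t in zip(values, is_tumor) if not t]
        scores[key] = abs(pooled_t(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def build_templates(blocks, selected, is_tumor):
    """Average each selected block within each class to form one template per class."""
    return {
        "normal": [mean(v for v, t in zip(blocks[k], is_tumor) if not t) for k in selected],
        "tumor": [mean(v for v, t in zip(blocks[k], is_tumor) if t) for k in selected],
    }


def nearest_template(sample, templates):
    """Return the label of the template with the smallest sum of absolute differences."""
    def distance(template):
        return sum(abs(s - t) for s, t in zip(sample, template))
    return min(templates, key=lambda label: distance(templates[label]))


# Toy data (made up): four samples, the first two normal, the last two tumor.
is_tumor = [False, False, True, True]
genes = [
    ("chr1", 1_000_000, [1.0, 1.4, 3.0, 3.4]),
    ("chr1", 2_000_000, [1.2, 1.2, 3.2, 3.2]),
    ("chr1", 7_000_000, [2.0, 2.2, 2.1, 2.3]),
    ("chr2", 500_000, [5.0, 5.5, 5.1, 5.4]),
]
blocks = build_blocks(genes)
selected = select_blocks(blocks, is_tumor, top_k=1)
templates = build_templates(blocks, selected, is_tumor)
label = nearest_template([2.9], templates)  # new sample, restricted to selected blocks
```

Grouping by a fixed base-pair width is only one of the grouping principles mentioned in the text; grouping by gene count or smoothing would change `build_blocks` but leave the later steps unchanged.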
The distance between a sample and each template is then calculated as the sum of the absolute differences between the sample's and the template's values across the selected blocks (the Manhattan distance; the Euclidean distance could be used in the same way). The sample is assigned to the template type with the smaller distance.

Evaluation: From the test set, we can obtain the false negative and false positive rates. The overall error rate can be estimated as the proportion of misclassified test samples, that is, the false positives plus the false negatives divided by the number of test samples, and serves as an indicator of how well our method performs. We can also compare our performance with that of other approaches such as Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM).

Extension: Sometimes doctors may not be certain of a patient's disease type, and there may be several candidate tumor types that the sample could come from. A classification into multiple groups is then required. The procedure is similar to the two-group scenario. In the selection step, we use ANOVA instead of the t test to select the blocks that differ most across groups, and we create one template for each candidate cell type. The most challenging part is making comparisons across datasets: batch effects may make the results of different microarray experiments not directly comparable, so we need to apply suitable normalization techniques to address this problem.

Aim 2 Predicting survival outcomes

Selection of blocks: Using the same procedure as in Aim 1, we can order blocks by the extent of the difference between normal and tumor cells.

Building models: We propose two ways to build the models.

1. Build the Cox proportional hazards model:

$\lambda(t) = \lambda_0(t)\exp(\beta_1 T_1 + \beta_2 T_2 + \cdots + \beta_k T_k)$

The Cox proportional hazards (PH) model is commonly used to fit right-censored survival data. Here $\lambda(t)$ is the hazard of the subject at time $t$ and $\lambda_0(t)$ is the baseline hazard function. However, the Cox PH model usually cannot handle too many parameters, so we need to reduce the dimension of the data further. Principal Component Analysis (PCA) can be used here to find the top few directions that capture the most variance in the data.
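As a rough illustration of the PCA reduction, the sketch below finds only the first principal direction of a small samples-by-blocks matrix by power iteration, using just the Python standard library. The function name and the toy numbers are made up for illustration; in practice a numerical library would normally be used and several directions kept.

```python
from math import sqrt


def first_principal_component(rows, n_iter=200):
    """Leading principal direction and per-sample scores via power iteration.

    rows: one list of block values per sample (samples x blocks).
    Note: a start vector exactly orthogonal to the leading direction would
    stall the iteration; that corner case is ignored in this sketch.
    """
    n, p = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - col_means[j] for j in range(p)] for r in rows]

    direction = [1.0 / sqrt(p)] * p
    for _ in range(n_iter):
        # Apply X^T X to the current direction without forming the p x p matrix.
        scores = [sum(x * d for x, d in zip(row, direction)) for row in centered]
        new_dir = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(p)]
        norm = sqrt(sum(v * v for v in new_dir))
        if norm == 0.0:
            break
        direction = [v / norm for v in new_dir]

    # Project each centered sample onto the final direction.
    scores = [sum(x * d for x, d in zip(row, direction)) for row in centered]
    return direction, scores


# Toy matrix (made up): four samples, three blocks.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]
direction, scores = first_principal_component(rows)
```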
These directions are treated as the covariates $T_i$ and used to build the model.

2. Build the additive risk model:

$\lambda(t) = \lambda_0(t) + \beta_1 T_1 + \beta_2 T_2 + \cdots + \beta_k T_k$

Combined with the Lasso (least absolute shrinkage and selection operator), this model can handle more regression parameters [5]. The Lasso shrinks the coefficients and sets some of them exactly to zero, so it yields a sparse model. Since each $T_i$ here corresponds to a block, $\beta_i$ in this model is more interpretable than in the PCA approach.

Evaluation: To evaluate the performance of the method, we can use time-dependent ROC curves and the area under the curve (AUC) as the criterion. A larger AUC at time $t$ indicates better prediction [6].

Comparison to other analyses that do not incorporate biological knowledge: Purely statistical approaches have been proposed by Hongzhe Li's group [7,8], who combine PCA with sliced inverse regression or use a partial Cox regression method. We will compare our results with theirs to further assess performance.

Conclusion and Discussion

After the above analysis, we can find out whether this approach is better and simpler than other purely statistical classification and prediction methods. The limitations of the cytogenetic expression profile method can also be discussed. For example, a disease caused by a single gene may not be detected by the profile. In practice, we have to build a model for each cancer cell type specifically; there is no common formula into which we can plug numbers to get predicted survival times.

References

1. Pollack JR. Chromosome instability leaves its mark. Nat Genet. 2006 Sep;38(9):973-4.
2. Frigola J, Song J, Stirzaker C, Hinshelwood RA, Peinado MA, Clark SJ. Epigenetic remodeling in colorectal cancer results in coordinate gene suppression across an entire chromosome band. Nat Genet. 2006;38(5):540-9.
3. Clark SJ. Action at a distance: epigenetic silencing of large chromosomal regions in carcinogenesis. Hum Mol Genet. 2007;16 Spec No 1:R88-95.
4. Zhan F, Barlogie B, Arzoumanian V, Huang Y, et al. Gene-expression signature of benign monoclonal gammopathy evident in multiple myeloma is linked to good prognosis. Blood. 2007;109(4):1692-700.
5. Ma S, Huang J. Additive risk survival model with microarray data. BMC Bioinformatics. 2007;8:192.
6. Heagerty PJ, Lumley T, Pepe MS. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000 Jun;56(2):337-44.
7. Li H, Gui J. Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics. 2004 Aug 4;20 Suppl 1:i208-15.
8. Li L, Li H. Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics. 2004 Dec 12;20(18):3406-12. Epub 2004 Jul 15.