Microarray Expression Cytogenetic application in classification and
prediction of survival times
Kate, Jie Hu
Introduction and Motivation
Many research groups are currently trying to differentiate tumor cells from normal
cells based on microarray profiles, an approach believed to have wide application in
future clinical diagnosis and treatment. However, few of them take into account the
cytogenetic aberrations in cancer cells. Chromosome instability is well documented as
common across different types of tumors [1]. Recent findings show that epigenetic
alterations may not only occur locally but may also span large regions of a
chromosome [2,3]. Motivated by both results, we propose a new tumor classification
approach that recognizes differences in expression at the cytogenetic level. We propose
to first group neighboring genes into blocks and to treat these blocks as the functional
units in our analysis. The resulting cytogenetic expression profile may capture many
causes of cancer, such as translocation, deletion, or the up-regulation and
down-regulation of chromosome
regions. Another implication of the cytogenetic expression profile is its potential
application in predicting survival times. Survival time measures the period from a
subject's entry into the study to the recurrence of cancer or death. Because survival
times vary widely, predicting them may be more informative than predicting binary
outcomes. To date, little research links survival times to expression profiles because
of the high dimensionality of the data that microarray experiments
produce. Experiments usually generate expression values for thousands of highly
collinear genes, while the sample sizes are small relative to the number of
genes. A prediction model then has more parameters than samples, which makes
estimating the parameters almost impossible. Many purely statistical
methods, such as Principal Component Analysis, have been developed to address this
problem. Here we combine our knowledge of the organization of genes, as captured by
cytogenetic expression profiles, with statistical methods to simplify the model. By
comparing its performance with that of purely statistical methods, we can assess whether
expression cytogenetics offers another approach to predicting survival outcomes.
Aim 1: Classification
Methods:
Datasets: The Gene Expression Omnibus (GEO) website hosts many microarray
data sets for different types of tumor cells and corresponding normal cells. Some of
them include survival outcomes as well. For example, GSE5900/GSE2658 contain
bone marrow plasma cells from healthy donors (N=22) and from multiple
myeloma patients (N=559). The patients with multiple myeloma also have censored
survival times in the dataset [4].
Building chromosome blocks: To group genes by location, we first need to
map each gene onto its chromosome. We can then merge the normalized expression
levels of neighboring genes using the arithmetic mean. The grouping rule can vary:
one block for every few million base pairs, one block for every few thousand genes,
or a smoothing over nearby genes.
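As an illustration, the fixed-width variant of this block-building step might be sketched as follows. The function name `build_blocks`, the dictionary layout of the gene annotation, and the 5-million-base-pair default are assumptions made for the example, not part of the proposal.

```python
from collections import defaultdict
from statistics import mean

def build_blocks(gene_positions, expression, block_size_bp=5_000_000):
    """Average gene-level expression into fixed-width chromosome blocks.

    gene_positions: {gene: (chromosome, start_bp)}  (hypothetical annotation)
    expression:     {gene: normalized expression value} for one sample
    Returns {(chromosome, block_index): mean expression of genes in the block}.
    """
    members = defaultdict(list)
    for gene, value in expression.items():
        if gene not in gene_positions:
            continue  # skip probes that cannot be mapped to a chromosome
        chrom, start = gene_positions[gene]
        members[(chrom, start // block_size_bp)].append(value)
    return {block: mean(values) for block, values in members.items()}

# Toy data: G1 and G2 share the first 5 Mb block of chr1; G5 has no position.
positions = {"G1": ("chr1", 1_200_000), "G2": ("chr1", 3_900_000),
             "G3": ("chr1", 7_500_000), "G4": ("chr2", 400_000)}
sample = {"G1": 2.0, "G2": 4.0, "G3": 1.0, "G4": 5.0, "G5": 9.0}
blocks = build_blocks(positions, sample)
```

In this toy example G1 and G2 are averaged into one block, while the unmapped gene G5 is ignored. Grouping by a fixed number of genes or by smoothing would need a different grouping key.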
Selection of Blocks: The previous step reduces the dimension of the data dramatically,
but many blocks may have similar expression levels in normal and cancer cells and thus
provide no information for classification. We want to remove these blocks. In the
training set, which contains both normal and tumor cells, we conduct an independent
two-sample t-test assuming equal variance on each block. Blocks with smaller p-values
show a greater difference between the two groups and are therefore selected.
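A minimal sketch of this selection step is given below. It ranks blocks by the absolute pooled-variance t statistic instead of computing p-values; because every block shares the same sample sizes and hence the same degrees of freedom, the two rankings agree. The names `pooled_t` and `select_blocks` and the toy values are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(normal, tumor):
    """Two-sample t statistic assuming equal variance (pooled estimate).

    Assumes the pooled variance is non-zero.
    """
    n1, n2 = len(normal), len(tumor)
    pooled_var = ((n1 - 1) * variance(normal)
                  + (n2 - 1) * variance(tumor)) / (n1 + n2 - 2)
    return (mean(tumor) - mean(normal)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

def select_blocks(normal_by_block, tumor_by_block, top_k):
    """Return the top_k block names ranked by |t|, largest first."""
    scores = {b: abs(pooled_t(normal_by_block[b], tumor_by_block[b]))
              for b in normal_by_block}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy training data: only blockA differs clearly between the two groups.
normal = {"blockA": [1.0, 1.2, 0.8], "blockB": [2.0, 2.1, 1.9],
          "blockC": [3.0, 2.0, 4.0]}
tumor = {"blockA": [3.0, 3.2, 2.8], "blockB": [2.0, 2.2, 1.8],
         "blockC": [3.5, 2.5, 4.5]}
selected = select_blocks(normal, tumor, top_k=1)
```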
Construction of templates and comparison of samples to templates
In the training set, we can build two templates from the selected blocks, one for normal
cells and one for tumor cells. They are expected to show different color patterns. We
can then compare other samples, from the test set or from another dataset, to the
templates. The distance between a sample and each template is calculated as the sum,
over the selected blocks, of the absolute differences between the sample's block values
and the template's block values (strictly, this is the Manhattan distance rather than
the Euclidean distance). The sample is assigned to the template type with the smaller
distance.
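The template comparison can be sketched as below. The proposal does not say how a template is computed from the training samples; taking the per-block mean is an assumption of this sketch, as are the function names.

```python
from statistics import mean

def build_template(samples):
    """Per-block mean over training samples (each a dict block -> value).

    Using the mean as the template is an assumption of this sketch.
    """
    return {b: mean(s[b] for s in samples) for b in samples[0]}

def l1_distance(sample, template):
    """Sum of absolute block-wise differences (Manhattan distance)."""
    return sum(abs(sample[b] - template[b]) for b in template)

def classify(sample, templates):
    """Return the label of the template closest to the sample."""
    return min(templates, key=lambda label: l1_distance(sample, templates[label]))

templates = {
    "normal": build_template([{"A": 1.0, "B": 2.0}, {"A": 1.0, "B": 4.0}]),
    "tumor": build_template([{"A": 5.0, "B": 3.0}, {"A": 7.0, "B": 3.0}]),
}
label = classify({"A": 5.0, "B": 2.0}, templates)
```

Here the new sample lies at distance 5 from the normal template and 2 from the tumor template, so it is labeled as tumor.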
Evaluation:
From the test set, we can obtain the false negative and false positive rates. The overall
error rate can be estimated as the number of misclassified samples (false negatives plus
false positives) divided by the size of the test set, which will be a good indicator of
our method's performance. We can also compare our performance with other approaches such
as Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM).
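For concreteness, the three quantities could be computed as in the sketch below, which treats "tumor" as the positive class. The function name and the label strings are assumptions for the example.

```python
def error_rates(true_labels, predicted_labels, positive="tumor"):
    """False negative rate, false positive rate and overall error rate."""
    fp = fn = pos = neg = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            pos += 1
            fn += pred != positive  # a tumor sample called normal
        else:
            neg += 1
            fp += pred == positive  # a normal sample called tumor
    return {
        "false_negative_rate": fn / pos,
        "false_positive_rate": fp / neg,
        "error_rate": (fn + fp) / (pos + neg),
    }

truth = ["tumor"] * 4 + ["normal"] * 2
pred = ["tumor", "tumor", "tumor", "normal", "normal", "tumor"]
rates = error_rates(truth, pred)
```

With unbalanced classes, as in the toy data, the overall error rate (2 of 6) differs from the plain sum of the two rates (0.25 + 0.5), which is why the counts are pooled before dividing.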
Extension:
Sometimes doctors may not be certain of a patient's disease type, and there may be a
few candidate tumor types from which the sample could come. A classification into
multiple groups is then required. The procedure is similar to the two-group scenario. In
the selection step, we now use ANOVA to select the blocks that vary most across groups
and create a template for each candidate cell type. The most challenging part here is
making comparisons across datasets. Batch effects may make results from different
microarray experiments incomparable, so we need to apply suitable normalization
techniques to address this problem.
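For the multi-group selection step, a one-way ANOVA F statistic per block might be computed as sketched here. When every block has the same group sizes, ranking by F is equivalent to ranking by the ANOVA p-value. The function name and the toy values are illustrative.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for one block.

    groups: list of lists, one list of block values per candidate cell type.
    Assumes the within-group sum of squares is non-zero.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (between / (k - 1)) / (within / (n - k))

# Three candidate cell types, three samples each, for a single block.
f_value = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```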
Aim 2: Predicting survival outcomes
Selection of blocks
Using the same procedure as in Aim 1, we can rank blocks by the extent of their
variation between normal and tumor cells.
Building Models:
We propose two ways to build the models.
1. Build the Cox proportional hazards (PH) model:
$\lambda(t) = \lambda_0(t)\exp(\beta_1 T_1 + \beta_2 T_2 + \cdots + \beta_k T_k)$.
The Cox PH model is commonly used to fit right-censored survival data. $\lambda(t)$ is
the hazard of the subject at time $t$, and $\lambda_0(t)$ is the baseline hazard
function. However, the Cox PH model usually cannot handle too many parameters, so we
need to further reduce the dimension of the data. Principal Component Analysis (PCA)
can be used here to find the top few directions that capture the most variance in the
data. The projections onto these directions are treated as the $T_i$ and used to build
the model.
2. Build the additive risk model:
$\lambda(t) = \lambda_0(t) + \beta_1 T_1 + \beta_2 T_2 + \cdots + \beta_k T_k$.
Combined with the Lasso (least absolute shrinkage and selection operator) approach,
this model can handle more regression parameters [5]. The Lasso is a technique that
shrinks coefficients and sets some of them exactly to zero, thus yielding a sparse
model. Since each $T_i$ here corresponds to a block, $\beta_i$ in this model is more
interpretable than in the PCA approach.
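Two building blocks behind these models can be sketched in a few lines. The first function evaluates the Cox partial log-likelihood, the quantity maximized over the coefficients when fitting a Cox PH model, under the simplifying assumption of no tied event times. The second is the soft-thresholding operator used in many Lasso solvers, which shows how a penalty can set a coefficient exactly to zero; it is not claimed to be the fitting algorithm of reference [5]. All names are illustrative.

```python
from math import exp, log

def cox_partial_loglik(beta, covariates, times, events):
    """Cox partial log-likelihood, assuming no tied event times.

    covariates: list of feature vectors (T_1..T_k), one per subject
    times:      observed follow-up times
    events:     1 if the event was observed, 0 if right-censored
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - log(risk)
    return total

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

# Toy data: the subject with the largest covariate fails first.
X = [[1.0], [0.0], [0.0]]
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
ll_null = cox_partial_loglik([0.0], X, times, events)
ll_fit = cox_partial_loglik([1.0], X, times, events)
```

In the toy data a positive coefficient explains the ordering of failures better than a zero coefficient, so `ll_fit` exceeds `ll_null`.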
Evaluation
To evaluate the performance of the method, we can use the time-dependent ROC curve and
its AUC as the criterion. A larger AUC at time t indicates better prediction [6].
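The idea of an AUC at a fixed time t can be illustrated with the simplified sketch below. Subjects with an observed event by time t are cases, subjects followed beyond t are controls, and subjects censored before t are simply dropped. This is a deliberate simplification: the estimators in reference [6] account for such censored subjects instead of discarding them. The function name is an assumption.

```python
def naive_auc_at(t, risk_scores, times, events):
    """Simplified AUC at time t for a risk score (higher = higher risk).

    Cases: event observed at or before t. Controls: followed past t.
    Subjects censored before t are dropped (a simplification).
    """
    cases = [r for r, ti, d in zip(risk_scores, times, events)
             if ti <= t and d == 1]
    controls = [r for r, ti in zip(risk_scores, times) if ti > t]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at time t")
    # Fraction of case/control pairs ranked correctly; ties count one half.
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

risk = [0.9, 0.25, 0.3, 0.2, 0.5]
times = [1.0, 2.0, 5.0, 6.0, 1.5]
events = [1, 1, 0, 1, 0]
auc = naive_auc_at(3.0, risk, times, events)
```

At t = 3 the first two subjects are cases, the third and fourth are controls, and the fifth (censored at 1.5) is excluded; three of the four case/control pairs are ordered correctly.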
Comparison with other analyses that do not incorporate biological knowledge
Purely statistical approaches have been proposed by Hongzhe Li's group [7,8]. They
proposed combining PCA with sliced inverse regression, or using a partial Cox
regression method. We will compare our results with theirs to further assess the
performance of our method.
Conclusion and Discussion: After the above analysis, we may find out whether this
approach is better and simpler than purely statistical classification and prediction
methods. The limitations of the cytogenetic expression profile method can also be
discussed. For example, a disease caused by a single gene may not be detected by the
profile. In practice, we have to build models for each cancer cell type specifically;
there is no common formula into which we can plug numbers to obtain predicted survival
times.
References
1. Pollack JR. Chromosome instability leaves its mark. Nat Genet. 2006 Sep;38(9):973-4.
2. Frigola J, Song J, Stirzaker C, Hinshelwood RA, Peinado MA, Clark SJ. Epigenetic
remodeling in colorectal cancer results in coordinate gene suppression across an entire
chromosome band. Nat Genet. 2006;38(5):540-9.
3. Clark SJ. Action at a distance: epigenetic silencing of large chromosomal regions in
carcinogenesis. Hum Mol Genet. 2007;16 Spec No 1:R88-95.
4. Zhan F, Barlogie B, Arzoumanian V, Huang Y, et al. Gene-expression signature of
benign monoclonal gammopathy evident in multiple myeloma is linked to good
prognosis. Blood. 2007;109(4):1692-700.
5. Ma S, Huang J. Additive risk survival model with microarray data.
BMC Bioinformatics. 2007;8:192.
6. Heagerty PJ, Lumley T, Pepe MS. Time-dependent ROC curves for censored
survival data and a diagnostic marker. Biometrics. 2000 Jun;56(2):337-44.
7. Li H, Gui J. Partial Cox regression analysis for high-dimensional microarray gene
expression data. Bioinformatics. 2004 Aug 4;20 Suppl 1:i208-15.
8. Li L, Li H. Dimension reduction methods for microarrays with application to
censored survival data. Bioinformatics. 2004 Dec 12;20(18):3406-12. Epub 2004 Jul 15.