Download Using PROC PLS (Partial Least Squares) to Build a Gene Prognosis Profile

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Gene therapy wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Public health genomics wikipedia , lookup

BRCA mutation wikipedia , lookup

Pharmacogenomics wikipedia , lookup

Microevolution wikipedia , lookup

Genome (book) wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Gene expression programming wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Designer baby wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

NEDD9 wikipedia , lookup

Gene expression profiling wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Oncogenomics wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
NESUG 2007
Statistics and Data Analysis
A Gene Prognosis Profile using Partial Least Square (PLS)
Chenwei Liu, MS, Gordon Whiteley, Ph.D.
Clinical Proteomics Reference Lab, SAIC Frederick Inc., Gaithersburg, MD
Gordon R. Whiteley, Director, Advanced Technology Program (ATP), Clinical
Proteomics Reference Laboratory (CPRL), SAIC-Frederick, Inc., NCI-Frederick,
Frederick, MD
ABSTRACT
In a typical clinical study based on a microarray gene expression experiment, there are often more genes
than subject samples, and many genes are correlated. Partial least square regression, though not originally
designed for classification purposes, can be used to build a classifier to predict outcomes based on the high
dimensional correlated gene expression data. In this article, using a publicly available breast cancer study
data, we show the process of using PROC PLS in SAS to construct a metastasis risk classifier from a
training dataset. The classifier is further used to assign patients from an independent validation set into
high- and low- risk groups based on their gene expressions. The result confirms that patients in the two
predicted risk groups show significant difference in their survival outcome.
INTRODUCTION
With the maturing of DNA microarray technology, many gene expression based applications in the cancer
diagnostics field are becoming more acceptable to physicians and the regulatory agencies. On February 14
2007, the FDA approved the first gene expression based in vitro prognostics tool, MammaPrint, in US
allowing it to proceed to the market to help doctors assess young breast cancer patients’ risk of distant
metastasis. This approval will certainly encourage more clinical studies involving microarray technology,
and more gene data analysis methods and tools will be developed as a result. In a DNA microarray
experiment dataset, the number of subjects (observations) is usually much smaller than the number of genes
(variables), and many genes are correlated, i.e., they are not independent. Theoretically, statistical methods
having an independency assumption cannot be used directly in the above scenario. Many machine learning
methods such as the Support Vector Machine, C5.0 Decision Tree and Neural Networks have been used to
find gene expression patterns related to disease [1], Partial Least Square is also a good method for
exploring information hidden in the huge amount of gene data. While Ordinary least squares regression has
the goal of minimizing sample response prediction error and seeking linear functions of the predictors that
explain as much variation in each response as possible, Partial Least Square has the additional goal of
accounting for variation in the predictors under the assumption that directions in the predictor space that are
well sampled should provide better prediction for new observations when the predictors are highly
correlated [2]. Also, the Partial Least Square method has been implemented in SAS and is easy to use. It is
a simple approach to first use PROC PLS to analyze the data and see if a gene expression difference exist
among patient groups of interest in a study. Then other methods may be brought in to refine models and
build more powerful tools. In this paper we show the detail of using PROC PLS for classification
applications in SAS.
EXAMPLE: GENE EXPRESSION OF YOUNG BREAST CANCER PATIENTS
Breast cancer is the most common cancer among women, except for non-melanoma skin cancers, and it is
the second leading cause of cancer death in women, exceeded only by lung cancer. According to the
American Cancer Society, an estimated 178,480 new cases of invasive breast cancer will be diagnosed
among women in the United States this year and over 40,000 women are expected to die from the disease.
Death rates from breast cancer continue to decline, with larger decreases in women younger than 50. It is
believed that early detection through screening based on various technologies, refined prognosis, increased
awareness, as well as improved treatment have contributed to the death rate decrease. Microarray
technology is widely used to provide a better screening tool and refine prognosis to provide physicians with
accurate information and guide tailored treatment for patient. Using DNA microarray analysis on primary
breast tumors of 117 young patients, Van’t Veer et al reported a 70-gene prognosis profile associated with
the risk of early development of distant metastasis in patients with lymph-node negative breast cancer [3].
1
NESUG 2007
Statistics and Data Analysis
Van de Vijver et al further evaluated the 70-gene classifier on a series of 295 consecutive patients with
stage I or stage II breast cancer and made those 295 patients gene expression data available to the public
[4]. The log ratio column in the dataset represents the normalized expression level of a gene. The dataset
consists of log ratios of over 20000 genes for 295 patients who were diagnosed with breast cancer between
1984 and 1995 and treated by modified radical mastectomy or breast-conserving surgery, followed by
radiotherapy at the hospital of the Netherlands Cancer Institute. The median follow up time was 7.2 years
and the median survival time was 3.8 years for those 295 patients. We first used the following macro to
randomly split the 295 patients into two independent set, 70% training set and 30% validation set. The
all295 is the input data set, and 1224 is a seed for the random process. Since a random split is relatively
easy to code in SAS, the detail of the macro is omitted here.
%ran_split(all295, eventmeta,train, valid, 0.7,1224,idvar=sampleID,
stratify=yes);
BUILDING A CLASSIFER FROM THE TRAINING SET
There are many ways to select genes that might be associated with clinical outcomes. It is out of the scope
of this article to discuss and compare the difference methods of feature selection. As a demonstration about
using of PROC PLS, we used the 70 genes identified in the original paper of Van’t Veer et al as our
predictor variables. The number of factors to extract depends on the training data. However, the extraction
of excess factors must be avoided in order to prevent overfitting. The PLS procedure enabled us to choose
the number of extracted factors by cross validation. Various methods of cross validation are available,
including one-at-a-time validation, and splitting the data into blocks.
proc pls data= train cv=split ;
model eventmeta = &pls_mvars ;
run;
cv=split option sets every 7th ( default ) observation aside as validation set. This number can be optimized
according to the size of the sample set. The above pls_mvars macro variable includes all the 70 genes. The
output is as follows:
The PLS Procedure
Split-sample Validation for the Number of Extracted Factors
Number of
Extracted
Factors
Root
Mean
PRESS
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
1.079415
1.025679
1.017247
1.050493
1.073688
1.081337
1.090805
1.103451
1.113767
1.12401
1.128974
1.139266
1.143572
1.149865
1.154781
1.15846
2
NESUG 2007
Statistics and Data Analysis
Minimum root mean PRESS
Minimizing number of factors
1.0172
2
PREDICT RISK OUTCOME IN THE VALIDATION SET
It is recommended to extract 2 factors for the model based on the training data. The following code brings
in the independent validation, specifies 2 factors to extract and output the predicted value to an output data
set pred.
data all;
set train valid(rename=(eventmeta=meta));
run;
proc pls data=all nfac=2 ;
model eventmeta= &pls_mvars;
output out=pred p=p_meta;
run;
It is necessary to mention, when combining the training set with the validation set, the above “rename”
option is important, because we want the model to predict the outcome of the observations from the “valid”
set. If observations from the validation set have the eventmeta values, they will affect the PROC PLS
model building process and a new model is rebuilt and we want to avoid this.
In the output dataset, the predicted values of response p_meta range from 0 to 1, we could chose difference
cutoff value to dichotomize them, thus tuning the sensitivity and specificity. The following data set and two
PROC FREQs were then used to see how well the model works on the training set and the validation set.
data pred;
set pred(keep=SampleID EventMeta TimeSurvival Meta P_meta) ;
if P_meta>0.37 then pls=1;
else pls=0;
run;
proc freq data=pred(where=(eventmeta>.));
table eventmeta*pls/nopercent nocol;
run;
proc freq data=pred(where=(eventmeta=.));
table meta*pls/nopercent nocol;
run;
The output is shown in the following Table 1. and Table 2. As seen in Table 1, when we chose the above
0.37 cutoff value, on the training set, the sensitivity is 73.24% and specificity is 76.12. They are quite
balanced. On the validation set, the sensitivity is 63.33%, while the specificity is 75.44%. It is reasonable
that sensitivity, specificity or both are not as high as those on the training set. Increasing the sensitivity by
lowering the cutoff value results in a decrease in specificity.
SURVIVAL DIFFERENCE OF THE PREDICTED RISK GROUPS
When the full clinical data is not available but survival data is provided, the survival data can be utilized.
The classifier will divide patients based on gene data into high and low risk groups. The survival
information can then be used to confirm the risk classification. The following is an example:
proc lifetest data=pred(where=(eventmeta=.))
method=KM plots=(s) ;
time timesurvival*eventdeath(0);
strata pls;
run;
3
NESUG 2007
Statistics and Data Analysis
The Kaplan-Meier curve plot is shown in Figure 1. The predicted high risk group ( pls=1, red) has
significantly lower survival rate than the predicted low risk group(pls=0, black), as shown by a log rank test
p-value <0.0001. The survival result verifies that the gene expression difference we found from the training
set is truly related the biological difference.
Table 1. Training set result.
Table 2. Validation set result.
Table of EVENTmeta by pls
EventMeta
Table of meta by pls
pls
Frequency
Row Pct
Meta
pls
Frequency
Row Pct
0
1
Total
0
102
76.12
32
23.88
134
0
43
75.44
14
24.56
57
1
19
26.76
52
73.24
71
1
11
36.67
19
63.33
30
121
84
205
54
33
87
Total
Total
0
1 Total
Figure 1. Kaplan-Meier Curve of Patients by Risk Group (---High Risk, ---Low Risk)
1. 00
0. 75
0. 50
0. 25
0. 00
0. 0
2. 5
5. 0
7. 5
10. 0
12. 5
15. 0
TI M
Esur vi val
STRATA:
pl s=0
Censor ed pl s=0
4
pl s=1
Censor ed pl s=1
17. 5
20. 0
NESUG 2007
Statistics and Data Analysis
CONCLUSION
The SAS procedure PLS allow for building a classifier using many (greater than number of observations)
variables as input predictors. The number of factors to be extracted can be determined by cross validation.
The sensitivity and specificity of the model can be tuned by choose cutoff values when dichotomizing the
predicted response value.
REFERENCES
[1] Brown et al. Knowledge-based analysis of microarray gene expression data by using support
vector machines. Proc Natl Acad Sci U S A. 2000 January 4; 97(1): 262–267.
[2] SAS/STAT manual, http://support.sas.com/onlinedoc/912/docMainpage.jsp.
[3] Van’t Veer LJ, Dai H, Van de Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression
profiling predicts clinical outcome of breast cancer. Nature 2002; 415:530-6.
[4] Van de Vijver, M.J., He, Y.D., et al. (2002). A Gene-expression signature as a predictor of
survival in breast cancer. New England Journal of Medicine 347, 1999-2009.
ACKNOWLEDGMENTS
SAS is a Registered Trademark of the SAS Institute, Inc. of Cary, North Carolina.
MammaPrint is a Registered Trademark of Agendia BV, Amsterdam, Netherlands.
This project has been funded in whole or in part with federal funds from the National Cancer Institute,
National Institutes of Health, under contract N01-CO-12400. The content of this publication does not
necessarily reflect the views or policies of the Department of Health and Human Services, nor does
mention of trade names, commercial products, or organizations imply endorsement by the U.S.
Government.
This Research was supported [in part] by the Intramural Research Program of the NIH, National Cancer
Institute, Center for Cancer Research.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at
Chenwei Liu, [email protected]
Dr. Gordon Whiteley, [email protected]
5