Nature manuscript F08651
Supplementary Information van ‘t Veer et al.
Method of unsupervised clustering
In our two-dimensional cluster analysis, gene clustering and experiment (tumour)
clustering are performed independently without interfering between the two dimensions using an
agglomerative hierarchical clustering algorithm [J. Hartigan, Clustering Algorithms, (John Wiley
& Sons, New York, 1975)]. For gene (or experiment) clustering, the distance metric (also known
as dissimilarity measure) is defined between a pair of genes (or a pair of experiments) i and j as:
$$D_{ij} = 1 - \rho_{ij}, \qquad (1)$$
where $\rho_{ij}$ is the error-weighted correlation coefficient between two genes (or two experiments) $i$
and $j$ across all experiments (or all genes) $k = 1, \ldots, N$:
$$\rho_{ij} = \frac{\sum_{k=1}^{N} \frac{r_{ik}}{\sigma_{ik}} \, \frac{r_{jk}}{\sigma_{jk}}}{\sqrt{\sum_{k=1}^{N} \left( \frac{r_{ik}}{\sigma_{ik}} \right)^{2} \, \sum_{k=1}^{N} \left( \frac{r_{jk}}{\sigma_{jk}} \right)^{2}}}. \qquad (2)$$
In this equation, $r_{ik}$ is the logarithmic transcriptional expression level measured relative to
a baseline condition with an error $\sigma_{ik}$ for gene (or under experiment) $i$, under experiment (or for
gene) $k$. The summation runs over all experiments $k = 1, \ldots, N_{exps}$ (or all genes $k = 1, \ldots, N_{gene}$).
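As a minimal sketch of the distance metric, reading equation (2) as the uncentred correlation of log ratios that are each divided by their measurement error, the two formulas can be written in plain Python. The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math

def error_weighted_correlation(r_i, sigma_i, r_j, sigma_j):
    """Equation (2): uncentred correlation of error-scaled log ratios."""
    z_i = [r / s for r, s in zip(r_i, sigma_i)]
    z_j = [r / s for r, s in zip(r_j, sigma_j)]
    numerator = sum(a * b for a, b in zip(z_i, z_j))
    denominator = math.sqrt(sum(a * a for a in z_i) * sum(b * b for b in z_j))
    return numerator / denominator

def distance(r_i, sigma_i, r_j, sigma_j):
    """Equation (1): D_ij = 1 - rho_ij."""
    return 1.0 - error_weighted_correlation(r_i, sigma_i, r_j, sigma_j)

# Hypothetical log ratios and errors for two genes over four experiments.
gene_a = [0.8, -0.4, 0.1, 1.2]
gene_b = [0.7, -0.5, 0.3, 1.0]
errors = [0.1, 0.1, 0.5, 0.2]

d_same = distance(gene_a, errors, gene_a, errors)                    # identical profiles
d_opposite = distance(gene_a, errors, [-x for x in gene_a], errors)  # mirrored profile
d_ab = distance(gene_a, errors, gene_b, errors)                      # similar profiles
```

Under this reading the distance runs from 0 for identical profiles to 2 for exactly opposite ones, and measurements with a large error contribute little to either sum.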
The colour display encodes the logarithm of these expression changes, where red is
upregulation, green is downregulation, black represents no change, and grey marks data that are not
available. Each row in the display represents an experiment condition pair, and each column
corresponds to a gene. The rows and columns are displayed in the order given by the clustering
output trees in the two dimensions.
Not all genes are retained in the clustering analysis: only the 5,000 significant genes with
more than two-fold regulation and significance of regulation p < 0.01 in more than 5 experiments
were kept. This focuses attention on the most informative genes, yet does not bias the
clustering result towards any a priori assumptions as to mechanism.
Method of supervised classification
We calculated the correlation between the prognostic category (metastasis vs. no metastasis) and the logarithmic expression ratio across all 78 samples for each individual gene in
the 5,000 significant genes. The distribution of the correlation coefficients is shown in red in the
histogram in Figure S1 (a). Genes with a greater correlation coefficient (Pearson Coefficient) are
likely candidates for reporting prognosis, i.e., short interval to metastasis or no metastasis. In
order to evaluate the significance of each correlation coefficient with respect to a null hypothesis
that such correlation coefficient can be found by chance, we used a permutation technique to
generate Monte-Carlo data that randomises the association between gene expression data of the
78 tumour samples and their prognostic categories. The blue histogram in Figure S1 (a) shows the
distribution of correlation coefficients obtained from one Monte-Carlo trial for such a null
hypothesis. 10,000 such Monte-Carlo simulations were generated. Subsequently, genes with
a correlation coefficient either larger than 0.3 (“correlated genes”) or less than –0.3 (“anti-correlated genes”) were selected both in the real data and the Monte-Carlo data. We found that
231 genes fulfilled this criterion in the real data set, whereas the number of genes that fulfilled the
same criterion in the Monte-Carlo data is much smaller and varies from run to run. The frequency
distribution of the number of genes that satisfy this criterion for 10,000 Monte-Carlo runs is
displayed in Figure S1 (b). The probability of finding 231 genes or more with a correlation magnitude of at
least 0.3 with outcome purely by chance is estimated to be 0.3% based on 10,000 Monte-Carlo
trials as shown in Figure S1 (b). On average, 36 genes would be selected by
chance.
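The permutation test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic (100 hypothetical genes, 20 samples, the first 20 genes made informative), the function names are invented here, and only 200 permutations are run to keep the example fast.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_reporters(expr, labels, threshold=0.3):
    """Number of genes whose |correlation| with the labels exceeds threshold."""
    return sum(abs(pearson(gene, labels)) > threshold for gene in expr)

def permutation_test(expr, labels, n_perm=1000, threshold=0.3, seed=0):
    """Return (observed count, mean null count, p-value of the observed count)."""
    rng = random.Random(seed)
    observed = count_reporters(expr, labels, threshold)
    shuffled = list(labels)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between expression and outcome
        null_counts.append(count_reporters(expr, shuffled, threshold))
    p_value = sum(c >= observed for c in null_counts) / n_perm
    return observed, sum(null_counts) / n_perm, p_value

# Synthetic data: expr[g][s] is the log ratio of gene g in sample s.
data_rng = random.Random(1)
labels = [1] * 10 + [0] * 10
expr = [[data_rng.gauss(0, 1) + (2.0 * l if g < 20 else 0.0) for l in labels]
        for g in range(100)]

observed, null_mean, p_value = permutation_test(expr, labels, n_perm=200)
```

Here `observed` plays the role of the 231 genes found in the real data, `null_mean` the role of the 36 genes expected by chance, and `p_value` the role of the 0.3% estimate.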
The significance for each of the 231 genes as a prognostic reporter was evaluated by a
metric similar to the “Fisher” statistic. The “Fisher” metric for each gene is plotted in Figure S2
(a). The confidence level of each gene in the candidate list was estimated with respect to a null
hypothesis derived from the actual data set using the random permutation technique. The p-value
estimation from this distribution for each gene in the candidate list is shown in Figure S2 (b).
We used the method of “leave-one-out” for cross validation. Specifically, at one time, we
took one sample out and used the remaining 77 samples to define a classifier based on the set of
231 discriminating genes. Then we predicted the outcome of the one sample we left out in the
first place. The prediction of the left out sample is based on its correlation coefficient to the
“good prognosis” template and “poor prognosis” template, where the “good” and “poor”
templates are the average expression patterns of clinically “good” and “poor” samples within the
77 samples. The correlation coefficient is calculated using the selected reporter genes. We
repeated this procedure until each of the 78 samples was left out once. We finally counted in how
many cases the predictions were correct and in how many cases the predictions were incorrect.
The performance of the classifier is measured by the error rates of type 1 (false negative) and type
2 (false positive) for this selected gene set. We repeated the above performance evaluation
procedure based on the leave-one-out cross validation when we added 5 more marker genes each
time from the top of the candidate list until all the 231 genes were used as discriminating genes.
The performance as a function of the number of marker genes is shown in Figure S3. The numbers
of type 1 and type 2 errors change dramatically with the number of marker
genes employed. The combined error rate reaches its minimum when we use 70 marker genes
from the top of our candidate list. Therefore, we consider this set of 70 genes as the optimal set of
marker genes that can be used to classify patients in the “sporadic” group into two prognostic
subgroups: a “good prognosis” group and a “poor prognosis” group. It is interesting to point out that
the accuracy in predicting the prognosis of “sporadic” breast cancer patients is quite low when we
use just a few marker genes. The accuracy improves with the increasing number of marker genes
until the optimal number of marker genes is reached (~70 genes). However, beyond the optimal
number of marker genes, the accuracy becomes worse, due to the introduction of noise.
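The leave-one-out evaluation with “good” and “poor” templates can be sketched as below for a fixed set of reporter genes. Two points are assumptions made for the illustration: the left-out sample is assigned to whichever template it correlates with more strongly (the text does not spell out the decision rule at this point), and the data are synthetic. Label 1 stands for “poor prognosis” and 0 for “good prognosis”.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_profile(samples):
    """Template: average expression of each gene over the given samples."""
    return [sum(column) / len(column) for column in zip(*samples)]

def leave_one_out(samples, labels):
    """Return (false negatives, false positives) over all left-out samples."""
    false_neg = false_pos = 0
    for i, (sample, truth) in enumerate(zip(samples, labels)):
        rest = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        good = mean_profile([s for s, l in rest if l == 0])
        poor = mean_profile([s for s, l in rest if l == 1])
        # Assumed rule: pick the template with the higher correlation.
        predicted = 1 if pearson(sample, poor) > pearson(sample, good) else 0
        if truth == 1 and predicted == 0:
            false_neg += 1  # type 1 error in the document's terminology
        elif truth == 0 and predicted == 1:
            false_pos += 1  # type 2 error in the document's terminology
    return false_neg, false_pos

# Synthetic data: 24 samples, 20 reporter genes with opposite patterns per class.
rng = random.Random(0)
pattern = [1.0] * 10 + [-1.0] * 10
labels = [1] * 12 + [0] * 12
samples = [[(p if l == 1 else -p) + rng.gauss(0, 0.5) for p in pattern]
           for l in labels]

false_neg, false_pos = leave_one_out(samples, labels)
```

Repeating this call on the top 5, 10, 15, ... genes of a ranked candidate list gives an error curve of the kind shown in Figure S3.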
Performance cross-validation
Since the 231 reporters were defined using all 78 samples, the cross-validation in the
previous section may have the potential of over-fitting through an information leak. To address
this problem in the cross-validation, we constructed a cross-validation procedure that has no
information leak, and involves no optimization. The procedure is as follows: (1) leave one sample
out, (2) define reporters based on the remaining 77 samples among the set of ~5000 significant
genes, (3) use the reporters to predict the outcome of the one sample that was left out in step (1),
(4) repeat steps (1)-(3) exhaustively for all 78 samples.
The above procedure essentially created 78 classifiers based on 78 sets of reporters. The
reporters in each “leave-one-out” case are defined as genes with a |correlation| > 0.3 in the
remaining 77 samples, where the correlation is calculated between the gene expression and the
outcome of 77 samples. In this process, the sample left out is not involved in the reporter
selection; therefore, the procedure does not have any information leak. The prediction is based on
the correlation coefficient of the reporter expression pattern for the left out sample with the
templates defined by the remaining samples.
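Steps (1) to (4) can be sketched as follows, with the reporter selection moved inside the loop so that the left-out sample never influences which genes are used. The synthetic data, the helper names, and the rule of assigning the class of the better-correlated template are assumptions for illustration only.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_reporters(samples, labels, threshold=0.3):
    """Indices of genes whose |correlation| with outcome exceeds threshold."""
    return [g for g, values in enumerate(zip(*samples))
            if abs(pearson(values, labels)) > threshold]

def template(samples, genes):
    """Mean expression over the given samples, restricted to the given genes."""
    return [sum(s[g] for s in samples) / len(samples) for g in genes]

def leak_free_loo(samples, labels, threshold=0.3):
    """Return (false negatives, false positives, reporter count per fold)."""
    false_neg = false_pos = 0
    reporter_counts = []
    for i in range(len(samples)):                                   # step (1)
        train = [s for j, s in enumerate(samples) if j != i]
        train_labels = [l for j, l in enumerate(labels) if j != i]
        genes = select_reporters(train, train_labels, threshold)    # step (2)
        reporter_counts.append(len(genes))
        good = template([s for s, l in zip(train, train_labels) if l == 0], genes)
        poor = template([s for s, l in zip(train, train_labels) if l == 1], genes)
        left_out = [samples[i][g] for g in genes]
        predicted = 1 if pearson(left_out, poor) > pearson(left_out, good) else 0  # step (3)
        if labels[i] == 1 and predicted == 0:
            false_neg += 1
        elif labels[i] == 0 and predicted == 1:
            false_pos += 1
    return false_neg, false_pos, reporter_counts                    # step (4) is the loop

# Synthetic data: 50 genes, of which the first 10 carry class information.
rng = random.Random(0)
pattern = [1.0] * 5 + [-1.0] * 5 + [0.0] * 40
labels = [1] * 12 + [0] * 12
samples = [[(p if l == 1 else -p) + rng.gauss(0, 0.5) for p in pattern]
           for l in labels]

false_neg, false_pos, reporter_counts = leak_free_loo(samples, labels)
```

Each fold builds its own reporter list, so `reporter_counts` varies from fold to fold in the same way as the number of reporters varies across the 78 classifiers.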
The average number of reporters from these 78 classifiers is 238+/-23. Figure S4 presents
the frequency of the original 231 genes and the union of other genes found in these 78 classifiers.
We found that the vast majority of the original 231 reporter genes are shared by the 78
classifiers. In particular, 180 of the original 231 reporters appeared 74 times or more in those 78
classifiers.
Using this cross-validation process, at the threshold range which mis-classifies 3 “poor
prognosis patients” as “good prognosis patients”, 17 to 19 (average 18) “good prognosis patients”
are mis-classified as “poor” (Figure S5). This method results in an odds ratio of 15 (95% CI 4-56,
Fisher’s exact test p-value of 4.1E-6). This should be compared to the situation where the
reporter genes are fixed to the original 231 genes: then, 15 to 16 “good prognosis patients” are
mis-classified as “poor” (Figure S6), which gives odds ratios from 18 to 20. The differences in the
number of misclassifications and odds ratios represent a possible information leak.
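The quoted odds ratios can be checked with a short calculation from the misclassification counts. The class sizes used here, 34 “poor prognosis” and 44 “good prognosis” samples among the 78, are an assumption that is not stated in this section; with them, the sketch gives values that round to the odds ratios quoted above.

```python
def odds_ratio(n_poor, n_good, false_neg, false_pos):
    """Odds ratio of the 2x2 table of predicted versus actual prognosis."""
    true_pos = n_poor - false_neg   # poor prognosis, predicted poor
    true_neg = n_good - false_pos   # good prognosis, predicted good
    return (true_pos * true_neg) / (false_neg * false_pos)

# Assumed class sizes (not given in this section): 34 poor, 44 good.
N_POOR, N_GOOD = 34, 44

or_leak_free = odds_ratio(N_POOR, N_GOOD, false_neg=3, false_pos=18)
or_fixed_low = odds_ratio(N_POOR, N_GOOD, false_neg=3, false_pos=16)
or_fixed_high = odds_ratio(N_POOR, N_GOOD, false_neg=3, false_pos=15)
```

The first value corresponds to the cross-validation without information leak; the other two correspond to the reporter set fixed at the original 231 genes.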
It should be pointed out that this cross-validation process, apart from having no
information leak in the reporter selection, also did not optimize the number of reporter genes as
was done in the definition of our classifier. Hence the odds ratio of 15 obtained from this process
may be on the conservative side.
Multivariate logistic fit
To evaluate the added prognostic value of the microarray gene expression profiling in
addition to the clinical parameters, the microarray parameter and the clinical parameters are
combined to form a ‘complete’ multivariate model by the logistic regression (see, for example,
“S-PLUS 2000 Guide to Statistics, Vol.1”, P.301). The clinical parameters used for the current
modelling are the tumour grade, oestrogen receptor (ER) status, progesterone receptor (PR) status,
tumour size, patient age, and angioinvasion. To avoid quoting an optimistic number from the
microarray data, the correlation coefficient to the “good prognosis” templates from the
conservative cross-validation (previous section of this material) is used in the logistic regression.
In order to calculate the odds ratio from these multivariate logistic regression coefficients,
all the input parameters, including that of the microarray, were converted to a binary format (0 or
1) as follows:
Parameter                 0          1
grade                     1,2        3
ER                        <=10       >10
PR                        <=10       >10
size (mm)                 <=20       >20
age                       <=40       >40
angioinvasion             0          1
Microarray correlation    <=0.54     >0.54
The odds ratio for each parameter was calculated using the multivariate logistic regression
coefficient: OR = exp( logistic coefficient), and the 95% confidence interval as: CI = exp( logistic
coefficient +/- 1.96 * std error). The results are displayed in the following table:
Parameter                 Logistic coefficient   Std. error   Odds ratio   95% CI
grade                     -0.08                  0.79         1.1          [0.2, 5.1]
ER                        0.5                    0.94         1.7          [0.3, 10.4]
PR                        -0.75                  0.93         2.1          [0.3, 13.1]
size (mm)                 -1.26                  0.66         3.5          [1.0, 12.8]
age                       1.4                    0.79         4            [0.9, 19.1]
angioinvasion             -1.55                  0.74         4.7          [1.1, 20.1]
Microarray correlation    2.87                   0.85         17.6         [3.3, 93.7]
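The conversion from a logistic coefficient to an odds ratio and its 95% confidence interval can be sketched as follows. One detail is an inference from the numbers rather than something the text states: the tabulated odds ratios are all above 1, also for negative coefficients, which suggests that the magnitude of the coefficient was exponentiated. The function name is hypothetical.

```python
import math

def odds_ratio_with_ci(coefficient, std_error, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logistic coefficient.

    Assumption: the magnitude of the coefficient is used, so the odds ratio
    is always expressed as a value of at least 1.
    """
    b = abs(coefficient)
    return (math.exp(b),
            math.exp(b - z * std_error),
            math.exp(b + z * std_error))

# Values taken from the table: microarray correlation and tumour size rows.
micro_or, micro_lo, micro_hi = odds_ratio_with_ci(2.87, 0.85)
size_or, size_lo, size_hi = odds_ratio_with_ci(-1.26, 0.66)
```

With the rounded coefficients printed in the table, the results agree with the tabulated odds ratios and lower bounds to one decimal place; the upper bounds differ slightly, as expected when the inputs are rounded to two decimals.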
Legends to the Figures in Supplementary Information.
Figure S1. (a) Histogram of the correlation coefficients of the gene expression ratio of each
significant gene with the prognostic category (metastases within 5 years or metastases free for > 5
years), shown in red. The blue distribution is obtained from one Monte-Carlo run where the
association of the gene expression and the prognostic category were randomised. The magnitude
of correlation or anti-correlation of 231 genes is greater than 0.3. (b) Frequency distribution of the
number of genes that satisfy the same criterion for 10,000 Monte-Carlo runs. The mean is 36 and
p(n>231) is 0.3% and p(n>231/2) = 3.3%.
Figure S2. (a) The Fisher metric for each gene on the discriminating reporter candidate list. (b)
The p-value obtained from the Monte-Carlo runs indicates the probability that the gene is selected
as a discriminating gene by chance. The gene orders in (a) and (b) are identical.
Figure S3. The classification error rates for type 1 and type 2 as a function of the number of
discriminating genes used in the classifier. The Y axis is the number of tumours misclassified; the X
axis is the number of reporters. Note that the optimal combined error rate is reached with 70
discriminating marker genes. The red stars represent type 1 errors (false negative in predicting
metastasis) and blue circles are type 2 (false positive) errors.
Figure S4. The frequency distribution of the original 231 reporter genes and the union of other
genes found in the 78 no-information-leak cross-validation classifiers. Each cross-validation
classifier selects its own reporter genes without the left out sample.
Figure S5. The error rates (same definition as Figure S3) versus the threshold in the correlation
coefficient to the “good prognosis” template, as determined in the no-information-leak cross-validation. At the thresholds where the number of false negatives is 3, the number of false
positives is 17 to 19 (average 18).
Figure S6. The error rates (same definition as Figure S3) versus the threshold in the correlation
coefficient to the “good prognosis” template, as determined in the cross-validation using the 231
reporter genes selected from all 78 samples. At the thresholds where the number of false
negatives is 3, the number of false positives is 15 to 16.
Table S1. List of 117 patients with clinical information.
The header for each column is self-explanatory.
Table S2. List of 231 prognosis reporter genes including the optimal set of 70 markers. The
header for each column is self-explanatory. The 70 optimal marker genes are the 70 genes with
the highest absolute correlation coefficients.
Table S3. List of 2460 ER status reporter genes including the optimal set of 550 markers. The
550 optimal marker genes are the 550 genes with the highest absolute correlation coefficients.
The header for each column is self-explanatory.
Table S4. List of 430 BRCA1 reporter genes including the optimal set of 100 markers. The 100
optimal marker genes are the 100 genes with the highest absolute correlation coefficients. The
header for each column is self-explanatory.