Download Supplementary Materials - Clinical Cancer Research

Supplemental Table 1: (A) Characteristics of the antibodies used for immunohistochemical
measures. (B) Clinical characteristics of the CIT cohort of patients.
Supplemental Table 2A: Gene expression related statistics in the CIT cohort.
For each gene (rows) several related statistics calculated on the CIT cohort are given (columns): log2
geometric mean of expression measures (prefix GM.) within each histological subtype, log2 gene
expression fold change (prefix FC.) between two histological subtypes, moderated t-test (limma) p-values and q-values, as well as AUC, derived from comparisons between basaloid and non-basaloid
tumors. Columns AK to AV: The expression of each gene was dichotomized using a rule “if expression
operator cutoff then 1 else 0” (operator being either < or >), in order to discriminate basaloid and
non-basaloid tumors, yielding related statistics (TP, TN, FP, FN, sensitivity, specificity, PPV, NPV,
accuracy, Fisher test). NB: Genes yielding a Fisher test p-value > 0.001 were filtered out.
Abbreviations: BAS= basaloid, pBAS= pure basaloid, mBAS= mixed basaloid, SCC= non-basaloid SCC,
SCCwd= well differentiated SCC, SCCpd= poorly differentiated, vs= versus, TP= number of true
positives, TN= number of true negatives, FP= number of false positives, FN= number of false
negatives, AUC= Area Under (ROC) Curve, PPV= positive predictive value, NPV= negative predictive
value.
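The dichotomization rule and the statistics derived from it can be illustrated with a minimal sketch. It is written in Python for illustration only (the original analyses were run in R); the function name, argument names and the toy values used to call it are hypothetical and do not come from the CIT data.

```python
import operator


def dichotomize_stats(expression, is_basaloid, cutoff, op=">"):
    """Apply the rule 'if expression <op> cutoff then 1 else 0' and
    compare the resulting calls with the known basaloid status."""
    compare = operator.gt if op == ">" else operator.lt
    calls = [1 if compare(x, cutoff) else 0 for x in expression]

    tp = sum(1 for c, y in zip(calls, is_basaloid) if c == 1 and y)
    fp = sum(1 for c, y in zip(calls, is_basaloid) if c == 1 and not y)
    tn = sum(1 for c, y in zip(calls, is_basaloid) if c == 0 and not y)
    fn = sum(1 for c, y in zip(calls, is_basaloid) if c == 0 and y)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "TP": tp, "TN": tn, "FP": fp, "FN": fn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

The Fisher test reported in the table is not reproduced in this sketch; it would be computed on the same 2x2 table (TP, FP, FN, TN).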
Supplemental Table 2B: Immunohistochemistry quick score measures of SOX4
and IVL and prediction of the basaloid status
Immunohistochemistry measures of SOX4 and IVL are given as quick scores (QS) for the 66 samples
of the training series and the 35 samples of the validation series. The prediction of the basaloid
status based on these measures is shown (column IHC.prediction).
Supplemental Table 3: Gene signatures and pathways deregulation analysis in
histological subtypes.
Matrix giving the relative index of deregulation of a series of signatures and pathways in each
histological subgroup (see column name prefix: basaloid = BAS; pure basaloid= pBAS; mixed
basaloid= mBAS; non-basaloid SCC= SCC; well differentiated SCC= SCCwd; poorly differentiated SCC=
SCCpd). Rows correspond to gene signatures / pathways (n=14725). Columns correspond to
deregulation index* based on either (i) all measured genes, (ii) underexpressed** genes (column
name suffix: _DOWN), (iii) overexpressed genes*** (column name suffix: _UP). *: the deregulation
index corresponds to a rank across all pathways/signatures (based on four methods – see Methods):
the lower the rank, the more underlying genes are differentially expressed. Of note, pathways with
high ranks (e.g. > 10000) may show differential expression in the group of interest as compared to
other groups, but at a lower significance than pathways with small ranks (e.g. < 1000). **:
underexpressed genes are here defined by a mean expression in the group of interest lower than the
mean expression in the rest of the samples, without referring to any statistical test. ***:
overexpressed genes are here defined by a mean expression in the group of interest higher than the
mean expression in the rest of the samples, without any statistical test. The pathways included in
Figure 2 are shown in Column ‘Figure 2’.
Supplemental Table 4: DNA copy number aberrations related statistics in the
CIT cohort.
(A) Frequencies and Fisher test p-values associated with each GISTIC significant region of DNA copy
number gains and losses in basaloid tumors compared to SCC tumors. (B) Frequencies and Fisher test
p-values associated with DNA copy number gains and losses of all genes as estimated by GISTIC in
basaloid tumors compared to SCC tumors. Genes for which the minimal Fisher test p-value was
greater than 0.005 were filtered out. (C) Frequencies and Fisher test p-values associated with DNA copy
number gains and losses of all markers (probes) in basaloid tumors compared to non-basaloid SCC
tumors. The last two columns correspond to global Fisher tests comparing all four subtypes. Genes
for which the minimal Fisher test p-value was greater than 0.001 were filtered out.
Supplemental Table 5: 139-gene centroids used for prediction of the CIT
molecular subtypes.
Supplemental Table 6: Molecular subtype prediction and survival data for
each sample of the CIT and 8 validation datasets.
(A) For all samples (rows) from the CIT cohort and the 8 public datasets, several variables (columns)
are given: “CIT molecular cluster” = cluster of the consensus partition (K=4) for the CIT dataset; “CIT
subtype predicted” = molecular subtype as established with our 139-gene predictor; “Wilkerson
subtype predicted” = predicted molecular subtype using the predictor from Wilkerson et al.; “Stage” =
histological staging; “OS event” and “OS delay” = overall survival data; “Histology” = subhistology
(CIT dataset only); “Periendoalveolar contingent” = presence of an alveolar cells contingent (CIT
dataset only).
(B) Contingency tables: (i) between CIT clusters and subhistologies (ii) between CIT clusters and
endoalveolar feature (iii) between CIT clusters and CIT predicted subtypes (using the 139-gene
predictor) in the CIT discovery cohort (iv) between Wilkerson predicted subtypes and
histology subtypes (v) between CIT predicted subtypes (using the 139-gene predictor) and series
(discovery + 8 public series) (vi) between CIT predicted subtypes and Wilkerson predicted subtypes in
the validation cohorts.
NB: tables (i) to (iv) are based on the CIT dataset, tables (v) and (vi) are obtained using all datasets
(CIT dataset + 8 public datasets).
Supplemental Table 7: Gene signatures and pathways enrichment analysis in
molecular subtypes in the CIT cohort and 4 validation datasets.
Matrix giving the relative index of deregulation of a series of signatures and pathways in each CIT
molecular subtype, for five series (CIT, Wilkerson, Lee, Raponi, Roepman). NB: we discarded the
series where some CIT molecular subtypes had too few corresponding samples (n<5). Rows
correspond to gene signatures / pathways (n=14725). Columns correspond to deregulation index*
based on either (i) all measured genes, (ii) underexpressed** genes (column name suffix: _DOWN),
(iii) overexpressed genes*** (column name suffix: _UP). Each analysis was repeated independently
in the five series. *: the deregulation index corresponds to a rank across all pathways/signatures
(based on four methods – see Methods): the lower the rank, the more underlying genes are
differentially expressed. Of note, pathways with high ranks (e.g. > 10000) may show differential
expression in the group of interest as compared to other groups, but at a lower significance than
pathways with small ranks (e.g. < 1000). **: underexpressed genes are here defined by a mean
expression in the group of interest lower than the mean expression in the rest of the samples,
without referring to any statistical test. ***: overexpressed genes are here defined by a mean
expression in the group of interest higher than the mean expression in the rest of the samples,
without any statistical test. The pathways included in Figure 6 are shown in Column ‘Figure 6’.
Supplemental Information
RNA and DNA extraction
Total RNA was extracted with TRIzol reagent (Invitrogen), according to the manufacturer’s
instructions, and DNA was extracted with phenol-chloroform by the use of standard procedures. The
integrity of the extracts was verified on an Agilent 2100 Bioanalyser (Agilent Technologies).
TP53 sequencing
Mutations were analysed using genomic DNA isolated from paraffin-embedded archived tissue
sections. Briefly, DNA was extracted by standard QIAamp DNA extraction Kit (QIAGEN, Hilden,
Germany). TP53 mutations (Exons 4 to 8) were detected by direct sequencing with BigDye®
Terminator v1.1 Cycle Sequencing Kit (Applied Biosystems) using primers and conditions described in
the IARC TP53 standard sequencing protocol (http://p53.iarc.fr/ProtocolsAndTools.aspx). Each PCR
product was bidirectionally sequenced and analyzed using a 16-capillary automated sequencer (ABI
PRISM® 3100 Genetic Analyzer, Applied Biosystems). All sequence variations were confirmed by
running a second, independent PCR and sequencing analysis of the same sample.
Microarray procedures
Microarray analyses were performed with 3 µg of total RNA as starting material and 10 µg of cRNA
per hybridization (GeneChip Fluidics Station 400; Affymetrix, Santa Clara, CA). The total RNAs were
amplified and labeled following the one-cycle target labeling protocol (http://www.affymetrix.com).
The labeled cRNAs were hybridized to HG-U133 plus 2.0 Affymetrix GeneChip arrays (Affymetrix).
The chips were scanned with an Affymetrix GeneChip Scanner 3000 and subsequent images analyzed
by the use of GCOS 1.4 (Affymetrix). Data were normalized using the Robust Multi-array Average (RMA)
method, implemented in the R package affy [Irizarry 2003], and probe set log2 intensities were then
averaged per HUGO Gene Symbol. Data were deposited to ArrayExpress under accession E-MTAB-2435.
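The per-gene averaging step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used for the study (which relied on R): the function name, the probe identifiers and the values used to call it are hypothetical, and probe sets without a gene symbol are simply skipped here, which is an assumption.

```python
from collections import defaultdict


def average_per_gene(probe_values, probe_to_symbol):
    """Average log2 intensities of all probe sets mapping to the same gene symbol.

    probe_values:    {probe_id: [log2 intensity for each sample]}
    probe_to_symbol: {probe_id: HUGO gene symbol}
    Returns {symbol: [mean log2 intensity for each sample]}.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue  # assumption: unannotated probe sets are dropped
        grouped[symbol].append(values)

    # zip(*rows) walks the samples column by column across the probe sets.
    return {
        symbol: [sum(column) / len(column) for column in zip(*rows)]
        for symbol, rows in grouped.items()
    }
```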
Public gene expression datasets
Eight public expression profiling datasets of lung SCC were collected from GEO
(http://www.ncbi.nlm.nih.gov/geo/ ; GSE3141, GSE14814, GSE17710, GSE3593, GSE4573, GSE11969,
GSE8894) or from the supplemental data of the related article ([Roepman 2009]). They corresponded to 533
expression profiles published in [Bild 2006] (hgu133plus2 arrays), [Zhu 2010] (hgu133a arrays),
[Wilkerson 2010] (Agilent44k arrays), [Potti 2006] (hgu133a arrays), [Raponi 2006] (hgu133a arrays),
[Takeuchi 2006] (Agilent44k arrays), [Roepman 2009] (Agilent44k arrays) and [Lee 2008]
(hgu133plus2 arrays). Affymetrix series were normalized with the RMA method; for the other series the
normalized data provided by the authors were used directly. Probes (non-Affymetrix chips) / probe sets (Affymetrix
chips) were then averaged per HUGO Gene Symbol.
SNP array hybridization and preprocessing
Hybridization was performed by Integragen (Evry, France), according to the manufacturer’s protocols
(Illumina, San Diego, CA). Raw fluorescent signals were imported and normalized into Illumina
BeadStudio software as previously described [Peiffer 2006] to obtain the log R ratio (LRR) and B
Allele Frequency (BAF) for each SNP. A supplemental normalization procedure, tQN [Staaf 2008(b)], was
applied to correct dye bias. Genomic profiles were then segmented by applying the circular binary
segmentation algorithm (DNAcopy package, Bioconductor) [Venkatraman 2007] to LRR and BAF data
separately, as previously described [Staaf 2008(a)], [Popova 2009]. The absolute copy numbers and
allelic status were then determined per segment using the Genome Alteration Print (GAP) method
[Popova 2009]; this method was also used to estimate the proportion of diploid cells within each tumor
sample. To identify regions of interest, GISTIC2.0 software [Mermel 2011]
(http://genepattern.broadinstitute.org/) was used with default parameters and a confidence level
set at 0.9. Sex chromosomes were excluded from the analysis. Regions with a q-value < 0.25 were
considered significant. Data have been deposited to ArrayExpress under accession E-MTAB-2435.
Quick score determination
The Quick Score is obtained from immunohistochemical measures by multiplying the percentage of
positive cells (P) by the intensity (I): QS= P x I, yielding a value ranging from 0 to 300.
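As a worked illustration of the formula, a minimal Python sketch is given below. The function name is hypothetical, and the 0 to 3 intensity scale is an assumption inferred from the stated 0 to 300 range of the score.

```python
def quick_score(percent_positive, intensity):
    """Quick Score QS = P x I for one sample."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("P must be a percentage between 0 and 100")
    if not 0 <= intensity <= 3:
        # Assumption: intensity is graded from 0 to 3, so that QS spans 0-300.
        raise ValueError("I is assumed to lie on a 0-3 scale")
    return percent_positive * intensity
```

For instance, 80% of positive cells with an intensity of 2 gives QS = 160.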
Statistical analyses
Statistical analyses were performed using R software V2.14.0 (http://www.R-project.org) including R
packages from Bioconductor version 2.9 [www.bioconductor.org] and private R packages.
Differential expression analysis
To identify differential expression, we used moderated t-tests (limma R package). False
Discovery Rate (FDR) estimates were obtained using the Benjamini & Hochberg method. The H1
proportion was estimated using the Storey method.
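The Benjamini & Hochberg adjustment can be sketched as below. This is a plain Python illustration of the standard step-up procedure, not the R code used in the study; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini & Hochberg adjusted p-values (q-values),
    in the same order as the input p-values."""
    m = len(pvalues)
    # Indices of the p-values sorted in increasing order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

The estimation of the H1 proportion (Storey method) is not covered by this sketch.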
Pathways enrichment analysis
Pathway analysis based on sample group comparisons was performed using a consensus
of four methods (GSA [Efron 2002]: R package GSA; globaltest [Goeman 2004]: R package
globaltest; SAM-GS [Dinu 2007]: original R code; Tukey [Goeman 2007]: original R code).
Given two sample groups to be compared and a pathway of interest, each method yields
a p-value: the lower the p-value, the more the genes from the pathway are differentially
expressed between the sample groups. To aggregate the results of the four methods, given
a list of pathways, we first sort the list of pathways by p-value for each method; we then calculate for
each pathway the mean rank across the four methods: the final order is based on this mean
rank. Pathways and related genes were retrieved from the literature [Ben-Porath 2008] and
several repositories: MSigDB (www.broadinstitute.org/gsea/msigdb/), Gene Ontology
(www.geneontology.org), KEGG (www.genome.jp/kegg/pathway.html), Biocarta
(http://www.biocarta.com/genes/allpathways.asp), Stanford Microarray Database
(http://genome-www.stanford.edu/microarray/) and NCI / Nature curated pathways
(http://pid.nci.nih.gov/browse_pathways.shtml).
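The mean-rank aggregation described above can be sketched as follows. This is a minimal Python illustration, assuming every method reports a p-value for every pathway; the function name is hypothetical and ties are broken arbitrarily (by input order), which may differ from the original R implementation.

```python
def mean_rank_order(pvalues_by_method):
    """Aggregate per-method p-values into a consensus pathway order.

    pvalues_by_method: {method name: {pathway name: p-value}}
    Returns (pathways sorted by increasing mean rank, {pathway: mean rank}).
    """
    pathways = list(next(iter(pvalues_by_method.values())))
    rank_sums = {pathway: 0.0 for pathway in pathways}

    for pvals in pvalues_by_method.values():
        # Rank 1 goes to the pathway with the lowest p-value for this method.
        ordered = sorted(pathways, key=lambda pathway: pvals[pathway])
        for rank, pathway in enumerate(ordered, start=1):
            rank_sums[pathway] += rank

    n_methods = len(pvalues_by_method)
    mean_rank = {pathway: total / n_methods for pathway, total in rank_sums.items()}
    final_order = sorted(pathways, key=lambda pathway: mean_rank[pathway])
    return final_order, mean_rank
```

The position of a pathway in the final order corresponds to the deregulation index reported in Supplemental Tables 3 and 7: the lower the rank, the stronger the evidence of differential expression.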
Unsupervised clustering analysis
Consensus clustering analysis was performed as previously reported [Cairo 2008], yielding
consensus dendrograms and consensus partitions of mRNA expression profiles of tumor
samples. Briefly, 24 dendrograms were obtained by hierarchical clustering using 3 different
linkages (Ward, Complete, Average) and 8 lists of genes corresponding to the top 40% to
0.5% (8 different thresholds) most varying genes (assessed via a robust coefficient of
variation). For a given number k of clusters, each of the 24 dendrograms yields a raw
partition in k clusters; the 24 raw partitions in k clusters are then used to build a consensus
dendrogram which itself is cut in k clusters, thus yielding a consensus partition in k clusters.
To choose the final number k of clusters we calculated the gap statistic [Tibshirani 2001] on
consensus partitions in k = 2 to 8 clusters, and selected k=4 as it yielded the best gap statistic.
Principal component analysis (PCA) was performed using the stats R package.
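One common way to combine raw partitions into a consensus is to measure how often two samples are clustered together. The sketch below illustrates that idea in Python; it is an assumption about the general approach, not the procedure of [Cairo 2008], whose exact construction of the consensus dendrogram is not described here. The function name is hypothetical.

```python
def coassociation_distance(partitions):
    """Distance between samples derived from a set of raw partitions.

    partitions: list of partitions, each a list of cluster labels
                (one label per sample, same sample order everywhere).
    Returns an n x n matrix where entry [i][j] is 1 minus the fraction
    of partitions in which samples i and j share a cluster.
    """
    n_samples = len(partitions[0])
    n_partitions = len(partitions)
    distance = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            together = sum(1 for labels in partitions if labels[i] == labels[j])
            distance[i][j] = 1.0 - together / n_partitions
    return distance
```

Under this assumption, the resulting matrix would be given to a hierarchical clustering routine, and the resulting dendrogram cut in k clusters to obtain the consensus partition.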
Prediction of CIT subtypes (microarray-based): In order to classify all publicly available
microarray SCC series according to our four subtypes, we built a centroid-based classifier
using the diagonal linear discriminant analysis (DLDA) algorithm (see definition in Sup_Table_5);
this predictor was trained using the CIT series (n=93 SCC). The AUC criterion (Area Under
Curve in a ROC curve, PresenceAbsence R package) was used to identify genes discriminating
sample groups.
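A centroid-based DLDA rule can be sketched as below. This is a textbook form of DLDA written in Python under stated assumptions (equal class priors, one pooled within-class variance per gene); the exact definition used for the 139-gene predictor is the one given in Sup_Table_5, which is not reproduced here. Function names are hypothetical.

```python
def train_dlda(samples, labels):
    """Compute class centroids and pooled per-gene variances.

    samples: list of expression vectors (one list of gene values per sample)
    labels:  list of class labels, one per sample
    """
    classes = sorted(set(labels))
    n_genes = len(samples[0])

    centroids = {}
    for cls in classes:
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        centroids[cls] = [sum(column) / len(rows) for column in zip(*rows)]

    # Pooled within-class sum of squares, gene by gene.
    pooled = [0.0] * n_genes
    for s, lab in zip(samples, labels):
        for g in range(n_genes):
            pooled[g] += (s[g] - centroids[lab][g]) ** 2
    dof = len(samples) - len(classes)
    variances = [value / dof for value in pooled]
    return centroids, variances


def predict_dlda(sample, centroids, variances):
    """Assign the class whose centroid is closest, each gene being
    weighted by the inverse of its pooled variance."""
    def score(cls):
        return sum(
            (x - mu) ** 2 / var
            for x, mu, var in zip(sample, centroids[cls], variances)
        )
    return min(centroids, key=score)
```

Genes with zero pooled variance would need to be removed beforehand, since the sketch divides by the variance.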
Prediction of Wilkerson subtypes (microarray-based): In order to classify all samples
according to Wilkerson's subtypes, we used the published centroid-based classifier
[Wilkerson 2010].
Survival analysis
Survival curves were obtained using Kaplan-Meier estimates and differences between
survival curves were assessed using the log-rank test (R package survival).
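For reference, the Kaplan-Meier (product-limit) estimate can be sketched in a few lines of Python. The study itself used the R package survival; the function name below is hypothetical and the log-rank test is not covered.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each patient (e.g. "OS delay")
    events: 1 if the event was observed, 0 if censored (e.g. "OS event")
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((current_time, survival))
        # Both events and censored patients leave the risk set.
        at_risk -= leaving
    return curve
```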
Ectopic expression of Testis Specific/Placental Specific (TSPS) genes
In a recent study [Rousseaux 2013], the ectopic expression in lung cancers of TSPS genes
(n=26) was shown to be associated with a poor prognosis. To assess whether this abnormal
expression of TSPS genes was related or not to the basaloid histology, we developed a
per-sample TSPS expression score as follows: for each of the 26 TSPS genes (given in Sup_Table_9
of [Rousseaux 2013]) we performed a z-score transformation of the expressions (i.e.
subtracting the mean, dividing by the standard-error) and dichotomized the z-score
transformed values using the 99th percentile (P99) of the standard Gaussian distribution as
threshold (1 if z-score > P99, 0 otherwise). Then the dichotomized values were summed over
the 26 genes for each sample. This TSPS score was then compared to histological subgroups
using a chi2 test (see table below).
Chi2 p-value = 0.00049

                 Other SCC    Pure basaloid SCC
TSPS score=0        48               8
TSPS score=1        15               6
TSPS score>1         6              10
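The TSPS score and the chi2 test on the table above can be sketched as follows. This Python code is an illustration under stated assumptions, not the original R code: function names are hypothetical, the z-score uses the sample standard deviation across samples, genes with constant expression are skipped, and the p-value uses the closed form exp(-chi2/2), which is valid only for 2 degrees of freedom (the case of a 3x2 table).

```python
import math
from statistics import NormalDist, mean, stdev

# 99th percentile of the standard Gaussian distribution (about 2.326).
P99 = NormalDist().inv_cdf(0.99)


def tsps_scores(expression_by_gene):
    """expression_by_gene: {gene: [expression for each sample]}.
    Returns, for each sample, the number of genes with z-score > P99."""
    n_samples = len(next(iter(expression_by_gene.values())))
    scores = [0] * n_samples
    for values in expression_by_gene.values():
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # assumption: constant genes contribute nothing
        for i, x in enumerate(values):
            if (x - mu) / sd > P99:
                scores[i] += 1
    return scores


def chi2_test_2dof(table):
    """Pearson chi2 test for a contingency table with 2 degrees of freedom.
    Returns (chi2 statistic, p-value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(column) for column in zip(*table)]
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    if dof != 2:
        raise ValueError("closed-form p-value only valid for 2 degrees of freedom")
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2)


# Counts from the table: rows are TSPS score = 0, = 1, > 1;
# columns are other SCC and pure basaloid SCC.
stat, pvalue = chi2_test_2dof([[48, 8], [15, 6], [6, 10]])
```

Applied to the counts of the table, this yields a chi2 statistic of about 15.2 and a p-value consistent with the reported 0.00049.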
Supplemental references
1. Ben-Porath I, Thomson MW, Carey VJ, Ge R, Bell GW, Regev A et al. An embryonic stem cell-like
gene expression signature in poorly differentiated aggressive human tumors. Nat Genet 2008;40:499–
507.
2. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D et al. Oncogenic pathway signatures in human
cancers as a guide to targeted therapies. Nature 2006;439:353–357.
3. Cairo S, Armengol C, de Reyniès A, Wei Y, Thomas E, Renard CA et al. Hepatic stem-like phenotype
and interplay of Wnt/beta-catenin and Myc signaling in aggressive childhood liver cancer. Cancer Cell
2008;14:471–484.
4. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS et al. Improving gene set analysis of
microarray data by SAM-GS. BMC Bioinformatics 2007;8:242.
5. Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet
Epidemiol 2002;23:70–86.
6. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues.
Bioinformatics 2007;23:980–987.
7. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. A global test for groups of genes: testing
association with a clinical outcome. Bioinformatics 2004;20:93–99.
8. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip
probe level data. Nucleic Acids Res 2003;31:e15.
9. Lee ES, Son DS, Kim SH, Lee J, Jo J, Han J et al. Prediction of recurrence-free survival in
postoperative non-small cell lung cancer patients by using an integrated model of clinical information
and gene expression. Clin Cancer Res 2008;14:7397–7404.
10. Mermel CH, Schumacher SE, Hill B, Meyerson ML, Beroukhim R, Getz G. GISTIC2.0 facilitates
sensitive and confident localization of the targets of focal somatic copy-number alteration in human
cancers. Genome Biol 2011;12:R41.
11. Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F et al. High-resolution genomic
profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res
2006;16:1136–1148.
12. Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH. Genome Alteration Print
(GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome
Biol 2009;10:R128.
13. Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, Koontz J et al. A genomic strategy to refine
prognosis in early-stage non-small-cell lung cancer. N Engl J Med 2006;355:570–580.
14. Raponi M, Zhang Y, Yu J, Chen G, Lee G, Taylor JMG et al. Gene expression signatures for predicting
prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res 2006;66:7466–7472.
15. Roepman P, Jassem J, Smit EF, Muley T, Niklinski J, van de Velde T et al. An immune response
enriched 72-gene prognostic profile for early-stage non-small-cell lung cancer. Clin Cancer Res
2009;15:284–290.
16. Rousseaux S, Debernardi A, Jacquiau B, Vitte AL, Vesin A, Nagy-Mignotte H et al. Ectopic activation
of germline and placental genes identifies aggressive metastasis-prone lung cancers. Sci Transl Med
2013;5:186ra66.
17. Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Göransson H, Juliusson G et al. Segmentation-based
detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome
SNP arrays. Genome Biol 2008(a);9:R136.
18. Staaf J, Vallon-Christersson J, Lindgren D, Juliusson G, Rosenquist R, Höglund et al. Normalization of
Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios.
BMC Bioinformatics 2008(b);9:409.
19. Takeuchi T, Tomida S, Yatabe Y, Kosaka T, Osada H, Yanagisawa K et al. Expression profile-defined
classification of lung adenocarcinoma shows close relationship with underlying major genetic changes
and clinicopathologic behaviors. J Clin Oncol 2006;24:1679–1688.
20. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic.
Journal of the Royal Statistical Society 2001;63:411–423.
21. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array
CGH data. Bioinformatics 2007;23:657–663.
22. Wilkerson MD, Yin X, Hoadley KA, Liu Y, Hayward MC, Cabanski CR et al. Lung squamous cell
carcinoma mRNA expression subtypes are reproducible, clinically important, and correspond to normal
cell types. Clin Cancer Res 2010;16:4864–4875.
23. Zhu CQ, Ding K, Strumpf D, Weir BA, Meyerson M, Pennell N et al. Prognostic and predictive gene
signature for adjuvant chemotherapy in resected non-small-cell lung cancer. J Clin Oncol
2010;28:4417–4424.