Supplemental Table 1: (A) Characteristics of the antibodies used for immunohistochemical measures. (B) Clinical characteristics of the CIT cohort of patients.

Supplemental Table 2A: Gene expression related statistics in the CIT cohort.
For each gene (rows), several related statistics calculated on the CIT cohort are given (columns): log2 geometric mean of expression measures (prefix GM.) within each histological subtype, log2 gene expression fold change (prefix FC.) between two histological subtypes, moderated t-test (limma) p-values and q-values, as well as AUC, derived from comparisons between basaloid and non-basaloid tumors. Columns AK to AV: the expression of each gene was dichotomized using a rule “if expression operator cutoff then 1 else 0” (operator being either < or >), in order to discriminate basaloid and non-basaloid tumors, yielding related statistics (TP, TN, FP, FN, sensitivity, specificity, PPV, NPV, accuracy, Fisher test). NB: genes yielding a Fisher test p-value > .001 were filtered out.
Abbreviations: BAS = basaloid, pBAS = pure basaloid, mBAS = mixed basaloid, SCC = non-basaloid SCC, SCCwd = well differentiated SCC, SCCpd = poorly differentiated SCC, vs = versus, TP = number of true positives, TN = number of true negatives, FP = number of false positives, FN = number of false negatives, AUC = Area Under (ROC) Curve, PPV = positive predictive value, NPV = negative predictive value.

Supplemental Table 2B: Immunohistochemistry quick score measures of SOX4 and IVL and prediction of the basaloid status.
Immunohistochemistry measures of SOX4 and IVL are given as quick scores (QS) for the 66 samples of the training series and the 35 samples of the validation series. The prediction of the basaloid status based on these measures is shown (column IHC.prediction).

Supplemental Table 3: Gene signatures and pathways deregulation analysis in histological subtypes.
Matrix giving the relative index of deregulation of a series of signatures and pathways in each histological subgroup (see column name prefix: basaloid = BAS; pure basaloid = pBAS; mixed basaloid = mBAS; non-basaloid SCC = SCC; well differentiated SCC = SCCwd; poorly differentiated SCC = SCCpd). Rows correspond to gene signatures / pathways (n=14725). Columns correspond to the deregulation index* based on either (i) all measured genes, (ii) underexpressed** genes (column name suffix: _DOWN), or (iii) overexpressed*** genes (column name suffix: _UP).
*: the deregulation index corresponds to a rank across all pathways/signatures (based on four methods; see Methods): the lower the rank, the more the underlying genes are differentially expressed. Of note, pathways with high ranks (e.g. > 10000) may show differential expression in the group of interest as compared to other groups, but at a lower significance than pathways with small ranks (e.g. < 1000).
**: underexpressed genes are defined here as genes whose mean expression in the group of interest is lower than their mean expression in the rest of the samples, without reference to any statistical test.
***: overexpressed genes are defined here as genes whose mean expression in the group of interest is higher than their mean expression in the rest of the samples, without reference to any statistical test.
The pathways included in Figure 2 are shown in column ‘Figure 2’.

Supplemental Table 4: DNA copy number aberration statistics in the CIT cohort.
(A) Frequencies and Fisher test p-values associated with each GISTIC significant region of DNA copy number gain or loss in basaloid tumors compared to SCC tumors.
(B) Frequencies and Fisher test p-values associated with DNA copy number gains and losses of all genes, as estimated by GISTIC, in basaloid tumors compared to SCC tumors. Genes for which the minimal Fisher test p-value was greater than 0.005 were filtered out.
(C) Frequencies and Fisher test p-values associated with DNA copy number gains and losses of all markers (probes) in basaloid tumors compared to non-basaloid SCC tumors. The last two columns correspond to global Fisher tests comparing all four subtypes. Genes for which the minimal Fisher test p-value was greater than 0.001 were filtered out.

Supplemental Table 5: 139-gene centroids used for prediction of the CIT molecular subtypes.

Supplemental Table 6: Molecular subtype prediction and survival data for each sample of the CIT and 8 validation datasets.
(A) For all samples (rows) from the CIT cohort and the 8 public datasets, several variables (columns) are given: "CIT molecular cluster" = cluster of the consensus partition (K=4) for the CIT dataset; "CIT subtype predicted" = molecular subtype as established with our 139-gene predictor; "Wilkerson subtype predicted" = predicted molecular subtype using the predictor from Wilkerson et al.; "Stage" = histological staging; "OS event" and "OS delay" = overall survival data; "Histology" = histological subtype (CIT dataset only); "Periendoalveolar contingent" = presence of an alveolar cell component (CIT dataset only).
(B) Contingency tables: (i) between CIT clusters and histological subtypes; (ii) between CIT clusters and the endoalveolar feature; (iii) between CIT clusters and CIT predicted subtypes (using the 139-gene predictor) in the CIT discovery cohort; (iv) between Wilkerson predicted subtypes and histological subtypes; (v) between CIT predicted subtypes (using the 139-gene predictor) and series (discovery + 8 public series); (vi) between CIT predicted subtypes and Wilkerson predicted subtypes in the validation cohorts.
NB: tables (i) to (iv) are based on the CIT dataset; tables (v) and (vi) are obtained using all datasets (CIT dataset + 8 public datasets).

Supplemental Table 7: Gene signatures and pathways enrichment analysis in molecular subtypes in the CIT cohort and 4 validation datasets.
Matrix giving the relative index of deregulation of a series of signatures and pathways in each CIT molecular subtype, for five series (CIT, Wilkerson, Lee, Raponi, Roepman). NB: we discarded the series in which some CIT molecular subtypes had too few corresponding samples (n<5). Rows correspond to gene signatures / pathways (n=14725). Columns correspond to the deregulation index* based on either (i) all measured genes, (ii) underexpressed** genes (column name suffix: _DOWN), or (iii) overexpressed*** genes (column name suffix: _UP). Each analysis was repeated independently in the five series.
*: the deregulation index corresponds to a rank across all pathways/signatures (based on four methods; see Methods): the lower the rank, the more the underlying genes are differentially expressed. Of note, pathways with high ranks (e.g. > 10000) may show differential expression in the group of interest as compared to other groups, but at a lower significance than pathways with small ranks (e.g. < 1000).
**: underexpressed genes are defined here as genes whose mean expression in the group of interest is lower than their mean expression in the rest of the samples, without reference to any statistical test.
***: overexpressed genes are defined here as genes whose mean expression in the group of interest is higher than their mean expression in the rest of the samples, without reference to any statistical test.
The pathways included in Figure 6 are shown in column ‘Figure 6’.

Supplemental Information

RNA and DNA extraction
Total RNA was extracted with TRIzol reagent (Invitrogen), according to the manufacturer’s instructions, and DNA was extracted with phenol-chloroform using standard procedures. The integrity of the extracts was verified on an Agilent 2100 Bioanalyser (Agilent Technologies).

TP53 sequencing
Mutations were analysed using genomic DNA isolated from paraffin-embedded archived tissue sections. Briefly, DNA was extracted with the standard QIAamp DNA extraction kit (QIAGEN, Hilden, Germany).
TP53 mutations (exons 4 to 8) were detected by direct sequencing with the BigDye® Terminator v1.1 Cycle Sequencing Kit (Applied Biosystems) using primers and conditions described in the IARC TP53 standard sequencing protocol (http://p53.iarc.fr/ProtocolsAndTools.aspx). Each PCR product was bidirectionally sequenced and analyzed using a 16-capillary automated sequencer (ABI PRISM® 3100 Genetic Analyzer, Applied Biosystems). All sequence variations were confirmed by running a second, independent PCR and sequencing analysis of the same sample.

Microarray procedures
Microarray analyses were performed with 3 µg of total RNA as starting material and 10 µg of cRNA per hybridization (GeneChip Fluidics Station 400; Affymetrix, Santa Clara, CA). The total RNAs were amplified and labeled following the one-cycle target labeling protocol (http://www.affymetrix.com). The labeled cRNAs were hybridized to HG-U133 plus 2.0 Affymetrix GeneChip arrays (Affymetrix). The chips were scanned with an Affymetrix GeneChip Scanner 3000 and the resulting images were analyzed with GCOS 1.4 (Affymetrix). Data were normalized using the Robust Multi-array Average (RMA) method, implemented in the R package affy [Irizarry 2003], and probe set log2 intensities were then averaged per HUGO Gene Symbol. Data were deposited in ArrayExpress under accession E-MTAB-2435.

Public gene expression datasets
Eight public expression profiling datasets of lung SCC were collected from GEO (http://www.ncbi.nlm.nih.gov/geo/ ; GSE3141, GSE14814, GSE17710, GSE3593, GSE4573, GSE11969, GSE8894) or from the supplemental data of the related article ([Roepman 2009]). They corresponded to 533 expression profiles published in [Bild 2006] (hgu133plus2 arrays), [Zhu 2010] (hgu133a arrays), [Wilkerson 2010] (Agilent44k arrays), [Potti 2006] (hgu133a arrays), [Raponi 2006] (hgu133a arrays), [Takeuchi 2006] (Agilent44k arrays), [Roepman 2009] (Agilent44k arrays) and [Lee 2008] (hgu133plus2 arrays).
Affymetrix series were normalized with the RMA method; for the other series, the normalized data provided with each dataset were used directly. Probes (non-Affymetrix chips) or probe sets (Affymetrix chips) were then averaged per HUGO Gene Symbol.

SNP array hybridization and preprocessing
Hybridization was performed by Integragen (Evry, France), according to the manufacturer's protocols (Illumina, San Diego, CA). Raw fluorescent signals were imported into Illumina BeadStudio software and normalized as previously described [Peiffer 2006] to obtain the log R ratio (LRR) and B Allele Frequency (BAF) for each SNP. A supplemental normalization procedure, tQN [Staaf 2008(b)], was applied to correct for dye bias. Genomic profiles were then segmented by applying the circular binary segmentation algorithm (DNAcopy package, Bioconductor) [Venkatraman 2007] to LRR and BAF data separately, as previously described [Staaf 2008(a)], [Popova 2009]. The absolute copy numbers and allelic status were then determined per segment using the Genome Alteration Print (GAP) method [Popova 2009]; this method was also used to estimate the proportion of diploid cells within each tumor sample. To identify regions of interest, GISTIC2.0 software [Mermel 2011] (http://genepattern.broadinstitute.org/) was used with default parameters and a confidence level set at 0.9. Sex chromosomes were excluded from the analysis. Regions with a q-value < 0.25 were considered significant. Data have been deposited in ArrayExpress under accession E-MTAB-2435.

Quick score determination
The Quick Score is obtained from immunohistochemical measures by multiplying the percentage of positive cells (P) by the intensity (I): QS = P × I, yielding a value ranging from 0 to 300.

Statistical analyses
Statistical analyses were performed using R software version 2.14.0 (http://www.R-project.org), including R packages from Bioconductor version 2.9 (www.bioconductor.org) and in-house R packages.
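
To make the Quick Score defined above concrete, the following minimal Python sketch computes QS = P × I for one sample. It is illustrative only: the analyses described here were run in R, the function name `quick_score` is hypothetical, and the 0 to 3 intensity scale is an assumption inferred from the stated 0 to 300 range of the score.

```python
def quick_score(percent_positive, intensity):
    """Return QS = P x I for one immunohistochemistry measurement.

    percent_positive: percentage of positive cells, 0..100.
    intensity: staining intensity, assumed here to be graded 0..3.
    """
    if not 0 <= percent_positive <= 100:
        raise ValueError("percent_positive must lie in 0..100")
    if not 0 <= intensity <= 3:
        raise ValueError("intensity must lie in 0..3 (assumed scale)")
    return percent_positive * intensity


# Example: 80% positive cells at intensity 2.
print(quick_score(80, 2))  # 160
```

Under these assumptions, a sample with 100% positive cells at the maximal intensity reaches the upper bound of 300.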

Differential expression analysis
To identify differential expression, we used moderated t-tests (limma R package). False Discovery Rate (FDR) estimates were obtained using the Benjamini & Hochberg method. The proportion of true alternative hypotheses (H1) was estimated using Storey's method.

Pathways enrichment analysis
Pathway analysis based on comparisons of sample groups was performed using a consensus of four methods (GSA [Efron 2002]: R package GSA; globaltest [Goeman 2004]: R package globaltest; SAM-GS [Dinu 2007]: original R code; Tukey [Goeman 2007]: original R code). Given two sample groups to be compared and a pathway of interest, each method yields a p-value: the lower the p-value, the more the genes of the pathway are differentially expressed between the sample groups. To aggregate the results of the four methods for a given list of pathways, we first ranked the pathways by p-value for each method; we then calculated, for each pathway, the mean rank across the four methods. The final order is based on this mean rank. Pathways and related genes were retrieved from the literature [Ben-Porath 2008] and several repositories: MSigDB (www.broadinstitute.org/gsea/msigdb/), Gene Ontology (www.geneontology.org), KEGG (www.genome.jp/kegg/pathway.html), Biocarta (http://www.biocarta.com/genes/allpathways.asp), Stanford Microarray Database (http://genome-www.stanford.edu/microarray/), and NCI / Nature curated pathways (http://pid.nci.nih.gov/browse_pathways.shtml).

Unsupervised clustering analysis
Consensus clustering analysis was performed as previously reported [Cairo 2008], yielding consensus dendrograms and consensus partitions of mRNA expression profiles of tumor samples. Briefly, 24 dendrograms were obtained by hierarchical clustering using 3 different linkages (Ward, Complete, Average) and 8 lists of genes corresponding to the top 40% to 0.5% (8 different thresholds) most varying genes (assessed via a robust coefficient of variation).
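
As a rough illustration of the gene selection step just described, the Python sketch below ranks genes by a robust coefficient of variation and keeps a top fraction. Several points are assumptions rather than the authors' procedure: the text does not define the robust coefficient of variation, so the sketch uses the median absolute deviation divided by the absolute median; the function names and the toy data are hypothetical; and Python is used for illustration only, since the original analysis was run in R.

```python
from statistics import median


def robust_cv(values):
    """Assumed definition: median absolute deviation divided by |median|."""
    med = median(values)
    if med == 0:
        return 0.0  # avoid division by zero; the gene is treated as non-varying
    mad = median(abs(v - med) for v in values)
    return mad / abs(med)


def top_varying_genes(expression, fraction):
    """Return the names of the top `fraction` most varying genes.

    expression: dict mapping gene name -> list of expression values.
    fraction: share of genes to keep, e.g. 0.40 or 0.005.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(expression, key=lambda g: robust_cv(expression[g]), reverse=True)
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


# Hypothetical toy data: four genes measured in five samples.
toy = {
    "A": [1, 1, 1, 1, 1],
    "B": [1, 2, 3, 4, 5],
    "C": [10, 10, 11, 10, 10],
    "D": [1, 9, 5, 3, 7],
}
print(top_varying_genes(toy, 0.5))  # ['D', 'B']
```

Repeating such a selection at each of the 8 thresholds and clustering each gene list with each of the 3 linkages would give the 3 × 8 = 24 dendrograms mentioned above; only the two extreme thresholds (40% and 0.5%) are stated in the text.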
For a given number k of clusters, each of the 24 dendrograms yields a raw partition in k clusters; the 24 raw partitions in k clusters are then used to build a consensus dendrogram, which is itself cut into k clusters, thus yielding a consensus partition in k clusters. To choose the final number k of clusters, we calculated the gap statistic [Tibshirani 2001] on consensus partitions in k = 2 to 8 clusters, and selected k=4 as it yielded the best gap statistic. Principal component analysis (PCA) was performed using the stats R package.

Prediction of CIT subtypes (microarray-based): In order to classify all publicly available microarray SCC series according to our four subtypes, we built a centroid-based classifier using the diagonal linear discriminant analysis (DLDA) algorithm (see definition in Supplemental Table 5); this predictor was trained using the CIT series (n=93 SCC). The AUC criterion (Area Under the Curve of a ROC curve, PresenceAbsence R package) was used to identify genes discriminating sample groups.

Prediction of Wilkerson subtypes (microarray-based): In order to classify all samples according to Wilkerson's subtypes, we used the published centroid-based classifier [Wilkerson 2010].

Survival analysis
Survival curves were obtained using Kaplan-Meier estimates, and differences between survival curves were assessed using the log-rank test (R package survival).

Ectopic expression of Testis Specific/Placental Specific (TSPS) genes
In a recent study [Rousseaux 2013], the ectopic expression of TSPS genes (n=26) in lung cancers was shown to be associated with a poor prognosis. To assess whether this abnormal expression of TSPS genes was related to the basaloid histology, we developed a per-sample TSPS expression score as follows: for each of the 26 TSPS genes (given in Supplemental Table 9 of [Rousseaux 2013]) we performed a z-score transformation of the expression values (i.e.
subtracting the mean, dividing by the standard deviation) and dichotomized the z-score transformed values using the 99th percentile (P99) of the standard Gaussian distribution as threshold (1 if z-score > P99, 0 otherwise). The dichotomized values were then summed over the 26 genes for each sample. This TSPS score was then compared to histological subgroups using a chi2 test (see table below).

                 Other SCC   Pure basaloid SCC
TSPS score = 0       48              8
TSPS score = 1       15              6
TSPS score > 1        6             10

Chi2 p-value = 0.00049

Supplemental references
1. Ben-Porath I, Thomson MW, Carey VJ, Ge R, Bell GW, Regev A et al. An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Nat Genet 2008;40:499–507.
2. Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006;439:353–357.
3. Cairo S, Armengol C, de Reyniès A, Wei Y, Thomas E, Renard CA et al. Hepatic stem-like phenotype and interplay of Wnt/beta-catenin and Myc signaling in aggressive childhood liver cancer. Cancer Cell 2008;14:471–484.
4. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS et al. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 2007;8:242.
5. Efron B, Tibshirani R. Empirical bayes methods and false discovery rates for microarrays. Genet Epidemiol 2002;23:70–86.
6. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007;23:980–987.
7. Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20:93–99.
8. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003;31:e15.
9. Lee ES, Son DS, Kim SH, Lee J, Jo J, Han J et al.
Prediction of recurrence-free survival in postoperative non-small cell lung cancer patients by using an integrated model of clinical information and gene expression. Clin Cancer Res 2008;14:7397–7404.
10. Mermel CH, Schumacher SE, Hill B, Meyerson ML, Beroukhim R, Getz G. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 2011;12:R41.
11. Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16:1136–1148.
12. Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH. Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biol 2009;10:R128.
13. Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, Koontz J et al. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med 2006;355:570–580.
14. Raponi M, Zhang Y, Yu J, Chen G, Lee G, Taylor JMG et al. Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res 2006;66:7466–7472.
15. Roepman P, Jassem J, Smit EF, Muley T, Niklinski J, van de Velde T et al. An immune response enriched 72-gene prognostic profile for early-stage non-small-cell lung cancer. Clin Cancer Res 2009;15:284–290.
16. Rousseaux S, Debernardi A, Jacquiau B, Vitte AL, Vesin A, Nagy-Mignotte H et al. Ectopic activation of germline and placental genes identifies aggressive metastasis-prone lung cancers. Sci Transl Med 2013;5:186ra66.
17. Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Göransson H, Juliusson G et al. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol 2008(a);9:R136.
18. Staaf J, Vallon-Christersson J, Lindgren D, Juliusson G, Rosenquist R, Höglund et al.
Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics 2008(b);9:409.
19. Takeuchi T, Tomida S, Yatabe Y, Kosaka T, Osada H, Yanagisawa K et al. Expression profile-defined classification of lung adenocarcinoma shows close relationship with underlying major genetic changes and clinicopathologic behaviors. J Clin Oncol 2006;24:1679–1688.
20. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society 2001;63:411–423.
21. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007;23:657–663.
22. Wilkerson MD, Yin X, Hoadley KA, Liu Y, Hayward MC, Cabanski CR et al. Lung squamous cell carcinoma mRNA expression subtypes are reproducible, clinically important, and correspond to normal cell types. Clin Cancer Res 2010;16:4864–4875.
23. Zhu CQ, Ding K, Strumpf D, Weir BA, Meyerson M, Pennell N et al. Prognostic and predictive gene signature for adjuvant chemotherapy in resected non-small-cell lung cancer. J Clin Oncol 2010;28:4417–4424.
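
Note on the TSPS contingency table: the chi2 test reported in the section on ectopic expression of TSPS genes can be re-derived from the six counts given there. The Python sketch below is a minimal illustration under stated assumptions: it computes a Pearson chi2 statistic without continuity correction (the text does not say which variant was used), the function names are hypothetical, and the p-value formula exp(-x/2) is valid only for 2 degrees of freedom, which is the case for a 3 × 2 table. The original analysis was run in R.

```python
from math import exp


def pearson_chi2(table):
    """Pearson chi2 statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_pvalue_df2(stat):
    """Upper-tail p-value of a chi2 distribution with exactly 2 degrees of freedom."""
    return exp(-stat / 2.0)


# Rows: TSPS score = 0, = 1, > 1; columns: other SCC, pure basaloid SCC.
tsps_table = [[48, 8], [15, 6], [6, 10]]
stat = pearson_chi2(tsps_table)
p_value = chi2_pvalue_df2(stat)
print(f"chi2 = {stat:.2f}, p = {p_value:.5f}")
```

Under these assumptions the statistic is close to 15.2 and the p-value close to 0.0005, which is consistent with the reported value of 0.00049.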