Efficient Outlier Identification in Lung Cancer Study
Shibing Deng
Pfizer, Inc.

Outline
Background and motivation
COPA statistics
Existing methods
A new method
Comparison of COPA statistics
Application to lung cancer data

What is COPA Statistics?
COPA = Cancer Outlier Profile Analysis
Statistics designed to identify outliers in cancer gene expression profiles.
[Figure: ACTL8 (FC=1.27, FDR=0.031). log2(Intensity) by group, Normal (n=37) vs. Tumor (n=95), with outliers marked "+".]

Motivation
Differential gene expression (DGE) analysis is widely used to identify over- or under-expressed cancer genes.
It assumes two distinct populations: tumor and normal.
However, cancer is not a homogeneous disease:
  It is genetically diverse.
  Oncogenes have heterogeneous activation patterns.
  DGE may happen only in a subset of samples.
COPA identifies DGE in a subset of cancer patients.

Example of Cancer Heterogeneity
Molecular Subsets of Lung Adenocarcinoma
Pao W, Hutchinson K. 2012 Mar 6;18(3):349-51

COPA Methods in Literature
Original COPA method: Tomlins et al 2005
Outlier Sum (OS): Tibshirani and Hastie 2007
Outlier Robust T (ORT): Wu 2007
Likelihood Ratio Statistic (LRS): Hu 2008

Notation
$n_1$ = number of normal samples
$n_2$ = number of tumor samples
$n = n_1 + n_2$ is the total number of samples
$x_{ij}$ is the expression value for sample i and gene j
For gene j (for simplicity, the index j is not shown below):
  Normal samples ($n_1$): $x_1, x_2, x_3, \dots, x_{n_1}$
  Tumor samples ($n_2$): $x_{n_1+1}, x_{n_1+2}, \dots, x_i, \dots, x_n$

The Original COPA Method
Tomlins et al (2005) proposed the original COPA method.
Standardize each gene based on its median and MAD (median absolute deviation).
Define the COPA statistic as the rth percentile (r = 75, 90, 95) of the standardized tumor samples.
Limitations:
1) Fixed r
  With r = 90, the statistic can only detect outliers with expression levels greater than those of 90% of the tumor samples.
  It is not efficient in differentiating the number of outliers.
2) MAD is calculated over all samples
  Outliers can affect the estimate of MAD.

Outlier Sum (OS)
Standardize each gene: center on the median and scale by the MAD based on normal samples:
$x'_{ij} = \dfrac{x_{ij} - \mathrm{median}_j}{\mathrm{MAD}_{j1}}$
Define the OS statistic as the sum of the standardized data from outliers, where outliers are defined as data above Q3 + IQR:
$OS_j = \sum_{i > n_1} x'_{ij} \, I[\, x'_{ij} > q_{75}(x'_j) + IQR(x'_j) \,]$
Improvement over COPA:
1. Outliers are defined based on the data distribution (not fixed).
2. The number of outliers is taken into account.
3. Better scaling factor: MAD1, the MAD of the normal samples.

Outlier Robust T (ORT)
Similar to OS, with different centering (normal group median) and scaling (pooled MAD) factors:
$x'_{ij} = \dfrac{x_{ij} - \mathrm{median}_{j1}}{\mathrm{median}\left( |x_{ij} - \mathrm{median}_{j1}|_{i \le n_1},\ |x_{ij} - \mathrm{median}_{j2}|_{i > n_1} \right)}$
Define ORT as
$ORT_j = \sum_{i > n_1} x'_{ij} \, I[\, x'_{ij} > q_{75}(x'_{kj} : k \le n_1) + IQR(x'_{kj} : k \le n_1) \,]$
The outlier threshold is based on normal group data only.

Likelihood Ratio Statistic (LRS)
Outlier detection is treated as a change-point problem.
Group normal and tumor samples separately, and sort them within each group in ascending order:
  Normal samples ($n_1$): $x_{(1)}, x_{(2)}, x_{(3)}, \dots, x_{(n_1)}$
  Tumor samples ($n_2$): $x_{(n_1+1)}, x_{(n_1+2)}, \dots, x_{(i)}, \dots, x_{(n)}$
Separate all the samples into two groups at the k-th sample, $k = n_1+1, n_1+2, \dots, n-1$, and form a two-sample t statistic
$t_k = \dfrac{\bar{x}_{i > k} - \bar{x}_{i \le k}}{s_k}$, where $s_k = \hat{s} \sqrt{\dfrac{1}{k} + \dfrac{1}{n-k}}$ and $\hat{s}$ is the sample standard deviation.
Define
$LRS = \max_{n_1 < k < n} (t_k)$

Comments on LRS
A maximum t statistic.
Does not provide an explicit definition of outliers.
Every gene provides a max(t).
Need a significance measure (p value) to define outliers.

A New Method: Maximum Square Difference (MSD)
Similar to LRS, but instead of a t statistic we can use a squared difference:
$d_k^2 = \dfrac{\sum_{i > k} (x_i - \bar{x}_k)^2}{s_k \, s}$, with $s_k = \hat{s} \sqrt{\dfrac{1}{k} + \dfrac{1}{n-k}}$ and $s$ the SE for all samples.
Here $\bar{x}_k$ is read as the mean of the first k sorted samples, following the same split at k as in LRS.
Define
$MSD = \max_{n_1 < k < n} (d_k^2)$
More sensitive when the number of outliers is small.

Comparison of the Methods: ROC
The methods were compared by simulation using ROC curves.
With n1 = n2 = 20, we simulate 8000 null genes from the standard normal distribution.
We also simulate 2000 up-regulated genes, in which k of the 20 tumor samples (k = 2, 5, 10 and 15) are drawn from N(2,1).
Based on the percentiles of the COPA statistic from the null genes, we define the detection threshold for a given false positive rate (FPR).
The true positive rate (TPR) is TPR = Prob(copa >= threshold | up-regulated genes).
[Figure: ROC curves (TPR vs. FPR) for COPA, ORT, OS, LRS, T and MSD; n1 = n2 = 20, panels k = 2 and k = 5.]
[Figure: ROC curves (TPR vs. FPR) for COPA, ORT, OS, LRS, T and MSD; n1 = n2 = 50, panels k = 5 and k = 10.]

Comparison of the Methods: FDR
The methods can also be compared on the false discovery rate (FDR).
Simulate n1 = n2 = 20 or 50 samples with 10000 genes, among which 2000 are up-regulated in k tumor samples.
For each detection threshold of the COPA statistic, FDR is the proportion of false positives among all positives.
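The simulation design described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the slides: the function names, the random seed and the 5% FPR are hypothetical choices, "T" is taken to be the pooled-variance two-sample t statistic, and the normal-sample MAD in OS is computed about the normal-sample median. Only the T and OS statistics are included, since their definitions are the least ambiguous.

```python
import numpy as np

def simulate(rng, n_null, n_up, n1, n2, k, shift=2.0):
    """Rows are genes; the first n1 columns are normal, the last n2 tumor."""
    x = rng.standard_normal((n_null + n_up, n1 + n2))
    # Up-regulate k tumor samples in the last n_up genes: N(0,1) -> N(shift,1).
    x[n_null:, n1:n1 + k] += shift
    is_up = np.arange(n_null + n_up) >= n_null
    return x, is_up

def t_stat(x, n1):
    """Pooled-variance two-sample t statistic (tumor minus normal) per gene."""
    a, b = x[:, :n1], x[:, n1:]
    n2 = b.shape[1]
    sp2 = ((n1 - 1) * a.var(axis=1, ddof=1)
           + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return (b.mean(axis=1) - a.mean(axis=1)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

def outlier_sum(x, n1):
    """OS per gene: median centering, scaling by the normal-sample MAD
    (assumption: MAD taken about the normal-sample median)."""
    med = np.median(x, axis=1, keepdims=True)
    normal = x[:, :n1]
    mad1 = np.median(np.abs(normal - np.median(normal, axis=1, keepdims=True)),
                     axis=1, keepdims=True)
    z = (x - med) / mad1
    q25, q75 = np.percentile(z, [25, 75], axis=1, keepdims=True)
    cutoff = q75 + (q75 - q25)          # Q3 + IQR over all samples of the gene
    tumor = z[:, n1:]
    return np.where(tumor > cutoff, tumor, 0.0).sum(axis=1)

def tpr_and_fdr(stat, is_up, fpr=0.05):
    """Threshold at the (1 - fpr) quantile of the null genes, then score."""
    c = np.quantile(stat[~is_up], 1 - fpr)
    called = stat >= c
    tpr = called[is_up].mean()
    fdr = (called & ~is_up).sum() / max(called.sum(), 1)
    return tpr, fdr

rng = np.random.default_rng(0)
x, is_up = simulate(rng, n_null=8000, n_up=2000, n1=20, n2=20, k=15)
for name, fn in [("T", t_stat), ("OS", outlier_sum)]:
    tpr, fdr = tpr_and_fdr(fn(x, 20), is_up)
    print(f"{name}: TPR={tpr:.3f} FDR={fdr:.3f}")
```

Sweeping `fpr` over a grid traces out one ROC curve per statistic, and sweeping the threshold `c` directly gives the FDR curve.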
FDR = (# of false positives) / (# of claimed positives) = sum(copa >= c | null genes) / sum(copa >= c | all genes)
A plot of FDR vs. the fraction of genes declared positive is created.
[Figure: FDR vs. fraction of genes declared positive for COPA, ORT, OS, LRS, T and MSD; n1 = n2 = 20, panels k = 2 and k = 5.]
[Figure: FDR vs. fraction of genes declared positive for COPA, ORT, OS, LRS, T and MSD; n1 = n2 = 50, panels k = 5 and k = 10.]

Comparison of the Methods: Summary
Our new MSD method performs best when a small percentage (≤ 20%) of tumor samples are differentially expressed (DE) outliers.
For a moderate number of DE samples (20-50%), LRS performs better in ROC.
For a large number of DE samples (> 50% of tumors), the t statistic becomes more efficient.
When a relatively large number (> 30%) of DE samples exist, MSD, LRS, ORT and T have comparable FDR.

Assess Significance
The distributions of the COPA statistics are not known.
An analytic solution is not readily available.
A permutation test does not generate the correct null distribution.
Simulation: simulate the COPA statistics under the null and derive the null distribution from a relatively large number of simulations, say 10000.

Distribution of COPA statistics
Simulated null for n1 = n2 = 20.
[Figure: simulated null densities of the original COPA statistic, the OS statistic (non-zero values), the ORT statistic (non-zero values) and the LRS statistic.]

MSD Distribution
Simulated under the null: 10000 genes, n1 = n2 = 20, data from N(0,1).
The figures display the pdf of both MSD and y = sqrt(MSD).
The fitted dashed line is a noncentral chi-square density function for MSD and a normal density for y.
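The simulation approach to significance can be sketched as follows. This is an illustrative example under stated assumptions rather than the author's implementation: it uses the original COPA statistic (90th percentile of tumor samples after median/MAD standardization over all samples) because that definition is fully specified, the function names and seed are hypothetical, and the "+1" correction in the p value is a common convention that the slides do not mention.

```python
import numpy as np

def copa_stat(x, n1, r=90):
    """Original COPA statistic per gene: r-th percentile of the tumor samples
    after standardizing each gene by its median and MAD over all samples."""
    med = np.median(x, axis=1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=1, keepdims=True)
    z = (x - med) / mad
    return np.percentile(z[:, n1:], r, axis=1)

def empirical_p(observed, null_stats):
    """Upper-tail p value: share of simulated null statistics >= observed.
    The +1 terms (an assumption, not from the slides) keep p above zero."""
    null_sorted = np.sort(null_stats)
    n_ge = null_sorted.size - np.searchsorted(null_sorted, observed, side="left")
    return (n_ge + 1) / (null_sorted.size + 1)

rng = np.random.default_rng(1)
n1 = n2 = 20

# Null distribution: 10000 simulated genes with every sample from N(0,1).
null_stats = copa_stat(rng.standard_normal((10000, n1 + n2)), n1)

# One hypothetical gene with 5 of the 20 tumor samples shifted up by 2.
gene = rng.standard_normal((1, n1 + n2))
gene[0, n1:n1 + 5] += 2.0
p = empirical_p(copa_stat(gene, n1), null_stats)
print(p)
```

The same `empirical_p` helper works for any of the statistics, provided the null is simulated with the same n1 and n2 as the data.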
[Figure: density of the MSD statistic (left) and of the square root of MSD (right), each with its fitted curve.]
$MSD \sim \sigma^2 \chi^2_1(\mu^2 / \sigma^2)$, $y \sim N(\mu, \sigma^2)$

MSD Distribution: Parameters
Both $\mu$ and $\sigma$ are functions of n1 and n2, as well as of the underlying gene expression distribution.
If gene expression is assumed to follow a N(0,1) distribution, the MSD parameters are $\mu(n_1, n_2)$ and $\sigma^2(n_1, n_2)$.
Plots show that $\mu$ is driven by n2, and $\sigma$ is driven by the n2/n1 ratio.
[Figure: $\mu$ and $\sigma$ plotted against n1 and n2.]

Outlier Identification
COPA, OS and ORT define outlier samples as part of the method.
MSD and LRS do not provide an explicit definition of outliers.
The following procedure can be used for MSD (or LRS) outlier identification:
1. Calculate MSD for all genes.
2. Estimate the p value of MSD based on the simulated null.
3. Calculate FDR using the Benjamini-Hochberg method.
4. Define outliers as the samples above the sample index at which max(MSD) is attained, for genes with FDR < 0.05.

Application: Lung Cancer Data
One of the drivers in NSCLC is the EML4-ALK fusion (Soda et al 2007).
ALK fusion is associated with high ALK gene expression (Zhang et al 2010).
The prevalence of ALK fusion in NSCLC is about 5%.
Xalkori® is a highly effective ALK inhibitor for treating NSCLC patients with ALK fusion.

NSCLC Expression Data
The Cancer Genome Atlas (TCGA) has expression data generated from 57 normal lung samples and 355 lung adenocarcinoma samples.
Expression data were obtained using RNAseq.

ALK Gene Expression
No significant difference using a t-test.
[Figure: ALK gene expression in normal and tumor NSCLC patients; expression levels, log2(Intensity), by group: 1 (n = 57) and 2 (n = 353).]

ALK Outlier Analysis
The LRS method failed to find any outliers; MSD identified 16 outliers (4.5%).
[Figure: waterfall plots of tumor vs. normal ALK expression levels (median centered and scaled), with normal, tumor and outlier samples marked.]

ALK Gene Fusion
The ALK gene has 29 exons.
The break point of the fusion is between E19 and E20.
[Figure: exon junction diagram. Normal ALK transcript: exons 16-19 upstream of the junction and exons 20-23 downstream. EML4-ALK fusion: EML4 or another partner joined to ALK exons 20-23.]

RNAseq ALK Exon Expression
RNAseq provides a way to measure exon-level expression.
Exons 20-29 showed high expression while exons 1-19 had very low expression, an indication of a fusion event.

Fusion Samples
Among the 16 outlier samples, 7 showed fusion characteristics in exon expression.

Fusion Samples vs. Outlier Samples
Of all 355 tumor samples, 8 showed fusion characteristics in exon expression (marked by "+"); they are in the top 20 samples in ALK mRNA expression.

Summary
We proposed a new cancer outlier analysis method, MSD, and compared it to existing methods.
MSD was shown to be more sensitive in detecting outliers when the prevalence of outliers was small (<20%).

References
Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310(5748):644-8.
Tibshirani R, Hastie T (2007). Outlier sums for differential gene expression analysis. Biostatistics 8:2-8.
Wu B (2007). Cancer outlier differential gene expression detection. Biostatistics 8:566-75.
Hu J (2008). Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19):2193-2199.
Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina K, Hatanaka H, et al. (2007). Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448:561-566.
Zhang X, Zhang S, Yang X, Yang J, Zhou Q, et al. (2010). Fusion of EML4 and ALK is associated with development of lung adenocarcinomas lacking EGFR and KRAS mutations and is correlated with ALK expression. Mol Cancer 9:188.

Acknowledgements
Fred Immermann
Pfizer Oncology Research Unit at La Jolla, CA
Computational Biology
Asia Omics Project Team