Efficient Outlier Identification in Lung
Cancer Study
Shibing Deng
Pfizer, Inc.
Outline
 Background and motivation
 COPA statistics
 Existing methods
 A new method
 Comparison of COPA statistics
 Application to lung cancer data
What is COPA Statistics?
 COPA = Cancer Outlier Profile Analysis
 Statistics designed to identify outliers in cancer gene
expression profile
[Figure: ACTL8 (FC=1.27, FDR=0.031). log2(Intensity) by group, Normal (n=37) vs. Tumor (n=95); outliers are marked "+".]
Motivation
 Differential gene expression (DGE) is widely used to identify
over/under-expressed cancer genes.
 It assumes two distinct populations: tumor and normal
 However, cancer is not a homogeneous disease
 Genetically diverse
 Oncogenes have heterogeneous activation patterns
 DGE may happen only in a subset of samples.
 COPA identifies DGE in a subset of cancer patients
Example of Cancer Heterogeneity
 Molecular Subsets of Lung Adenocarcinoma
Pao W, Hutchinson K. 2012 Mar 6;18(3):349-51
COPA Methods in Literature
 Original COPA method
 Tomlins et al 2005
 Outlier Sum (OS)
 Tibshirani and Hastie 2007
 Outlier Robust T (ORT)
 Wu 2007
 Likelihood Ratio Statistic (LRS)
 Hu 2008
Notation
 n1 = # of normal samples
 n2 = # of tumor samples
 n = n1+ n2 is the total # of samples
 Xij is the expression value for sample i and gene j
For gene j (for simplicity, index j is not shown below), the samples are ordered as:
Normal samples (n1): $x_1, x_2, x_3, \ldots, x_{n_1}$
Tumor samples (n2): $x_{n_1+1}, x_{n_1+2}, \ldots, x_i, \ldots, x_n$
The Original COPA Method
 Tomlins et al (2005) proposed the original COPA method.
 Standardize each gene based on median and MAD
 Define COPA stats as the rth (r = 75, 90, 95) percentile of tumor
samples
 Limitations:
1) Fixed r
 r= 90th percentile, can only detect outliers with expression levels greater
than those of 90% of the tumor samples
 Not efficient in differentiating the number of outliers
2) MAD is calculated over all samples
 Outliers can affect estimate of MAD
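A minimal Python sketch of the original COPA statistic as described above, for a single gene. The function name `copa_statistic` is hypothetical, no MAD scaling constant is applied, and the MAD is assumed to be non-zero.

```python
import numpy as np

def copa_statistic(normal, tumor, r=90):
    """Sketch of the original COPA statistic for one gene.

    Standardizes by the median and MAD computed over ALL samples,
    then returns the r-th percentile of the standardized tumor values.
    Assumes the MAD is non-zero.
    """
    normal = np.asarray(normal, dtype=float)
    tumor = np.asarray(tumor, dtype=float)
    x = np.concatenate([normal, tumor])
    med = np.median(x)                   # median over all samples
    mad = np.median(np.abs(x - med))     # MAD over all samples
    z_tumor = (tumor - med) / mad        # standardized tumor values
    return np.percentile(z_tumor, r)
```

With one outlier among five tumor samples, `r=75` returns 0 for the toy data `normal=[0,1,2,3,4]`, `tumor=[2,2,2,2,12]`, which illustrates the fixed-r limitation: the outlier lies above the chosen percentile and is missed.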
Outlier Sum (OS)
 Standardize each gene
 Median centering
 Scale on MAD based on normal samples
$$x'_{ij} = \frac{x_{ij} - \mathrm{median}_j}{\mathrm{MAD}_{j1}}$$
 Define the OS statistic as the sum of standardized data from outliers, where outliers are defined as data above Q3 + IQR:
$$\mathrm{OS}_j = \sum_{i > n_1} x'_{ij} \cdot I\left[x'_{ij} > q_{75}(x'_j) + \mathrm{IQR}(x'_j)\right]$$
Improvement over COPA:
1. Outliers are defined based on the data distribution (not fixed)
2. Takes account of the number of outliers
3. Better scaling factor: MAD1
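The OS definition can be sketched in Python for a single gene as follows. This is an illustration under assumptions, not a reference implementation: the function name `outlier_sum` is hypothetical, the MAD of the normal group is taken around the normal-group median, and it is assumed to be non-zero.

```python
import numpy as np

def outlier_sum(normal, tumor):
    """Sketch of the Outlier Sum (OS) statistic for one gene.

    Centers on the median of all samples, scales by the MAD of the
    normal samples (assumed non-zero), and sums the standardized
    tumor values above Q3 + IQR of all standardized values.
    """
    normal = np.asarray(normal, dtype=float)
    tumor = np.asarray(tumor, dtype=float)
    x = np.concatenate([normal, tumor])
    med = np.median(x)                                      # median over all samples
    mad1 = np.median(np.abs(normal - np.median(normal)))    # MAD from normal samples only
    z = (x - med) / mad1
    q25, q75 = np.percentile(z, [25, 75])
    cutoff = q75 + (q75 - q25)                              # Q3 + IQR
    z_tumor = z[len(normal):]
    return z_tumor[z_tumor > cutoff].sum()
```

If no tumor sample exceeds the cutoff, the sum is over an empty set and the statistic is 0.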
Outlier Robust T (ORT)
 Similar to OS
 Different centering (normal group median) and scaling factors
(pooled MAD)
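$$x'_{ij} = \frac{x_{ij} - \mathrm{median}_{j1}}{\mathrm{median}\left(|x_{ij} - \mathrm{median}_{j1}|_{i \le n_1},\ |x_{ij} - \mathrm{median}_{j2}|_{i > n_1}\right)}$$
 Define ORT as
$$\mathrm{ORT}_j = \sum_{i > n_1} x'_{ij} \cdot I\left[x'_{ij} > q_{75}(x'_{kj} : k \le n_1) + \mathrm{IQR}(x'_{kj} : k \le n_1)\right]$$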
 Outlier threshold is based on normal group data only
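A corresponding sketch for ORT, again for one gene. The function name `outlier_robust_t` is hypothetical, and the pooled MAD is assumed to be non-zero. The differences from OS are the centering (normal-group median), the scaling (pooled MAD), and the threshold (normal samples only).

```python
import numpy as np

def outlier_robust_t(normal, tumor):
    """Sketch of the Outlier Robust T (ORT) statistic for one gene."""
    normal = np.asarray(normal, dtype=float)
    tumor = np.asarray(tumor, dtype=float)
    med1 = np.median(normal)
    med2 = np.median(tumor)
    # Pooled MAD: median of absolute deviations, each group around its own median
    pooled_mad = np.median(np.concatenate([np.abs(normal - med1),
                                           np.abs(tumor - med2)]))
    z_normal = (normal - med1) / pooled_mad
    z_tumor = (tumor - med1) / pooled_mad
    # Outlier threshold from the normal group only: Q3 + IQR
    q25, q75 = np.percentile(z_normal, [25, 75])
    cutoff = q75 + (q75 - q25)
    return z_tumor[z_tumor > cutoff].sum()
```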
Likelihood Ratio Statistic (LRS)
 Outlier => a change-point problem
 Group normal and tumor samples separately, and sort them within
each group in ascending order:
Normal samples (n1): $x_{(1)}, x_{(2)}, x_{(3)}, \ldots, x_{(n_1)}$
Tumor samples (n2): $x_{(n_1+1)}, x_{(n_1+2)}, \ldots, x_{(i)}, \ldots, x_{(n)}$
 Separate all the samples into two groups at the k-th tumor sample, k =
n1+1, n1+2, …, n-1, and form a two-sample t statistic
$$t_k = \frac{\bar{x}_{i > k} - \bar{x}_{i \le k}}{s_k}, \quad \text{with } s_k = \hat{s}\sqrt{\frac{1}{k} + \frac{1}{n-k}}$$
where $\hat{s}$ is the sample standard deviation
 Define
$$\mathrm{LRS} = \max_{n_1 < k < n} t_k$$
Comments on LRS
 A maximum t statistic
 Does not provide an explicit definition of outliers
 Every gene provides a max(t)
 Need a significance measure (p value) to define outliers
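The change-point search can be sketched in Python as below. The function name `lrs_statistic` is hypothetical. The slides do not say how the sample standard deviation is estimated, so this sketch assumes the usual pooled within-group estimate for each split. It also returns the maximizing split index k, which is useful later for identifying outlier samples.

```python
import numpy as np

def lrs_statistic(normal, tumor):
    """Sketch of the max-t (LRS) statistic for one gene.

    Returns (max t_k, k at the maximum), where k counts samples in the
    concatenated, within-group-sorted vector (normal first, then tumor).
    Assumption: s-hat is the pooled within-group SD for each split.
    """
    normal = np.sort(np.asarray(normal, dtype=float))
    tumor = np.sort(np.asarray(tumor, dtype=float))
    x = np.concatenate([normal, tumor])
    n1, n = len(normal), len(x)
    best_t, best_k = -np.inf, None
    for k in range(n1 + 1, n):                # k = n1+1, ..., n-1
        low, high = x[:k], x[k:]
        ss = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        s_hat = np.sqrt(ss / (n - 2))         # pooled SD (assumed definition)
        s_k = s_hat * np.sqrt(1.0 / k + 1.0 / (n - k))
        t_k = (high.mean() - low.mean()) / s_k
        if t_k > best_t:
            best_t, best_k = t_k, k
    return best_t, best_k
```

With `n` total samples, a maximum at `k` means the top `n - k` tumor samples form the candidate outlier group.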
A New Method –
Maximum Square Difference (MSD)
 Similar to LRS, but instead of a t statistic we use a squared
difference
$$d_k^2 = \frac{(\bar{x}_{i > k} - \bar{x}_{i \le k})^2}{s_k \cdot s}$$
with $s_k = \hat{s}\sqrt{\frac{1}{k} + \frac{1}{n-k}}$ and $s$ = SE for all samples
 Define
$$\mathrm{MSD} = \max_{n_1 < k < n} d_k^2$$
 More sensitive when the number of outliers is small.
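A hedged sketch of MSD for one gene follows. Several details are assumptions rather than statements from the slides: the function name `msd_statistic` is hypothetical, the squared mean difference is divided by the product of s_k and s, the sample standard deviation in s_k is the pooled within-group estimate for each split, and "SE for all samples" is read as the standard deviation of all samples divided by sqrt(n).

```python
import numpy as np

def msd_statistic(normal, tumor):
    """Sketch of the Maximum Square Difference (MSD) for one gene.

    Returns (max d_k^2, k at the maximum). Assumptions: the denominator
    is s_k * s, s-hat is the pooled within-group SD for each split, and
    s is the standard error over all samples, SD / sqrt(n).
    """
    normal = np.sort(np.asarray(normal, dtype=float))
    tumor = np.sort(np.asarray(tumor, dtype=float))
    x = np.concatenate([normal, tumor])
    n1, n = len(normal), len(x)
    s_all = np.std(x, ddof=1) / np.sqrt(n)    # SE over all samples (assumed)
    best_d2, best_k = -np.inf, None
    for k in range(n1 + 1, n):                # k = n1+1, ..., n-1
        low, high = x[:k], x[k:]
        ss = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        s_hat = np.sqrt(ss / (n - 2))         # pooled SD (assumed definition)
        s_k = s_hat * np.sqrt(1.0 / k + 1.0 / (n - k))
        d2 = (high.mean() - low.mean()) ** 2 / (s_k * s_all)
        if d2 > best_d2:
            best_d2, best_k = d2, k
    return best_d2, best_k
```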
Comparison of the Methods - ROC
 Comparisons of the methods were evaluated based on simulation
using ROC curves.
 When n1=n2=20, we simulate 8000 null genes from standard
normal. We also simulate 2000 up-regulated genes with the
number of up-regulated samples (out of 20) k = 2,5,10 and 15
from N(2,1)
 Based on the percentiles of copa statistic from the null genes, we
define the detection threshold for false positive rate (FPR). The
true positive rate (TPR) is
TPR = Prob(copa>=threshold | up-regulated genes)
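The simulation design can be sketched as follows. All names here are hypothetical, and `placeholder_stat` is only a stand-in (tumor maximum minus normal median) so that the example is self-contained; any of the COPA-type statistics could be substituted for it.

```python
import numpy as np

def simulate_genes(n_genes, n1, n2, k, shift, rng):
    """Simulate normal and tumor matrices; the first k tumor samples are N(shift, 1)."""
    normal = rng.standard_normal((n_genes, n1))
    tumor = rng.standard_normal((n_genes, n2))
    tumor[:, :k] += shift
    return normal, tumor

def placeholder_stat(normal, tumor):
    """Stand-in outlier statistic (NOT one of the methods compared in the slides)."""
    return tumor.max(axis=1) - np.median(normal, axis=1)

def tpr_at_fpr(null_stats, alt_stats, fpr):
    """Threshold at the (1 - fpr) quantile of the null; TPR = P(stat >= threshold | up-regulated)."""
    threshold = np.quantile(null_stats, 1.0 - fpr)
    return float(np.mean(alt_stats >= threshold))

rng = np.random.default_rng(0)
null_stats = placeholder_stat(*simulate_genes(8000, 20, 20, 0, 0.0, rng))
alt_stats = placeholder_stat(*simulate_genes(2000, 20, 20, 2, 2.0, rng))
roc = [(fpr, tpr_at_fpr(null_stats, alt_stats, fpr))
       for fpr in (0.01, 0.05, 0.10, 0.20)]
```

Each pair in `roc` is one point on the ROC curve; sweeping a finer grid of FPR values traces the full curve.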
ROC (n1=n2=20, k=2 and 5)
[Figure: ROC curves (TPR vs. FPR) for n1=n2=20, panels k=2 and k=5, comparing COPA, ORT, OS, LRS, T and MSD.]
ROC (n1=n2=50,k=5,10)
[Figure: ROC curves (TPR vs. FPR) for n1=n2=50, panels k=5 and k=10, comparing COPA, ORT, OS, LRS, T and MSD.]
Comparison of the Methods - FDR
 Comparison of methods can also be evaluated based on false
discovery rate (FDR).
 Simulate n1=n2= 20, 50 samples with 10000 genes, among which
2000 are up-regulated in k tumor samples.
 For each detection threshold of copa statistic, FDR is the
proportion of false positives among all positives.
FDR = # of False Positives / All claimed positives
=sum(copa >= c | null genes)/sum(copa>=c | all genes)
 A plot of FDR vs positive rate is created
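The FDR calculation at a set of thresholds can be sketched as below. The function name `fdr_curve` is hypothetical, and returning an FDR of 0 when no gene is declared positive is a convention chosen for this sketch.

```python
import numpy as np

def fdr_curve(stats, is_null, thresholds):
    """For each threshold c, return (fraction of genes declared positive, FDR).

    FDR = #(stat >= c among null genes) / #(stat >= c among all genes).
    Convention for this sketch: FDR is 0.0 when nothing is declared positive.
    """
    stats = np.asarray(stats, dtype=float)
    is_null = np.asarray(is_null, dtype=bool)
    rows = []
    for c in thresholds:
        called = stats >= c
        n_called = int(called.sum())
        false_pos = int((called & is_null).sum())
        fdr = false_pos / n_called if n_called else 0.0
        rows.append((n_called / len(stats), fdr))
    return rows
```

Plotting the second element of each pair against the first gives an FDR versus positive-fraction curve for one method.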
FDR : n1=n2=20, k=2, 5
[Figure: FDR vs. fraction of genes declared positive for n1=n2=20, panels k=2 and k=5, comparing COPA, ORT, OS, LRS, T and MSD.]
FDR : n1=n2=50, k=5, 10
[Figure: FDR vs. fraction of genes declared positive for n1=n2=50, panels k=5 and k=10, comparing COPA, ORT, OS, LRS, T and MSD.]
Comparison of the Methods Summary
 Our new MSD method performs best when a small percentage
(≤ 20%) of tumor samples are differentially expressed (DE) outliers.
 For a moderate number of DE samples (20-50%), LRS performs
better in ROC.
 For a large number of DE samples (>50% of tumors), the t statistic
becomes more efficient.
 When a relatively large number (>30%) of DE samples exist,
MSD, LRS, ORT and T have comparable FDR.
Assess Significance
 The null distributions of the COPA statistics are not known
 An analytic solution is not readily available
 A permutation test does not generate the correct null
distribution.
 Simulation:
Simulate COPA statistics under the null and derive the null
distribution from a relatively large number of simulations, say,
10000.
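A minimal sketch of the simulation approach, assuming the null is standard normal expression in both groups. The names `simulate_null` and `empirical_p_value` are hypothetical, and `stat_fn` stands for any function that maps one gene's normal and tumor vectors to a scalar statistic.

```python
import numpy as np

def simulate_null(stat_fn, n1, n2, n_sim, rng):
    """Simulate stat_fn under the null: both groups drawn from N(0, 1)."""
    stats = np.empty(n_sim)
    for b in range(n_sim):
        stats[b] = stat_fn(rng.standard_normal(n1), rng.standard_normal(n2))
    return stats

def empirical_p_value(observed, null_stats):
    """Proportion of simulated null statistics at least as large as the observed one."""
    return float(np.mean(np.asarray(null_stats) >= observed))
```

For example, `simulate_null(stat_fn, 20, 20, 10000, np.random.default_rng(0))` would give a simulated null for n1=n2=20, and each gene's observed statistic is then compared against it.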
Distribution of COPA statistics
 Simulated null for n1=n2=20.
[Figure: simulated null densities of the original COPA statistic, the OS statistic (non-zero values), the ORT statistic (non-zero values) and the LRS statistic.]
MSD Distribution
 Simulated under the null, 10000 genes, n1=n2=20, data from
N(0,1)
 The figures display the pdf of both MSD and y = sqrt(MSD).
 The fitted dashed line is a noncentral chi-square density function for MSD and a
normal distribution for y:
$$\frac{\mathrm{MSD}}{\sigma^2} \sim \chi^2_1\left(\frac{\mu^2}{\sigma^2}\right), \qquad y \sim N(\mu, \sigma^2)$$
[Figure: density of the MSD statistic and density of Sqrt(MSD), each with the fitted dashed curve.]
MSD Distribution – Parameters
 Both μ and σ are functions of n1 and n2, as well as of the underlying
gene expression distribution. If gene expression is assumed to follow a
N(0,1) distribution, then the MSD parameters are μ(n1,n2) and
σ²(n1,n2).
 Plots show μ is driven by n2, and σ is driven by the n2/n1 ratio.
[Figure: μ and σ plotted against n1 and n2.]
Outlier Identification
 COPA, OS and ORT define outlier samples in their methods.
 MSD and LRS do not provide an explicit definition of outliers
 The following procedure can be used for MSD (or LRS) outlier
identification
 Calculate MSD for all genes
 Estimate p value of MSD based on simulated null
 Calculate FDR based on Benjamini-Hochberg method
 Define outliers as the samples above the max(MSD) sample index and with
FDR<0.05
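The last two steps of the procedure can be sketched as below. The names are hypothetical; `best_k` is assumed to be the split index at which MSD (or LRS) attains its maximum, counted over all n sorted samples, so the outliers are the top n - best_k tumor samples of a gene that passes the FDR cutoff.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adj_sorted, 1.0)
    return adjusted

def outlier_samples(tumor, best_k, n1):
    """Indices (into tumor) of the samples ranked above the split index best_k."""
    tumor = np.asarray(tumor, dtype=float)
    n_out = n1 + len(tumor) - best_k      # number of samples above the split
    order = np.argsort(tumor)             # ascending order
    return np.sort(order[len(tumor) - n_out:])
```

In use, `outlier_samples` would be applied only to genes whose adjusted p-value from `benjamini_hochberg` is below 0.05.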
Application – Lung Cancer Data
 One of the drivers in NSCLC is EML4-ALK fusion (Soda et
al 2007).
 ALK fusion was associated with high ALK gene expression
(Zhang et al 2010)
 The prevalence of ALK fusion in NSCLC is about 5%.
 Xalkori® is a highly effective ALK inhibitor in treating
NSCLC patients with ALK fusion.
NSCLC Expression Data
 The Cancer Genome Atlas (TCGA) has expression data
generated from 57 normal lung samples and 355 lung
adenocarcinoma samples.
 Expression data were obtained using RNAseq.
ALK Gene Expression
 No significant difference using t-test
[Figure: ALK gene expression in normal and tumor NSCLC patients. Expression levels [log2(Intensity)] by group: 1 (n = 57), 2 (n = 353).]
ALK Outlier Analysis
 LRS method failed to find any outliers, MSD identified 16
outliers (4.5%)
[Figure: waterfall plots of tumor vs. normal ALK expression levels (median centered and scaled); legend: Normal, Tumor, Outlier.]
ALK Gene Fusion
 ALK gene has 29 exons
 The break point of fusion is between E19 and E20.
[Figure: exon diagram around the junction. Normal ALK transcript: exons 16-19 upstream of the junction and exons 20-23 downstream. EML4-ALK fusion: EML4 or other partner joined to ALK exons 20-23.]
RNAseq ALK Exon Expression
 RNAseq provides a way to measure exon-level expression.
 Exons 20-29 showed high expression while exons 1-19 had very low expression, an
indication of a fusion event.
Fusion Samples
Among the 16 outlier samples, 7 showed fusion characteristics in exon
expression.
Fusion Samples vs. Outlier Samples
 Of all 355 tumor samples, 8 showed fusion characteristics from exon
expression (marked by “+”); they are in the top 20 samples in ALK mRNA
expression.
Summary
 We proposed a new cancer outlier analysis method, MSD, and
compared it to existing methods.
 MSD was shown to be more sensitive in detecting outliers
when the prevalence of outliers was small (<20%).
References

Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X,
Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM (2005).
Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science
310(5748):644-8.

Tibshirani R, Hastie T (2007). Outlier sums for differential gene expression analysis. Biostatistics
8:2-8.

Wu B (2007). Cancer outlier differential gene expression detection. Biostatistics 8:566-75.

Hu J (2008). Cancer outlier detection based on likelihood ratio test. Bioinformatics 24(19):
2193-2199.

Soda M, Choi YL, Enomoto M, Takada S, Yamashita Y, Ishikawa S, Fujiwara S, Watanabe H, Kurashina
K, Hatanaka H, et al. (2007). Identification of the transforming EML4-ALK fusion gene in
non-small-cell lung cancer. Nature 448:561-566.

Zhang X, Zhang S, Yang X, Yang J, Zhou Q, et al. (2010). Fusion of EML4 and ALK is associated with
development of lung adenocarcinomas lacking EGFR and KRAS mutations and is correlated with
ALK expression. Mol Cancer 9:188.
Acknowledgements
 Fred Immermann
 Pfizer Oncology Research Unit at La Jolla, CA
 Computational Biology
 Asia Omics Project Team