Download Combining Whole-exome and RNA-Seq Data Improves the Quality

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cell-free fetal DNA wikipedia , lookup

Genomic library wikipedia , lookup

Neuronal ceroid lipofuscinosis wikipedia , lookup

Cancer epigenetics wikipedia , lookup

Minimal genome wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Gene wikipedia , lookup

Gene expression programming wikipedia , lookup

Biology and consumer behaviour wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Genomic imprinting wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Koinophilia wikipedia , lookup

Mutagen wikipedia , lookup

Saethre–Chotzen syndrome wikipedia , lookup

Pathogenomics wikipedia , lookup

BRCA mutation wikipedia , lookup

Genome (book) wikipedia , lookup

Gene expression profiling wikipedia , lookup

Genomics wikipedia , lookup

Metagenomics wikipedia , lookup

Population genetics wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Genome evolution wikipedia , lookup

Microevolution wikipedia , lookup

Epistasis wikipedia , lookup

Oncogenomics wikipedia , lookup

Frameshift mutation wikipedia , lookup

Mutation wikipedia , lookup

Point mutation wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
Combining whole-exome and RNA-Seq data improves the
quality of PDX mutation profiles
1
1
1
Manuel Landesfeind , Bruno Zeitouni , Anne-Lise Peille , Vincent Vuaroqueaux
1
Oncotest - Charles River, Am Flughafen 12-14, 79108 Freiburg, Germany
1
2
Introduction
Sequenom / Sanger
Here we investigated whether our WES analysis correctly detected mutations
previously identified by Sanger sequencing. Afterwards, we compared
mutation analyses carried out at the genomic and transcriptomic levels and
investigated the observed similarities and discrepancies.
Figure 1: Distribution of cancer types
in the available PDX collection by
lineage (n=339). Data from Oncotest
Compendium v1.6.
Kidney (25)
Stomach (33)
Pancreas (45)
HUMAN-SPECIFIC MAPPING DATA (BAM)
RNA-Seq
QC
Table 1: Filtering thresholds for NGS mutation analysis.
Variant calling
Filtering parameter
Variant annotation
50
Variant filtering
25
Lung
Breast
Pancreas
Melanoma
Stomach
Alternative allele coverage
min. 3 reads
Alternative allele frequency
min. 5%
Functional impact
Kidney Others
MUTATIONS: SNPs and INDELs (VCF)
Total
“moderate” or “high”
Number of tools reporting variant (WES only)
Figure 3: PDX-specific mutation analysis pipeline for WES and RNASeq data.
• On average, less than 14% of all mutations are detected
by both WES and RNA-Seq analyses (Fig. 8).
• Main reasons for not retrieving a mutation are :
• Non-expression of gene in RNA-Seq
• Mono-allelic expression of wild-type in RNA-Seq
• Low coverage in WES
• Proportion of genes with mutation detected at both levels
reached 54.6% for cancer-related genes (Fig. 9).
Discrepancies between WES and RNA-Seq for these
cancer-related genes are due to similar causes as for the
overall observation (Fig. 10).
Figure 7: Boxplot for proportions of mutations types reported by
WES and RNA-Seq analyses (n=92 per mutation type and data
type). Y-axis limits differ between plots due to the high proportion of
missense mutations.
Mutation rate WES
16
12
40
8
4
Filtered in WES
(2.8% / 84.6%)
30
Wild-type
allele only
(0.5% / 15.4%)
Covered in WES
(3.3% / 12.2%)
Low coverage in WES
(23.9% / 87.7%)
20
RNA-Seq specific
(27.2%)
10
Breast (10)
Colon (47)
SCLC (5)
WES
100
RNA-Seq
25
20
75
15
50
10
25
5
0
0
Missense
Codon
Codon
deletion insertion
Frame
shift
Splice
site
Common
(13.8%)
NSCLC (40)
Figure 6: Mutation rate detected in WES and RNA-Seq data across analyzed
cancer types. The mutation rate is higher in WES than in RNA-Seq for 71 models. The
mutation rate is calculated from mutations found in target regions of WES and in
expressed transcripts of RNA-Seq respectively. Top-right figure displays linear
regression analysis.
Proportion of mutations (%)
• The proportion of mutation types is not different between
WES and RNA-Seq (Fig. 7).
30
50
Mutation rate (per Mb)
• Mutation rate detected in WES is usually higher than
detected by RNA-Seq (Fig. 6).
20
Mutation rate
RNA-Seq
• Good correlation between mutation rates (number of
mutations per Mb) detected by WES and RNA-Seq
analysis (Spearman correlation coefficient = 0.79).
10
Start lost
Nonsense
Stop lost
Figure 8: Comparison
of
overlap
of
mutations detected in
WES and RNA-Seq
analyses. Percentages
are averaged for 92
single model analyses.
For each category, the
first percentage (bold
font) specifies the
fraction of all exonic
mutations while the
second
percentage
specifies the proportion
with respect to the
category above.
WES specific
(58.9%)
• Our WES analysis retrieved > 95% of these mutations (Fig. 4)
• The 4.4% mutations missed by WES are mainly due to pipeline
filtering or not being targeted by the enrichment kit (Fig. 5)
Filtered in RNA-Seq
(1.3% / 7.4%)
Covered
in RNA-Seq
(17.6% / 29.9%)
• 509 additional mutations were reported by WES at positions not
analyzed by Sanger sequencing
• Regarding the genomic regions analyzed by WES and Sanger, a
high overall detection quality is achieved by our pipeline (Tab. 2)
Table 2: Quality metrics for WES mutation detection.
min. 10 bases
Sensitivity
Precision
95.6%
99.8%
Common
RNA-Seq specific
WES specific
APC
TP53
KRAS
PIK3CA
NOTCH1
TGFBR2
ALK
SMAD4
PTEN
NOTCH2
ERBB4
BRAF
ERBB3
EGFR
ATM
MSH6
MLH1
Gene expressed
(7.6% / 18.5%)
Homozygous
in WES
(0.5% / 2.8%)
Sanger
510
Colon
478
22
No coverage at exon
(3)
Position of mutation in
STK11 not targeted
by Agilent Kit v1 38mb
(6)
Conclusions
Analysis of cancer-related genes using our WES
pipeline accurately identified mutations that were
previously detected by Sanger sequencing.
Breast
WES analysis investigated larger genomic ranges and
thereby generated more comprehensive mutation
profiles.
Only a small proportion of genomic mutations are
retrieved on the transcriptomic level mainly due to nonexpression of gene or expression of wild-type allele.
RNA-Seq analysis detected mutations in genomic
regions that are poorly enriched in WES.
TP53
KEAP1
SMARCA4
U2AF1
KRAS
RB1
CDKN2A
NF1
ALK
STK11
PTEN
ERBB2
BRAF
ARID1A
KDR
NFE2L2
MET
MAG
KIF5B
ERBB4
SLC34A2
SETD2
ERBB3
Figure 10: Pie chart showing
distribution of reasons for not
retrieving mutations in cancerrelated genes using RNA-Seq data.
Data from WES specific mutations
exposed in Fig. 9.
Figure 5: Pie chart Filtered
(8)
showing
distribution
of
reasons for not
retrieving
mutations in WES.
5
KMT2C
TP53
FLG
BRCA1
LRP2
ATM
RYR3
RYR2
MAP3K1
LRP1
GATA3
ETV6
ERBB2
Low coverage in RNA-Seq
(41.3% / 70.1%)
Gene not expressed
(33.7% / 81.5%)
WES
KRAS mutations not found in pancreatic models
with high mouse stroma content (5)
Our results demonstrate the importance of transcript
mutation analysis which is closer to the protein level
and enhances understanding of cancer biology.
NSCLC
Mono-allelic
expression of
wild-type
(15.8% / 89.8%)
Figure 4: Venndiagram of 500
mutations from
Sanger and 988
mutations from
WES.
• WES analysis reported one additional mutation that was not
confirmed by Sanger sequencing
max. 5 nucleotides
Figure 9: Mutation landscape
of lineage-specific cancerrelated genes for all analyzed
models. Color indicates the
analysis method by which the
mutated gene was reported. For
201 genes, mutations are found
on both transcriptome and
genome. 141 genes are
mutated on genome level only
and 27 genes on transcriptome
level only. Non-expressed
genes are marked by “x”.
Evaluation of WES mutation detection pipeline
• We investigated as reference data, 25 genes in 284 models
containing 500 mutations assessed by Sanger sequencing
all 3
Median length of mapped segments showing alternative
(RNA-Seq only)
Comparison of mutation profiles betw een WES and RNA -Seq data
RNA-Seq
max. 2 times
Homopolymer run length (RNA-Seq only)
Figure 2: Proportion of PDX models characterized by
different methods. Data from Oncotest Compendium v1.6.
WES
Threshold value
Found in resequencing project
0
Colon
Breast (17)
Read Mapping on Mouse
mm10 genome
Removal of mouse reads
75
Lung (74)
Melanoma (22)
4
Profiled models (%)
Others (64)
WES
Read Mapping on Human
hg19 genome
Methods: PDX-specific bioinformatics pipelines (Fig. 3) comprises mapping of
paired-end reads to human and mouse reference genomes using BWA and
TopHat for WES and RNA-Seq data respectively. Reads that map better to
mouse are removed. Human-specific duplicate reads are discarded. After base
quality score recalibration, WES Variant detection is performed utilizing GATKLite, Samtools, Freebayes. Samtools only for RNA-Seq data. All variants are
annotated by SnpEff to predict their impact at the protein level. The filtering of
variants is based on the variant coverage statistics, the predicted effect in the
protein and by discarding known polymorphisms from population centric
annotation databases such as dbSNP, Hapmap, or 1000 genomes (Table 1).
QC
SEQUENCING DATA (FASTQ)
100
Colon (59)
3
Mutation profiling technologies and data analysis
Materials:
• Mutation screening using Sequenom Oncocarta panels 1-3
targeting mutations in 39 genes, validation of mutations by
Sanger sequencing in 25 genes in 284 models
• WES data for 300 models, DNA extraction using Agilent
SureSelect All Human Exon kits (v1, v4, or v5) and
sequencing with expected average-of-coverage of 100X
• 92 models analyzed with RNA-Seq at minimum of 50M
paired-end reads
Patient-derived xenograft tumor models (PDX) are an important tool for anticancer agent testing. Oncotest has developed a unique collection of models
covering all major lineages (Fig. 1). In the context of targeted therapy
development, an accurate and comprehensive molecular characterization is
crucial for model selection and for biomarker identification. We recently
analyzed our PDX collection by whole-exome sequencing (WES) and RNASeq for the identification of mutations at both the genomic and transcriptomic
levels.
1
SCLC
Filtered
(9)
Mono-allelic
expression
of wild-type
(58)
Gene not
expressed
(41)
Low coverage but
gene expressed
(33)
We believe that the combined molecular
characterization profiles from WES and RNA-Seq data
increase the quality of our PDX collection and aid the
selection of models for cancer therapy studies as well
as for biomarker investigations.