Download Mutation detection using whole genome sequencing

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Saethre–Chotzen syndrome wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

Epistasis wikipedia , lookup

Mitochondrial DNA wikipedia , lookup

Cre-Lox recombination wikipedia , lookup

Extrachromosomal DNA wikipedia , lookup

BRCA mutation wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

Genealogical DNA test wikipedia , lookup

Copy-number variation wikipedia , lookup

Deoxyribozyme wikipedia , lookup

SNP genotyping wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Cancer epigenetics wikipedia , lookup

Epigenomics wikipedia , lookup

Comparative genomic hybridization wikipedia , lookup

NUMT wikipedia , lookup

History of genetic engineering wikipedia , lookup

Mutagen wikipedia , lookup

Human genome wikipedia , lookup

DNA sequencing wikipedia , lookup

Pathogenomics wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Non-coding DNA wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Microsatellite wikipedia , lookup

Cell-free fetal DNA wikipedia , lookup

Microevolution wikipedia , lookup

Helitron (biology) wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Frameshift mutation wikipedia , lookup

Genome evolution wikipedia , lookup

Genome editing wikipedia , lookup

Human Genome Project wikipedia , lookup

Point mutation wikipedia , lookup

Mutation wikipedia , lookup

Whole genome sequencing wikipedia , lookup

Exome sequencing wikipedia , lookup

Oncogenomics wikipedia , lookup

Designer baby wikipedia , lookup

Genomic library wikipedia , lookup

RNA-Seq wikipedia , lookup

Metagenomics wikipedia , lookup

Genomics wikipedia , lookup

Transcript
Mutation detection using whole genome
sequencing
2016 Winter School in Mathematical and
Computational Biology
Ann-Marie Patch
4th July 2016
Mutation cataloguing aids understanding of cancer genetics
International Cancer Genome Consortium projects
Our Projects
Ovarian
Pancreatic
Melanoma
ICGC patient summary of mutations identified
ICGC data portal: https://dcc.icgc.org/
Chromosomes
Coding small mutations with
amino acid change
SNP array track that shows copy
number gain in red and loss in
green and regions of loss of
heterozygosity
Structural variants in centre
Circos, Krzywinski et al 2009
Mutation Detection success depends on previous steps
Generalised sequencing workflow
Sample
preparation
•DNA
•RNA
•miRNA
•BAC
Library
preparation
•Fragmentation
•Size selection
• Target
enrichment
•Indexing
Sequence
generation
•Platform
• Sequence
length
Initial data
processing
Data
analysis
•Base calling
• Quality
assessment
• De novo
assembly
• Alignment
• Mutation
detection
•Annotation
• Biological
interpretation
Whole genome paired-end sequencing process recap
- library preparation
Genomic DNA
Fragmented DNA
Whole genome paired-end sequencing process recap
- library preparation
Genomic DNA
Fragmented DNA
Clean-up DNA fragments
Consistent fragment size distribution
Whole genome paired-end sequencing process recap
- library preparation
Clean-up DNA fragments
Adaptors added
Sequence reads produced from both
ends of each fragment
The distance from the ends of the reads
should follow the DNA size distribution
HiSeq ~300 bp
HiSeq XTen ~ 400
Library production is key for successful mutation
detection
FFPE (formalin fixed paraffin embedded)
DNA samples have a high degree of
fragmentation.
This produces a shorter TLEN so think about
the read length of the sequencing or you
could end up paying to sequence the
adapters
Fragment length median 150bp
Adapter 1
Adapter 2
Read 1
Read 2
HiSeq X Ten sequencing 2x 150bp
Paired-end sequence alignment to a reference genome
Reference genome
I
II
Coverage depth
I
I
Paired-end sequences
mapped to genome
Examining how the mapping position and content of the pairs of
reads vary across the reference genome allows us to determine
mutations and structural rearrangements
II
Cancer sample sequencing involves the parallel analysis of
at least two samples for each patient
Inherited genome sample
Germline variants
Seen in both samples
Germline
Tumour genome sample
Data
Analysis
• Mutation
detection
•Annotation
• Biological
interpretation
Tumour
This sample contains a mixed population of normal and tumour cells
Also subpopulations of different tumour cells
Somatic mutations
Specific to tumour
sample
Two independent, but matched, sequencing datasets
aligned to reference genome
I
II
I
I
II
Normal/Germline DNA:
Tumour DNA:
Aim: to identify the changes that only occur in the tumour cells
Detection software pinpoints differences in your sample
compared to a reference
I
II
I
I
Germline
SNV
II
Normal/Germline DNA:
**
**
**
**
**
*
Somatic
SNV
Mutations are recorded as positional
information and read counts
generated by detection software
Somatic
translocation
Somatic
deletion
Somatic
amplification
Choosing mutation calling software
Choice can be guided by
Type of data
The biological question
Available computing resources
Past experience
Related literature
QIMR DNA mutation detection
• Substitutions
• qSNP – QIMR
• GATK – Haplotype caller – Broad
• Small insertions and deletions
• Pindel - Sanger
• GATK – Haplotype caller - Broad
• Large structural variations
• qSV – QIMR
• Copy number aberrations
• Titan – Shah (U of British Columbia)
not up to date
http://seqanswers.com/wiki/Software/list
Visualising a germline single nucleotide variant
Paired-end HiSeq data for Ovarian Cancer patient
Chromosome 11
Grey blocks
base pairs
matching
reference
(100bp)
Small coloured
blocks indicate a
change from the
reference
Tumour data
The reference base
is a G
Germline data
There is an A
present in some of
the reads (the
green colour)
Reference sequence
Robinson et al 2011
Considering error and bias in mutation calling
Diploid organism
expected bi-allelic proportion 50% (ratio 1:1)
Allele proportions
Sample
Coverage
Reference
allele %
Alternate
allele %
Other
allele %
Bi-allelic ratio
Germline
60
G=43%
A=55%
T=2%
1:1.3
Tumour
56
G=64%
A=36%
-
1:0.56
Sequencing error
Highly skewed
representation
in tumour
samples
Low frequency variants may be clinically relevant
5 bp germline deletion
Normal control
sample
Metastasis 1
Metastasis 2
BRCA2 exon 9
BRCA2 exon 10
Ovarian cancer patient with a deleterious germline BRCA2 deletion
Six deletion reversion mutations identified within BRCA2 from
a single rapid autopsy case
5 bp germline deletion
Normal control
sample
Low
frequency
reversion
deletions
High frequency
reversion
deletion
Metastasis 1
Metastasis 2
BRCA2 exon 9
Evidence of 2
exon deletion
BRCA2 exon 10
Different reversion deletions could be identified in differing
proportions at multiple metastasis sites
Events 1-3 and 9 are found
in many abdominal
deposits where as 5, 7 and
14 are only identified in
one
Patch et al Nature 2015
Sources of bias in sequencing data
Changes in expected proportions can be due to:
Sample purity/integrity and heterogeneity
Stochastic sampling/low coverage depth
Capture or enrichment bias
Alignment/mapping strategy
Sequencing error
How should we determine a good call from error?
Ensure coverage is not a limiting factor
In germline sequencing most homozygous SNVs are detected at a 15X average depth but an
average depth of 33X is required to reproducibly detect the same proportion of heterozygous
SNVs.
Bentley, et al. Nature (2008).
Cancer projects routinely use 30X for normal control and 60X for tumour
Uninformative reads impact on the final coverage and ultimately the mutation detection
sensitivity
Sims et al Nat Rev Genet. 2014
Filtering of results from mutation detection tools is
necessary
Example for sample purity = 64%
Raw
somatic
Filtered
somatic
Raw
Germline
Filtered
Germline
qSNP
298,388
6,632
4,180,630
3,698,034
GATK
224,839
9,722
4,945,990
4,069,314
Expected number of somatic mutations ~6,000 ( 2 per Mb)
And Germline variants ~3,000,000 ( 1000 per Mb or 0.1%)
qSNP
GATK
in-house, rules-based heuristic tool sensitive (Kassahn et al 2013)
(unified genotyper) a Bayesian tool (McKenna et al 2010)
Filtering of results from mutation detection tools is
necessary
Example for sample purity = 64%
Raw
somatic
Filtered
somatic
Raw
Germline
Filtered
Germline
qSNP
298,388
6,632
4,180,630
3,698,034
GATK
224,839
9,722
4,945,990
4,069,314
report between 2-4%
qSNP
GATK
report between 84-88%
in-house, rules-based heuristic tool sensitive (Kassahn et al 2013)
(unified genotyper) a Bayesian tool (McKenna et al 2010)
Strategy for identifying somatic substitution mutations
Control of quality of variant calls through input filtering
mapping quality for reads >10
maximum number of mismatches in read <=3
minimum consecutive matched bases in a read >=34
duplicate reads removed
Tumour
data
Germline
data
Somatic variant
Tumour T=63% C=37%
Normal T=100%
Strategy for identifying somatic substitution mutations
Control of quality of variant calls through input filtering
mapping quality for reads >10
maximum number of mismatches in read <=3
minimum consecutive matched bases in a read >=34
duplicate reads removed
Tumour
data
Germline
data
Somatic variant
Tumour T=63% C=37%
Normal T=100%
Somatic variant calls are made when the
minimum number of reads with the variant
minimum coverage in tumour and normal sample
maximum variant count for a given coverage in the matched
normal
threshold proportion of variant call qualities at that position
Strategy for identifying somatic substitution mutations
Control of quality of variant calls through input filtering
mapping quality for reads >10
maximum number of mismatches in read <=3
minimum consecutive matched bases in a read >=34
duplicate reads removed
Tumour
data
Germline
data
Somatic variant
Tumour T=63% C=37%
Normal T=100%
Somatic variant calls are made when the
minimum number of reads with the variant
minimum coverage in tumour and normal sample
maximum variant count for a given coverage in the matched
normal
threshold proportion of variant call qualities at that position
Potential weakness in calls annotated
Variant seen in unfiltered bam of matched normal
Position of variant within 5 bp of ends of reads
Variant not seen in sequencing reads of both directions
Variant seen in germline of another patient
Number of novel starts for reads supporting variant is low
Position of variant in relation to repetitive sequences
Detection – examination – verification - modify
We have used a cyclical feedback approach to inform the filtering strategy and
improve our mutation calling
Detect mutations
Identify patterns and
modify filtering
strategies
Examine
Manual IGV review
Independent Verification
•PCR and capillary sequencing
•PCR and deep MiSeq sequencing
•SOLiD sequencing
•mRNA sequencing
This approach has been key for the detection of small insertions and deletions as
sequencing errors and alignment biases are often exaggerated for indels
Large genomic structural variants need different detection
strategies
Ovarian cancer genomes have high instability and are highly rearranged
Structural variants underlie copy number changes
Low resolution
Deletion
Reference
Sample
Spectral Karyotype from HGS OvCa Cell line Ouellet et al 2008 BMC Cancer
Duplication/Insertion
Translocation
There are 4 main methods for SV detection in WG sequencing
Most well known tools only use one detection method
but there are a few multi-method tools now available
Alkan, Coe and Eichler 2011
Visualising structural variants
Sub microscopic homozygous deletion in a tumour sample
Chromosome 13:
1.3Kb somatic deletion
including exon 17 of
RB1 gene
Tumour
Germline
Robinson et al 2011
Discordantly mapped read pairs mark rearrangements
>1.3kb insert size
reference
Read pairs too
far apart
Tumour
Read pairs too
close together
Germline
Read pairs in wrong
orientation
Detection tools identify clusters of read pairs with similar
characteristics e.g. BreakDancer Chen et al 2009
Insert size estimation is key for detection with
discordantly mapped read pairs
DNA fragment size distribution
Paired-end reads
Insert size depends on
DNA fragmentation step
~300 bp
Aligned pairs insert size
Log count
300bp median
Typical aligned read-pair insert size distribution
visualised by qProfiler
Normally mapped
reads
Base pairs
Changes in coverage support rearrangements
Tumour
Germline
Clear drop in coverage
over the region in the
tumour sample
Allele frequency
CN by Coverage
Changes in coverage can be interpreted as copy number and
can mark rearrangement breakpoints
Deletion
Fewer reads
mapped
More reads
mapped
Genomic position
Titan (Ha et al 2014 Genome Res)
Tools are available that identify copy number variants from read depth partitioning and GC
content and mapability correction plus allele frequency analysis
Clusters of soft clipping indicate rearrangement break points
Alignment software that performs soft clipping
can reveal exact positions of the break points
Split reads and assembled contigs reveal microhomology
Further realignment of the clipped sequences reveals split reads
Reads with soft clipping and unmapped reads can be assembled into contigs that span
breakpoints
Patterns of microhomology can be obtained from these data
CREST Wang et al 2012
qSV : Detecting Somatic Structural Variants
qSV detects 3 types of supporting evidence
Resolves all lines of evidence to identify breakpoints to base pair resolution
Felicity Newell
http://sourceforge.net/projects/adamajava/
Associating structural variants with proximal genes
Structural variants break points are annotated with genes features
SVs may promote tumour development
Oncogenes can be amplified by rearrangements resulting in gain of function
Amplification of HER2 (ERBB2) in
pancreatic adenocarcinoma
Gene 1
Gene 2
Duplication of Gene 2
© QIMR Berghofer
Institute | 38Med
ChouMedical
et alResearch
2013 Genome
Cancer molecular subtyping with cohort studies
Take a group of samples with the same disease and look for the same gene/pathway
being altered - Molecular subtyping
Disease specific cohort
Single patient
X 100’s
80%
15%
4%
1%
Cancer molecular subtyping with cohort studies
Molecular subtyping can be performed using, and by integrating, different data sources
Mutations
Gene expression
Methylation
Copy number
Mutational signatures
Structural rearrangements
450 pancreatic cancers
100 WGS pancreatic cancers
Waddell et al Nature 2015
Bailey et al Nature 2016
Pan-cancer molecular subtyping
The Cancer Genome Atlas Pan-Cancer analysis project
Nature Genetics 45, 1113–1120 (2013)
Pan-Cancer analysis to identify common molecular features
of tumours
International consortia make pan-cancer studies possible extending the cohort
approach
Disease specific cohort
Single patient
X 100’s
X 1000’s
Multiple cohorts
Pan-cancer studies are
indicating that existing
treatment options can be
repurposed for other
cancer types
Personalised
treatment
selection
Acknowledgements:
Medical genomics:
Genome informatics:
Nicola Waddell
Katia Nones
Stephen Kazakoff
John Pearson
Conrad Leonard
Oliver Holmes
Qinying Xu
Scott Wood
Andrew Biankin
David Chang
Peter Bailey
Jianmin Wu
Jeremy Humphris
Mark Pinese
Angela Chou
Mark Cowley
Sean Grimmond
Amber Johns
Anthony Gill
Scott Mead
Skye Simpson
Marc Jones
David Bowtell
Dariush Etemadmoghadam
Elizabeth Christie
Dale Garsed
Joshy George
Sian Fereday
Laura Galletta
Kathryn Alsop
Nadia Traficante
Anna deFazio
Catherine Kennedy
Yoke-Eng Chiew
Jillian Hung
APGI collaborators
http://www.pancreaticcancer.net.au/apgi/collaborators
Including: John Fawcett, O’Rourke, Andrew Barbour,
Henry Tang, Kelly Slater, Nik Zeps
Australian Government
National Health and Medical Research Council
Clinicians and patients
Thank you
Email:
[email protected]
www.qimrberghofer.edu.au