* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Mutation detection using whole genome sequencing
Saethre–Chotzen syndrome wikipedia , lookup
Zinc finger nuclease wikipedia , lookup
Mitochondrial DNA wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Extrachromosomal DNA wikipedia , lookup
BRCA mutation wikipedia , lookup
United Kingdom National DNA Database wikipedia , lookup
Genealogical DNA test wikipedia , lookup
Copy-number variation wikipedia , lookup
Deoxyribozyme wikipedia , lookup
SNP genotyping wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Cancer epigenetics wikipedia , lookup
Epigenomics wikipedia , lookup
Comparative genomic hybridization wikipedia , lookup
History of genetic engineering wikipedia , lookup
Human genome wikipedia , lookup
DNA sequencing wikipedia , lookup
Pathogenomics wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Non-coding DNA wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Microsatellite wikipedia , lookup
Cell-free fetal DNA wikipedia , lookup
Microevolution wikipedia , lookup
Helitron (biology) wikipedia , lookup
Bisulfite sequencing wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Frameshift mutation wikipedia , lookup
Genome evolution wikipedia , lookup
Genome editing wikipedia , lookup
Human Genome Project wikipedia , lookup
Point mutation wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Exome sequencing wikipedia , lookup
Oncogenomics wikipedia , lookup
Designer baby wikipedia , lookup
Genomic library wikipedia , lookup
Mutation detection using whole genome sequencing 2016 Winter School in Mathematical and Computational Biology Ann-Marie Patch 4th July 2016 Mutation cataloguing aids understanding of cancer genetics International Cancer Genome Consortium projects Our Projects Ovarian Pancreatic Melanoma ICGC patient summary of mutations identified ICGC data portal: https://dcc.icgc.org/ Chromosomes Coding small mutations with amino acid change SNP array track that shows copy number gain in red and loss in green and regions of loss of heterozygosity Structural variants in centre Circos, Krzywinski et al 2009 Mutation Detection success depends on previous steps Generalised sequencing workflow Sample preparation •DNA •RNA •miRNA •BAC Library preparation •Fragmentation •Size selection • Target enrichment •Indexing Sequence generation •Platform • Sequence length Initial data processing Data analysis •Base calling • Quality assessment • De novo assembly • Alignment • Mutation detection •Annotation • Biological interpretation Whole genome paired-end sequencing process recap - library preparation Genomic DNA Fragmented DNA Whole genome paired-end sequencing process recap - library preparation Genomic DNA Fragmented DNA Clean-up DNA fragments Consistent fragment size distribution Whole genome paired-end sequencing process recap - library preparation Clean-up DNA fragments Adaptors added Sequence reads produced from both ends of each fragment The distance from the ends of the reads should follow the DNA size distribution HiSeq ~300 bp HiSeq XTen ~ 400 Library production is key for successful mutation detection FFPE (formalin fixed paraffin embedded) DNA samples have a high degree of fragmentation. This produces a shorter TLEN so think about the read length of the sequencing or you could end up paying to sequence the adapters Fragment length median 150bp Adapter 1 Adapter 2 Read 1 Read 2 HiSeq X Ten sequencing 2x 150bp Paired-end sequence alignment to a reference genome Reference genome I II Coverage depth I I Paired-end sequences mapped to genome Examining how the mapping position and content of the pairs of reads vary across the reference genome allows us to determine mutations and structural rearrangements II Cancer sample sequencing involves the parallel analysis of at least two samples for each patient Inherited genome sample Germline variants Seen in both samples Germline Tumour genome sample Data Analysis • Mutation detection •Annotation • Biological interpretation Tumour This sample contains a mixed population of normal and tumour cells Also subpopulations of different tumour cells Somatic mutations Specific to tumour sample Two independent, but matched, sequencing datasets aligned to reference genome I II I I II Normal/Germline DNA: Tumour DNA: Aim: to identify the changes that only occur in the tumour cells Detection software pinpoints differences in your sample compared to a reference I II I I Germline SNV II Normal/Germline DNA: ** ** ** ** ** * Somatic SNV Mutations are recorded as positional information and read counts generated by detection software Somatic translocation Somatic deletion Somatic amplification Choosing mutation calling software Choice can be guided by Type of data The biological question Available computing resources Past experience Related literature QIMR DNA mutation detection • Substitutions • qSNP – QIMR • GATK – Haplotype caller – Broad • Small insertions and deletions • Pindel - Sanger • GATK – Haplotype caller - Broad • Large structural variations • qSV – QIMR • Copy number aberrations • Titan – Shah (U of British Columbia) not up to date http://seqanswers.com/wiki/Software/list Visualising a germline single nucleotide variant Paired-end HiSeq data for Ovarian Cancer patient Chromosome 11 Grey blocks base pairs matching reference (100bp) Small coloured blocks indicate a change from the reference Tumour data The reference base is a G Germline data There is an A present in some of the reads (the green colour) Reference sequence Robinson et al 2011 Considering error and bias in mutation calling Diploid organism expected bi-allelic proportion 50% (ratio 1:1) Allele proportions Sample Coverage Reference allele % Alternate allele % Other allele % Bi-allelic ratio Germline 60 G=43% A=55% T=2% 1:1.3 Tumour 56 G=64% A=36% - 1:0.56 Sequencing error Highly skewed representation in tumour samples Low frequency variants may be clinically relevant 5 bp germline deletion Normal control sample Metastasis 1 Metastasis 2 BRCA2 exon 9 BRCA2 exon 10 Ovarian cancer patient with a deleterious germline BRCA2 deletion Six deletion reversion mutations identified within BRCA2 from a single rapid autopsy case 5 bp germline deletion Normal control sample Low frequency reversion deletions High frequency reversion deletion Metastasis 1 Metastasis 2 BRCA2 exon 9 Evidence of 2 exon deletion BRCA2 exon 10 Different reversion deletions could be identified in differing proportions at multiple metastasis sites Events 1-3 and 9 are found in many abdominal deposits where as 5, 7 and 14 are only identified in one Patch et al Nature 2015 Sources of bias in sequencing data Changes in expected proportions can be due to: Sample purity/integrity and heterogeneity Stochastic sampling/low coverage depth Capture or enrichment bias Alignment/mapping strategy Sequencing error How should we determine a good call from error? Ensure coverage is not a limiting factor In germline sequencing most homozygous SNVs are detected at a 15X average depth but an average depth of 33X is required to reproducibly detect the same proportion of heterozygous SNVs. Bentley, et al. Nature (2008). Cancer projects routinely use 30X for normal control and 60X for tumour Uninformative reads impact on the final coverage and ultimately the mutation detection sensitivity Sims et al Nat Rev Genet. 2014 Filtering of results from mutation detection tools is necessary Example for sample purity = 64% Raw somatic Filtered somatic Raw Germline Filtered Germline qSNP 298,388 6,632 4,180,630 3,698,034 GATK 224,839 9,722 4,945,990 4,069,314 Expected number of somatic mutations ~6,000 ( 2 per Mb) And Germline variants ~3,000,000 ( 1000 per Mb or 0.1%) qSNP GATK in-house, rules-based heuristic tool sensitive (Kassahn et al 2013) (unified genotyper) a Bayesian tool (McKenna et al 2010) Filtering of results from mutation detection tools is necessary Example for sample purity = 64% Raw somatic Filtered somatic Raw Germline Filtered Germline qSNP 298,388 6,632 4,180,630 3,698,034 GATK 224,839 9,722 4,945,990 4,069,314 report between 2-4% qSNP GATK report between 84-88% in-house, rules-based heuristic tool sensitive (Kassahn et al 2013) (unified genotyper) a Bayesian tool (McKenna et al 2010) Strategy for identifying somatic substitution mutations Control of quality of variant calls through input filtering mapping quality for reads >10 maximum number of mismatches in read <=3 minimum consecutive matched bases in a read >=34 duplicate reads removed Tumour data Germline data Somatic variant Tumour T=63% C=37% Normal T=100% Strategy for identifying somatic substitution mutations Control of quality of variant calls through input filtering mapping quality for reads >10 maximum number of mismatches in read <=3 minimum consecutive matched bases in a read >=34 duplicate reads removed Tumour data Germline data Somatic variant Tumour T=63% C=37% Normal T=100% Somatic variant calls are made when the minimum number of reads with the variant minimum coverage in tumour and normal sample maximum variant count for a given coverage in the matched normal threshold proportion of variant call qualities at that position Strategy for identifying somatic substitution mutations Control of quality of variant calls through input filtering mapping quality for reads >10 maximum number of mismatches in read <=3 minimum consecutive matched bases in a read >=34 duplicate reads removed Tumour data Germline data Somatic variant Tumour T=63% C=37% Normal T=100% Somatic variant calls are made when the minimum number of reads with the variant minimum coverage in tumour and normal sample maximum variant count for a given coverage in the matched normal threshold proportion of variant call qualities at that position Potential weakness in calls annotated Variant seen in unfiltered bam of matched normal Position of variant within 5 bp of ends of reads Variant not seen in sequencing reads of both directions Variant seen in germline of another patient Number of novel starts for reads supporting variant is low Position of variant in relation to repetitive sequences Detection – examination – verification - modify We have used a cyclical feedback approach to inform the filtering strategy and improve our mutation calling Detect mutations Identify patterns and modify filtering strategies Examine Manual IGV review Independent Verification •PCR and capillary sequencing •PCR and deep MiSeq sequencing •SOLiD sequencing •mRNA sequencing This approach has been key for the detection of small insertions and deletions as sequencing errors and alignment biases are often exaggerated for indels Large genomic structural variants need different detection strategies Ovarian cancer genomes have high instability and are highly rearranged Structural variants underlie copy number changes Low resolution Deletion Reference Sample Spectral Karyotype from HGS OvCa Cell line Ouellet et al 2008 BMC Cancer Duplication/Insertion Translocation There are 4 main methods for SV detection in WG sequencing Most well known tools only use one detection method but there are a few multi-method tools now available Alkan, Coe and Eichler 2011 Visualising structural variants Sub microscopic homozygous deletion in a tumour sample Chromosome 13: 1.3Kb somatic deletion including exon 17 of RB1 gene Tumour Germline Robinson et al 2011 Discordantly mapped read pairs mark rearrangements >1.3kb insert size reference Read pairs too far apart Tumour Read pairs too close together Germline Read pairs in wrong orientation Detection tools identify clusters of read pairs with similar characteristics e.g. BreakDancer Chen et al 2009 Insert size estimation is key for detection with discordantly mapped read pairs DNA fragment size distribution Paired-end reads Insert size depends on DNA fragmentation step ~300 bp Aligned pairs insert size Log count 300bp median Typical aligned read-pair insert size distribution visualised by qProfiler Normally mapped reads Base pairs Changes in coverage support rearrangements Tumour Germline Clear drop in coverage over the region in the tumour sample Allele frequency CN by Coverage Changes in coverage can be interpreted as copy number and can mark rearrangement breakpoints Deletion Fewer reads mapped More reads mapped Genomic position Titan (Ha et al 2014 Genome Res) Tools are available that identify copy number variants from read depth partitioning and GC content and mapability correction plus allele frequency analysis Clusters of soft clipping indicate rearrangement break points Alignment software that performs soft clipping can reveal exact positions of the break points Split reads and assembled contigs reveal microhomology Further realignment of the clipped sequences reveals split reads Reads with soft clipping and unmapped reads can be assembled into contigs that span breakpoints Patterns of microhomology can be obtained from these data CREST Wang et al 2012 qSV : Detecting Somatic Structural Variants qSV detects 3 types of supporting evidence Resolves all lines of evidence to identify breakpoints to base pair resolution Felicity Newell http://sourceforge.net/projects/adamajava/ Associating structural variants with proximal genes Structural variants break points are annotated with genes features SVs may promote tumour development Oncogenes can be amplified by rearrangements resulting in gain of function Amplification of HER2 (ERBB2) in pancreatic adenocarcinoma Gene 1 Gene 2 Duplication of Gene 2 © QIMR Berghofer Institute | 38Med ChouMedical et alResearch 2013 Genome Cancer molecular subtyping with cohort studies Take a group of samples with the same disease and look for the same gene/pathway being altered - Molecular subtyping Disease specific cohort Single patient X 100’s 80% 15% 4% 1% Cancer molecular subtyping with cohort studies Molecular subtyping can be performed using, and by integrating, different data sources Mutations Gene expression Methylation Copy number Mutational signatures Structural rearrangements 450 pancreatic cancers 100 WGS pancreatic cancers Waddell et al Nature 2015 Bailey et al Nature 2016 Pan-cancer molecular subtyping The Cancer Genome Atlas Pan-Cancer analysis project Nature Genetics 45, 1113–1120 (2013) Pan-Cancer analysis to identify common molecular features of tumours International consortia make pan-cancer studies possible extending the cohort approach Disease specific cohort Single patient X 100’s X 1000’s Multiple cohorts Pan-cancer studies are indicating that existing treatment options can be repurposed for other cancer types Personalised treatment selection Acknowledgements: Medical genomics: Genome informatics: Nicola Waddell Katia Nones Stephen Kazakoff John Pearson Conrad Leonard Oliver Holmes Qinying Xu Scott Wood Andrew Biankin David Chang Peter Bailey Jianmin Wu Jeremy Humphris Mark Pinese Angela Chou Mark Cowley Sean Grimmond Amber Johns Anthony Gill Scott Mead Skye Simpson Marc Jones David Bowtell Dariush Etemadmoghadam Elizabeth Christie Dale Garsed Joshy George Sian Fereday Laura Galletta Kathryn Alsop Nadia Traficante Anna deFazio Catherine Kennedy Yoke-Eng Chiew Jillian Hung APGI collaborators http://www.pancreaticcancer.net.au/apgi/collaborators Including: John Fawcett, O’Rourke, Andrew Barbour, Henry Tang, Kelly Slater, Nik Zeps Australian Government National Health and Medical Research Council Clinicians and patients Thank you Email: [email protected] www.qimrberghofer.edu.au