Detection of somatic mutations: A data mining and a computational approach
Presenter: Huy Vuong, PhD
Department of Biomedical Informatics, Vanderbilt University
5/3/2013

Somatic single nucleotide variants (sSNV)
• Play a major role in tumorigenesis and cancer development
• Aim 1: Literature mining
  • Catalogue of Somatic Mutations In Cancer (COSMIC): the most comprehensive catalogue today
• Aim 2: Tumor-specific mutations in tumor-normal pairs
[Figure: Mutations in COSMIC by release: V1 (2004), V60 (7/2012), V61 (9/2012), V62 (11/2012); value labels: 745,924; 405,271; 340,585; 10,647]

Classes of somatic mutations
• Point mutation:
  • Coding
    • Silent
    • Missense
    • Nonsense
  • Noncoding (UTR, ncRNA, miRNA…)
  • Intronic
  • Intergenic
• Small scale mutation:
  • Small insertions
  • Small deletions
• Large scale mutation: rearrangements
  • Intrachromosomal
    • Deletion
    • Inversion
    • Duplication
  • Interchromosomal
    • Translocation
    • Insertion

Aim 1: Mining COSMIC For Protein Domain Interaction

History of COSMIC
The Evolution of the Cosmos started with the Big Bang!
http://en.wikipedia.org/wiki/Big_Bang

Yet another COSMIC
• History of the Catalogue Of Somatic Mutations In Cancer (Wellcome Trust Sanger Institute)
[Figure: timeline from V1 (2004) to V64 (2013)]

Comparison V1 vs. V64
[Table: COSMIC V1 (4th February, 2004) vs. COSMIC V64 (26th March, 2013); rows: Genes, Mutations, Tumours; extracted values: 913,166; 424,394; 10,647; 847,698; 57,444]

Advantages and Disadvantages
Advantages:
• Bimonthly updates
• Manually curated data; low-quality data removed
• Consistent vocabulary (histology and tissue)
• Each mutation maps to a single version of the gene (no alternative splicing)
• Free availability
Disadvantages:
• Curation bias
• Many positive results, few negative results
• Other quality issues: experimental error, missing mutations
• Interpretation of mutation frequency

Typical workflow
[Figure: typical workflow; panel labels: Histogram, Distribution]

Specific aims
• Map somatic mutations (SM) in COSMIC to protein structural models
• Identify SM in the pocket regions of proteins
• Use statistical analysis to score SM in the context of cancer (specificity, sensitivity)

Dataset and preprocessing step
• Data were downloaded from COSMIC version 62 via the Biomart interface as a TSV file (http://cancer.sanger.ac.uk/biomart/martview/)
• R was used to clean the data (i.e. remove duplicates) and import them into a SQLite database
• The database contained 776,917 mutations and 15 variables:
  1. Gene.Name
  2. CDS.Mutation.Syntax
  3. AA.Mutation.Syntax
  4. Zygosity
  5. Primary.Site
  6. Primary.Histology
  7. In.Cancer.Census
  8. Tumour.Source
  9. Genomic.Coordinates.GRCh37
  10. CDS.Mutation.Type
  11. AA.Mutation.Type
  12. Somatic.status
  13. Validation.status
  14. Entrez.Gene.ID
  15. COSMIC.Sample.ID

Protein pocket region
• Li et al. developed an algorithm to identify functional pocket regions in proteins
The vast majority of disease-associated SNPs are located in pockets (Tseng and Li, PNAS, 2011).

A case study: KRAS
About 64% of SM in KRAS are located in the functional pocket region.
Yu et al. (Nature Biotechnology, 2012) also reported that about 65% of disease-associated in-frame mutations are located on the interaction surfaces of proteins associated with the diseases.

Aim 2: Tumor-specific mutations in tumor-normal pairs

Outline
• Challenges in detecting somatic single nucleotide variants (sSNV)
• GATK pipeline for calling sSNV
• Installing and running MuTect
• MuTect output
• Summary

Detecting sSNV in cancer: challenge #1
Many sSNV occur at very low frequency in the genome (0.1 to 100 mutations per megabase).
Slide adapted from Mike Lawrence, TCGA Annual Symposium

Detecting sSNV in cancer: challenge #2
[Figure panel: C. Tri-clonal tumor]
Tumors are impure (i.e. contain contaminating normal cells) and heterogeneous (i.e. contain sub-clones).
Slide adapted from Christopher Miller, TCGA Annual Symposium, and Elaine Mardis

GATK pipeline
GATK Best Practices: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

NGS: Resources
• SEQanswers (http://seqanswers.com/)
• SEQanswers software list (http://seqanswers.com/wiki/Software/list)
• Galaxy (https://main.g2.bx.psu.edu/)
• NGS Catalog (http://bioinfo.mc.vanderbilt.edu/NGS/)
Slide adapted from Peilin Jia, PhD

Two types of error
• USER ERRORS:
  • Due to a wrong command line or incorrect user input files
  • Please do not post this error to the GATK forum
• RUNTIME ERRORS:
  • Due to the program code
  • Do post this error to the GATK forum (together with the trace file)

USER ERROR
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 2.2-25-g2a68eab):
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/vuongh/Lungevity_Project/GATK/bwa/13_karosorted_RG_MarkDup_Realigned_Recal.bam} is malformed: read starts with deletion. Cigar: 9D18M15I38M26S. Although the SAM spec technically permits such reads, this is often indicative of malformed files.
    If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar

BEST OF RUNTIME ERROR
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 2.4-7g5e89f01):
    ##### ERROR
    ##### ERROR Please visit the wiki to see if this is a known problem
    ##### ERROR If not, please post the error, with stack trace, to the GATK forum
    ##### ERROR Visit our website and forum for extensive documentation and answers to
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: START (0) > (-1) STOP -- this should never happen -- call Mauricio!

MuTect: a highly sensitive and specific sSNV caller
• Distinct features
  • Focuses on identifying low-allelic-fraction mutations arising from tumor heterogeneity, contaminating normal cells, and sub-clones
  • A Bayesian model with allelic fraction as a parameter yields high sensitivity
  • A carefully tuned, elaborate set of filters yields high specificity

Overview of the detection of a somatic point mutation using MuTect
[Figure: MuTect workflow; stage labels: Bayesian model, Variant Filter, Panel of Normal Filter]
Cibulskis, K. et al. Nat Biotechnology (2013). doi:10.1038/nbt.2514

Benchmarking mutation-detection methods
Advantages:
• High sensitivity at low allelic fraction (f=0.1)
• High specificity achieved by filters
Cibulskis, K. et al. Nat Biotechnology (2013). doi:10.1038/nbt.2514

Filter options
• Proximal gap
• Poor mapping
• Triallelic site
• Strand bias
• Clustered position
• Observed in control
• Panel of normal samples
[Figure with panels labelled "Good" and "Bad"; source: Jia et al. PLoS ONE 7(6): e38470]

Installing MuTect
• Installation (Linux)
  • Version 1.1.4 is available for download at http://www.broadinstitute.org/cancer/cga/mutect_download (you must register an account at Broad)
  • It can also be built from source, available for download at http://www.nature.com/nbt/journal/v31/n3/extref/nbt.2514-S3.zip

Preparing input
• Resources:
  • COSMIC VCF file: use b37_cosmic_v54_120711.vcf
  • dbSNP VCF file: use dbsnp_132_b37.leftAligned.vcf.gz
  • Human reference fasta: downloaded from the GATK reference bundle; use Homo_sapiens_assembly19.fasta and its *.fai and *.dict files
• Inputs:
  • Tumor BAM file and matched normal BAM file from read alignment tool output (e.g. BWA, Tophat)
  • BAM files need to be sorted and indexed
  • Recommendation: perform local realignment around indels and mark PCR duplicates, following the GATK Best Practices for variant detection

Running MuTect
• Command line with all default parameters:
    java -Xmx4g -jar /scratch/vuongh/mutect_latest/muTect-1.1.4.jar \
      --analysis_type MuTect \
      --reference_sequence /ref/Homo_sapiens_assembly19.fasta \
      -cosmic /ref/hg19_cosmic_v54_120711.vcf \
      -dbsnp /ref/dbsnp_132_b37.leftAligned.vcf \
      --input_file:normal /Huy-RNAseq/1/accepted_hits.sorted.RG.bam \
      --input_file:tumor /Huy-RNAseq/2/accepted_hits.sorted.RG.bam \
      --out /out/1_2_cal_stats.out \
      --vcf /out/1_2_mutation.vcf \
      -cov /out/1_2_coverage.wig.txt \
      --enable_extended_output
Notes:
• Put all resource files (COSMIC, dbSNP and reference fasta) in folder ref
• Normal BAM file and index go in folder 1, tumor BAM file and index in folder 2
• Output call stats and VCF file of mutation candidates go in folder out

Result
• Test data: RNA-seq data from squamous cell lung cancer patients (tumor/normal pair)
• Total run time: 6 hours on 8 Intel Nehalem CPUs (2.4 GHz), processing 65.1 million reads per sample
• View the result with Excel

Example of MuTect output
contig  position  ref_allele  alt_allele  t_lod_fstar  tumor_f   contaminant_lod  judgement  failure_reasons
1       14470     G           A           8.631487     0.272727  -0.096458        REJECT     normal_lod,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq
1       14542     A           G           4.993144     0.076923  -0.228097        REJECT     fstar_tumor_lod,possible_contamination,normal_lod,alt_allele_in_normal
1       14574     A           G           4.82618      0.071429  -0.245647        REJECT     fstar_tumor_lod,possible_contamination
1       14653     C           T           137.966026   0.714286  -0.429894        REJECT     normal_lod,alt_allele_in_normal
1       14673     G           C           5.07638      0.030769  2.317242         REJECT     fstar_tumor_lod,possible_contamination,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq
1       139393    G           T           8.97833      0.3       -0.087734        KEEP
1       788867    C           T           7.335518     0.285714  -0.061414        KEEP
1       1321326   C           G           7.495658     0.333333  -0.052641        KEEP
1       1498692   T           C           6.681093     0.2       -0.087736        KEEP
1       1498813   T           C           6.706235     0.166667  -0.105281        KEEP
Keep: 1143 (0.5%); Reject: 213000 (99.5%)

Distribution of keep versus reject calls
[Figure: density plot with cutoff threshold = 6.3; y-axis: density]
• Most reject calls are high-allelic-fraction sSNV
• Most of the low-allelic-fraction sSNV are kept
• Mono-clonal ???
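The keep/reject tally reported with the MuTect output can be reproduced from the call stats file with a few lines of Python. The following is a minimal sketch, not part of the original workflow. It assumes the file is tab-delimited, that any comment lines start with `#`, and that the verdict column is named `judgement` as in the example output. The helper names `tally_judgements` and `keep_fraction` are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def tally_judgements(lines: Iterable[str]) -> Counter:
    """Count calls per value of the 'judgement' column (e.g. KEEP, REJECT)."""
    # Assumption: comment lines start with '#' and the first remaining
    # line is the tab-delimited header row.
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return Counter(row["judgement"] for row in reader)


def keep_fraction(counts: Counter) -> float:
    """Fraction of all calls that were judged KEEP (0.0 if there are no calls)."""
    total = sum(counts.values())
    return counts["KEEP"] / total if total else 0.0


# Three rows taken from the example output (subset of columns);
# the leading comment line is a hypothetical placeholder.
example = [
    "## hypothetical comment line",
    "contig\tposition\tref_allele\talt_allele\tt_lod_fstar\ttumor_f\tjudgement",
    "1\t14542\tA\tG\t4.993144\t0.076923\tREJECT",
    "1\t788867\tC\tT\t7.335518\t0.285714\tKEEP",
    "1\t1498813\tT\tC\t6.706235\t0.166667\tKEEP",
]
counts = tally_judgements(example)
print(counts["KEEP"], counts["REJECT"])
```

For a real run, the same function could be applied to an open file handle, for example `tally_judgements(open("/out/1_2_cal_stats.out"))` with the output path used in the command line above.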
[x-axis label of the density plot of keep versus reject calls: Allelic fraction f]

Variant annotation (Annovar)
Chr  Start     End       Ref  Alt  Effect             Variant annotation
1    9833381   9833381   G    A    nonsynonymous SNV  CLSTN1:NM_014944:exon2:c.C163T:p.L55F,CLSTN1:NM_001009566:exon2:c.C163T:p.L55F,
1    11090294  11090294  A    T    stopgain SNV       MASP2:NM_006610:exon10:c.T1236A:p.C412X,
1    12475169  12475169  G    C    nonsynonymous SNV  VPS13D:NM_018156:exon63:c.G11985C:p.L3995F,VPS13D:NM_015378:exon64:c.G12060C:p.L4020F,
1    12628426  12628426  C    G    nonsynonymous SNV  DHRS3:NM_004753:exon6:c.G852C:p.E284D,
1    15988104  15988104  C    T    nonsynonymous SNV  RSC1A1:NM_006511:exon1:c.C1741T:p.L581F,
1    21940577  21940577  A    T    nonsynonymous SNV  RAP1GAP:NM_001145657:exon9:c.T297A:p.H99Q,RAP1GAP:NM_001145658:exon8:c.T489A:p.H163Q,RAP1GAP:NM_002885:exon8:c.T297A:p.H99Q,
1    22186457  22186457  G    A    stopgain SNV       HSPG2:NM_005529:exon41:c.C5053T:p.R1685X,
1    24019099  24019099  C    G    nonsynonymous SNV  RPL11:NM_000975:exon2:c.C7G:p.Q3E,
1    24019100  24019100  A    C    nonsynonymous SNV  RPL11:NM_000975:exon2:c.A8C:p.Q3P,
1    26900691  26900691  G    A    synonymous SNV     RPS6KA1:NM_002953:exon22:c.G2207A:p.X736X,RPS6KA1:NM_001006665:exon21:c.G2234A:p.X745X,
Displaying 10 out of 432 genes

Summary
• MuTect is a highly sensitive and specific tool for somatic SNV calling
• Designed to detect low-allelic-fraction somatic mutations present in as few as 10% of cancer cells
• Easy to install and run on all operating systems
• Works on all NGS data
• Limitations:
  • Computationally intensive
  • Cannot call indels

THANK YOU