Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to Variant Analysis of Exome- and Amplicon sequencing data Lecture by: Date: Training: Extended version see: Dr. Christian Rausch 29 May 2015 TraIT Galaxy Training http://tinyurl.com/o74uehq Focus of lecture and practical part • Lecture: from NGS data to Variant analysis • Hands on training: we will analyze NGS read data of a panel of cancer genes (Illumina TruSeq Amplicon - Cancer Panel) of prostate cancer cell line VCaP. • Analysis software tools will be run interactively through Galaxy, “a web-based platform for data intensive biomedical research” Image source: A survey of tools for variant analysis of next-generation genome sequencing data, Pabinger et al., Brief Bioinform (2013) doi: 10.1093/bib/bbs086 Workflow: QC & Mapping reads Input reads (fastq files) Quality check with FastQC Not OK? Quality- & Adaptertrimming OK? Map reads to reference genome using e.g. BWA or Bowtie2 Sort by coordinates using SAMtools sort or PicardTools SortSam Output: Sorted BAM file (binary SAM sequence alignment map) Variant Calling & Annotation pipeline Reads mapped to reference genome SAM or BAM file Filter variant allele frequency Discard variants with variant allele frequency below threshold SAMtools Mpileup • Analyze mismatches & compute likelihoods of SNP etc. Slice VCF Cut vcf file to retain only the regions that were enriched for sequencing (= discard regions covered by off target reads Varscan2 • does the actual calling • Output: VCF • Various statistics on quality of each variant (read depth etc.), homozygous/heterozygous etc. ANNOVAR: Annotate SNPs with: • statistics (dbSNP, 1000 genomes etc.) and • predictions (SIFT, PholyPhen etc.) DGIdb: Drug Gene Interaction Database Find drugs to diseases arising from gene mutations standard format to sequencing reads with quality information (‘Q’ stands for Quality) @unique identifiers and optional descriptions of the sequence the actual DNA sequence + separator optionally followed by description The quality values of the sequence (one character per nucleotide) More info see wikipedia ‘FASTQ_format’ standard format to store sequence data (DNA and protein seq.) >FASTA header, often contains unique identifiers and descriptions of the sequence • format Read Quality Control with FastQC Examples of per base sequence quality of all reads Not so good, might still be usable, depending on application 50 bp Historical example of very first Solexa reads 2006 (Solexa acquired by Illumina 2007) “Mapping reads to the reference” is finding where their sequence occurs in the genome 100 bp identified 200 – 500 bp unknown sequence 100 bp identified Source: Wikimedia, file:Mapping Reads.png “Mapping reads to the reference”: naïve text search algorithms are too slow • Naïve approach: compare each read with every position in the genome – Takes too long, will not find sequences with mismatches • Search programs typically create an index of the reference sequence (or text) and store the reference sequence (text) in an advanced data structure for fast searching. • An index is basically like a phone book (with addresses) Quickly find address (location) of a person Example of algorithm using ‘indexed seed tables’ to quickly find locations of exact parts of a read Read Mapping: General problems • Read can match equally well at more than one location (e.g. repeats, pseudo-genes) • Even fit less well to it’s actual position, e.g. if it carries a break point, insertions and/or deletions Example from: genetics.stanford.edu/gene211/lectures/Lecture3_Resequencing_Functional_Genomics-2014.pdf Variant Calling & Annotation Possible reasons for a mismatch • True SNP • Error generated in library preparation • Base calling error – May be reduced by better base calling methods, but cannot be eliminated • Misalignment (mapping error): – Local re-alignment to improve mapping • Error in reference genome sequence Partly from www.biostat.jhsph.edu/~khansen/LecSNP2.pdf Variant Calling: Principles • Naïve approach (used in early NGS studies): – Filter base calls according to quality – Filter by frequency – Typically, a quality Filter of PHRED Q 20 was used (i.e., probability of error 1% ). – Then, the following frequency thresholds were used according to the frequency of the non-ref base, f(b): – The frequency heuristic works well if the sequencing depth is high, so that the probability of a heterozygous nucleotide falling outside of the 20% - 80% region is low. – Problems with frequency heuristic: • For low sequencing depth, leads to undercalling of heterozygous genotypes • Use of quality threshold leads to loss of information on • individual read/base qualities • Does not provide a measure of confidence in the call In parts from: compbio.charite.de/contao/index.php/genomics.html Variant Calling: Principles • Today’s Variant Callers rely on probability calculations • Use of Bayes’ Theorem: – E.g. MAQ: One of the first widely used read mappers and variant callers • Takes into account a quality score for whole read alignment & quality of base at the individual position • Calls the most likely genotype given observed substitutions • Reliability score can be calculated Variant Calling & Annotation: Popular Tools • • • • • SAMtools (Mpileup & Bcftools) GATK Varscan2 Freebayes MAQ VCF = Variant Call Format • Variant Call Format / BCF = binary version ANNOVAR a Swiss Knife to annotate genetic variants (SNPs and CNVs) • Input: – variants as VCF file – various databases with statistical and predictive annotations: dbSNP, 1000 genomes, … • Output: – In coding region? Which gene? How frequently observed in 1000 genomes project? (and more statistics). • According to coordinates from RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes or others, etc. – In non-coding region? Conserved region? • According to conservation in 44 species, transcription factor binding sites, GWAS hits, etc. – Predicted Effect? • Score of SIFT, PolyPhen-2, GERP++ Annotation with DGIdb: mining the druggable genome Drug Gene Interaction Database: Matching disease genes with potential drugs. • Searches genes against a compendium of drug-gene interactions and identify potentially 'druggable' genes Practical part Import Workflow into Galaxy • Import the workflow 'Goecks Exome pipeline hg19_noDedup’ to your workflows by either – clicking on ‘Shared Data’ ‘Published Workflows’ Goecks Exome pipeline hg19_noDedup green plus symbol to import – Or click here: – http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/w/importedgoecks-exome-pipeline-hg19 – This workflow has been imported with small modifications from: http://www.usegalaxy.org/cancer Ref: Goecks et al., Cancer Med. 2014. Import data set • Import data by either – Clicking on: http://bioinfgalaxian.erasmusmc.nl/galaxy/u/crausch/h/vcap-variant-analysis – Or go to ‘Shared Data’ published histories VCaP Variant Analysis – Or download the data from http://tinyurl.com/pfmshlu, unzip and upload to Galaxy by clicking on Run Goeck’s Exome analysis pipeline on your data • Chose parameter Variant allele frequency: e.g. 10% • Chose the name of the data • The genomic regions (bed file) contains the locations of the exons / amplicons • Overview of the workflow: see next slide • Because the whole workflow runs about 20 min, you can import all results from: – http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/h/goecksexome-pipeline-hg19nodedupvcap • Before looking at the results of the variant analysis pipeline, check that our seq. data has good quality, using program FastQC (logically that would be the first one would do…) Overview of the Variant Analysis workflow Legend: see next slide Overview of the Variant Analysis workflow 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Input: 2 fastq files, of the forward and reverse reads. Make sure that sequencing adapters and parts of reads with low quality have been removed. Mapping of the reads to the reference genome hg19, output format: BAM Sorting: Reads are sorted according to their coordinate position on the genome Marking and Removing of ‘duplicate reads’: Reads with the identical position on the genome are likely duplicates created during the PCR amplification step. Exome sequencing typically relies on hybridization-based selection of genomic shared DNA fragments in the so-called target regions. Because the DNA sharing step is (assumed to be) a random process, reads starting at exactly the same position are more likely to be due to PCR amplification than to originate from two independent fragments starting at the same position. Note: this step is omitted when processing Amplicon sequencing reads, because reads of a given amplicon all start at the same position but can be copies of different original templates. Summary of Alignment Statistics MPileup: Variant counts per position and statistics (1st step of variant calling) Varscan2: variant calling with program Varscan 2. Check: different analysis types are possible. In this course: “Analysis type: single nucleotide variation” is selected. Output format: VCF (variant call file). Label and filter out variants with low Variant Allele frequency. Slice VCF: discard all genetic regions except the exons (defined in the input BED file). ANNOVAR: filter and annotate variants. Annotate with DGI db (Drug Gene Interaction database). Variant filtering and visualization • Open the Annovar output in a new tab • Go with the mouse over the lines of the VCF file and open the file in trackster (click on bar chart symbol) • In trackster, load the sorted BAM file as additional track • Choose one mutation that is annotated by the last annotation step, DGI. What was done in the last filter step before DGI was run? • Read all annotations that the mutation has that you have selected • Browse to this genomic location in Trackster. What is the coverage? Is this variant reliably covered?