Tumor Normal v2.0 BaseSpace App Guide

For Research Use Only. Not for use in diagnostic procedures.

Introduction
Workflow Diagram
Set Analysis Parameters
Analysis Methods
Analysis Output
Revision History
Technical Assistance

ILLUMINA PROPRIETARY
Document # 15050950 v01
January 2016

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).

© 2016 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Introduction

The Tumor Normal v2.0 App detects somatic variants from a matched pair of tumor and normal samples. The Isaac Genome Alignment Software aligns both tumor and normal samples to a reference. The Isaac Variant Caller calls germline small variants, structural variants, and copy number abnormalities (CNAs). In addition, somatic small variants (called by Strelka), somatic structural variants, and somatic copy number variants (CNVs) are called in the tumor sample relative to its matched normal sample.

Compatible Libraries

See the BaseSpace support page for a list of library types that are compatible with the Tumor Normal App.

Workflow Requirements

This app supports the following:
• Human genomes.
• Samples with paired-end reads.
• BAM output files from Isaac Whole Genome Sequencing v4 AppResults as inputs.
• Read lengths that are greater than or equal to 32 bp. Read lengths are typically between 100 and 150 bases. Reads shorter than 50 bases result in a warning, while reads shorter than 32 bases cause an error in the app.
• Minimum normal sample data set size is 75 gigabases.
  • Approximately 375 million reads assuming 2 × 100.
  • Approximately 250 million reads assuming 2 × 150.
• Minimum tumor sample data set size is 150 gigabases.
  • Approximately 750 million reads assuming 2 × 100.
  • Approximately 500 million reads assuming 2 × 150.
• Maximum combined (normal+tumor) data set size is 650 gigabases.
  • Approximately 3.25 billion reads assuming 2 × 100.
  • Approximately 2.17 billion reads assuming 2 × 150.

This app does not support mate pair samples or other paired-end sequencing styles that are not forward-reverse.

Versions

The following components are used in the Tumor Normal App.

Software                     Version
Canvas (CNV Caller)          1.1.0.5
Isaac (Aligner)              iSAACSAAC00776.15.01.27
Isaac Variant Caller         starka-2.1.4.2
Isis (Analysis Software)     2.5.55.16
IONA (Annotation Service)    1.0.10.37
Manta (SV Caller)            0.23.1
SAMtools                     0.1.19-isis-1.0.3

Reference Genomes

• Human, UCSC hg19

The human reference genome is PAR-Masked, which means that the Y chromosome sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid mismapping of reads in the duplicate regions of sex chromosomes.

Workflow Diagram

Figure 1 Tumor Normal App Workflow

Set Analysis Parameters

1 Navigate to BaseSpace, and then click the Apps tab.
2 Click Tumor Normal.
3 From the drop-down list, select version 2.0.0, and then click Launch to open the app.
4 In the Analysis Name field, enter the analysis name. By default, the analysis name includes the app name, followed by the date and time that the analysis session starts.
5 From the Save Results To field, select the project that stores the app results.
6 From the Reference Genome field, select the reference genome you want to align to. The default is Human (UCSC hg19 PAR-Masked).
7 From the Annotation field, select either RefSeq or Ensembl as the gene and transcript annotation reference database. The default is RefSeq.
8 From the Start from BAM (AppResult) field:
  a Select Yes to use BAM output files from the Isaac Whole Genome Sequencing v4.0 App. This option gives you a quicker turnaround time.
  b Select No to use sample pairs. The default is No.
9 From the Sample Pairs field, click Select Sample Pairs to open the Select Sample Pairs screen, and then select the normal and tumor samples you want to analyze. Click Confirm.
10 From the AppResult Pairs field, click Select AppResult Pairs to open the Select AppResults Pairs screen, and then select the normal and tumor app results you want to analyze. Click Confirm.
11 Click Continue.

The Tumor Normal App begins analysis of the samples. When analysis is complete, the app updates the status of the app session and sends a notification email to you.

Analysis Methods

The Tumor Normal App uses these methods to analyze the sequencing data.

Isaac Aligner

The Isaac Aligner aligns DNA sequencing data, single or paired-end, with read lengths 32–150 bp and low error rates using the following steps:
• Candidate mapping positions: Identifies the complete set of relevant candidate mapping positions using a 32-mer seed-based search.
• Mapping selection: Selects the best mapping among all candidates.
• Alignment score: Determines alignment scores for the selected candidates based on a Bayesian model.
• Alignment output: Generates final output in a sorted duplicate-marked BAM file, and summary file.

Come Raczy, Roman Petrovski, Christopher T. Saunders, Ilya Chorny, Semyon Kruglyak, Elliott H. Margulies, Han-Yu Chuang, Morten Källberg, Swathi A. Kumar, Arnold Liao, Kristina M. Little, Michael P. Strömberg and Stephen W. Tanner (2013) Isaac: Ultra-fast whole genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29(16):2041-3 bioinformatics.oxfordjournals.org/content/29/16/2041

Candidate Mapping

To align reads, the Isaac Aligner first identifies a small but complete set of relevant candidate mapping positions. The Isaac Aligner begins with a seed-based search using 32-mers from the extremities of the read as seeds. It then performs another search, using different seeds, for only those reads that were not mapped unambiguously with the first-pass seeds.
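The seed-based search described above can be illustrated with a small sketch. It is not the Isaac Aligner implementation: the function names, the dictionary-based index, and the restriction to exact seed matches are all assumptions made for illustration. Only the 32-mer seed length and the use of seeds from the read extremities come from the text.

```python
import random
from collections import defaultdict

SEED_LENGTH = 32  # seed size stated in the guide


def extremity_seeds(read, seed_length=SEED_LENGTH):
    """Return (offset, seed) pairs taken from the two ends of a read."""
    if len(read) < seed_length:
        raise ValueError("read is shorter than the seed length")
    last_offset = len(read) - seed_length
    seeds = [(0, read[:seed_length])]
    if last_offset > 0:
        seeds.append((last_offset, read[last_offset:]))
    return seeds


def build_seed_index(reference, seed_length=SEED_LENGTH):
    """Toy index (assumption): map each reference k-mer to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_length + 1):
        index[reference[pos:pos + seed_length]].append(pos)
    return index


def candidate_positions(read, index, seed_length=SEED_LENGTH):
    """Collect the read start positions implied by exact seed hits."""
    candidates = set()
    for offset, seed in extremity_seeds(read, seed_length):
        for hit in index.get(seed, []):
            candidates.add(hit - offset)
    return sorted(candidates)


# Hypothetical data: a random reference and a 100 bp read copied from it.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(500))
read = reference[100:200]
index = build_seed_index(reference)
print(candidate_positions(read, index))
```

In this toy setup both seeds point back to the same implied start position, which is why the candidate set stays small even though two seeds are searched.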
Mapping Selection

Following a seed-based search, the Isaac Aligner selects the best mapping among all the candidates. For paired-end data sets, all mappings where only one end is aligned (called orphan mappings) trigger a local search to find additional mapping candidates. These candidates (called shadow mappings) are defined through the expected minimum and maximum insert size. After optional trimming of low-quality 3' ends and adapter sequences, the possible mapping positions of each fragment are compared. This step takes into account paired-end information (when available), possible gaps using a banded Smith-Waterman gap aligner, and possible shadows. The selection is based on the Smith-Waterman score and on the log-probability of each mapping.

Alignment Scores

The alignment scores of each read pair are based on a Bayesian model, where the probability of each mapping is inferred from the base qualities and the positions of the mismatches. The final mapping quality (MAPQ) is the alignment score, truncated to 60 for scores above 60, and corrected based on known ambiguities in the reference flagged during candidate mapping. Following alignment, reads are sorted. Further analysis is performed to identify duplicates and optionally to realign indels.
Alignment Output

After sorting the reads, the Isaac Aligner generates compressed binary alignment output files, called BAM (*.bam) files, using the following process:
• Marking duplicates: Detection of duplicates is based on the location and observed length of each fragment. The Isaac Aligner identifies and marks duplicates even when they appear on oversized fragments or chimeric fragments.
• Realigning indels: The Isaac Aligner tracks previously detected indels, over a window large enough for the current read length, and applies the known indels to all reads with mismatches.
• Generating BAM files: The first step in BAM file generation is creation of the BAM record, which contains all required information except the name of the read. To generate the read names, the Isaac Aligner reads data from the base call (BCL) files that were written during base calling on the sequencer. Data are then compressed into blocks of 64 kb or less to create the BAM file.

Isaac Somatic Variant Caller

The Isaac Somatic Variant Caller detects somatic SNVs and indels in sequencing data from a tumor and matched normal sample, based on the following assumptions:
• The normal sample is a mixture of diploid germline variation and noise.
• The tumor sample is a combination of the normal sample and somatic variation.

It is assumed that the somatic variation and the normal noise can occur at any allele frequency ratio. For SNVs, but not for indels, the normal noise component is further modeled as a combination of single-strand and double-strand noise.

Figure 2 Isaac Somatic Variant Caller Method

NOTE
For a detailed overview of Isaac Somatic Variant Caller methods, go to www.ncbi.nlm.nih.gov/pubmed/22581179.

Candidate Indel Search

Strelka scans through the genome using sequence alignments from the normal sample and tumor sample together to find a joint set of candidate indels.
The information in sequence alignments is supplemented with externally generated candidate indels discovered by Manta. Manta provides external candidate indels to Strelka for indels of size 50 and below. Candidate indels are used for realignment of reads, during which each candidate indel is evaluated as a potential somatic indel. Any other types of indels are considered noise indels. If a better alignment is not found, these indels are allowed to remain in the read alignments; otherwise, they are not used.

The candidate indel thresholds are designed so that the joint candidate indel set is at least the combined set found if the Small Variant Caller (Starling) is run on the individual samples. Specifically, where a minimum number of nominating reads is required for candidacy in Starling, Strelka requires the same minimum number of nominating reads from the combined input. Strelka also requires that at least 1 sample contains a minimum fraction of supporting reads among the sample reads for candidacy.

Realignment

For every read that intersects a candidate indel, Strelka attempts to find the most probable alignments including the candidate indel and excluding the candidate indel. Typically, the alignment excluding the candidate indel aligns to the reference, but occasionally an alternate indel that overlaps or interferes with the candidate is found to be more likely. The indel caller uses the probabilities of both alignments as part of the indel quality score calculation, whereas only a single alignment (usually the most probable) is preserved for SNV calling.

Somatic Caller

Strelka uses a Bayesian probability model similar to the one used for germline variant calling in the Starling Small Variant Caller or in external tools such as GATK. Using this model, the objective is to compute the posterior probability P(θ | D), which is the probability of the model state θ conditioned on the observed sequencing data D.
In a germline variant caller, the state space of the model is conventionally a discrete set of diploid genotypes. For SNVs, the set of possible states is:

G = {AA, CC, GG, TT, AC, AG, AT, CG, CT, GT}

The Strelka model instead approximates continuous allele frequencies for each allele:

f = {f_A, f_C, f_G, f_T}

The allele frequencies are restricted to allow a maximum of 2 nonzero frequencies. Any additional alleles observed in the data are treated as noise. Another departure from typical germline calling methods is that the state space of the model is the allele frequency of both the tumor and the normal sample. In the following equation, f_t and f_n represent the allele frequencies of the tumor and normal samples, respectively.

θ = (f_t, f_n)

The final somatic variant quality value reported by the model is computed from the probability that the allele frequencies are unequal (ie, f_t ≠ f_n) given the observed sequence data.

Post-Call Filtration

Heuristic filters remove several types of improbable calls resulting from data artifacts that cannot be easily represented in the somatic probability model. These filters act as a final step to separate out the final set of somatic calls reported by Isaac Somatic Variant Caller.

Input Data Filtration

Isaac Somatic Variant Caller uses 2 tiers of input data filtration during somatic small variant calling:
• Tier 1: A more stringent filtering to ensure high-quality calls.
• Tier 2: A lower filtration stringency.

Initially, candidates are called using a subset of the data with more stringent tier 1 filtering. If the method produces a nonzero quality score for any SNV or indel, the potential somatic variant is called again using data with the lower tier 2 stringency. The lower quality from the 2 tiers is selected for output. However, if the tier 2 quality is 0, the call is eliminated. The tier used for each quality value is provided in the Isaac Somatic Variant Caller output record for each somatic variant.
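The two-tier rule above can be sketched as a small function. This is an illustration under stated assumptions rather than the caller's actual code: the function name is hypothetical, and reporting tier 1 when both tiers give the same quality is an assumption that the text does not specify.

```python
from typing import Optional, Tuple


def combine_tier_qualities(q_tier1: int, q_tier2: int) -> Optional[Tuple[int, int]]:
    """Return (quality, tier used), or None when the call is eliminated.

    q_tier1 is the quality from the stringent tier 1 data and q_tier2 is
    the quality from the permissive tier 2 data.
    """
    if q_tier1 == 0:
        # No nonzero tier 1 quality, so the site never becomes a candidate.
        return None
    if q_tier2 == 0:
        # A tier 2 quality of 0 eliminates the call.
        return None
    if q_tier2 < q_tier1:
        return (q_tier2, 2)
    # Assumption: on a tie, report tier 1.
    return (q_tier1, 1)


print(combine_tier_qualities(40, 25))
print(combine_tier_qualities(30, 0))
```

Because the lower of the two values is always reported, adding the lower-stringency tier 2 data can only keep or reduce the quality of a call, never raise it.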
If the most likely normal genotype is not the same at tier 1 and tier 2, then the normal genotype is reported as a conflict in the output. Using 2 data tiers enables an initial somatic call based on high-quality data. Given a potential call, the tier 2 evaluation then lets lower-quality data reveal support for the putative somatic allele in the normal sample, which lowers the reported quality of the call.

The following table lists the primary data filtration levels that are changed between tier 1 and tier 2.

Parameter                                               Tier 1 Value   Tier 2 Value
Min paired-end alignment score                          20             0
Min single-end alignment score                          10             0
Single-end score rescue?                                No             Yes
Include unanchored pairs?                               No             Yes
Include anomalous pairs?                                No             Yes
Include singleton pairs?                                No             Yes
Mismatch density filter: maximum mismatches in window   3              10

Additional Filtration

After the somatic filter is finished, more filters are applied. A single candidate somatic call can be annotated with several filters.

For somatic SNVs and indels, Isaac Somatic Variant Caller produces a general somatic quality score, Q(ssnv) or Q(somatic indel), which indicates the probability of the somatic variant. It also produces a joint score for the somatic variant and a specific normal genotype, Q(ssnv+ntype) or Q(somatic indel+ntype). The 2-tier evaluation is applied to each of these qualities separately, as follows:
• Q(ssnv) = min(Q(ssnv|tier1), Q(ssnv|tier2))
• Q(ssnv+ntype) = min(Q(ssnv+ntype|tier1), Q(ssnv+ntype|tier2))

Figure 3 Additional Filtration

Quality Filtration Levels

Only somatic calls originating from homozygous reference alleles in the normal sample are reviewed for validation and included in the output.
• Somatic SNVs are reported if the normal genotype is equal to the reference and Q(ssnv+ntype) ≥ 15.
• Somatic indels are reported if the normal genotype is equal to the reference and Q(somatic indel+ntype) ≥ 30.

NOTE
The value Q(ssnv+ntype) is associated with the VCF key QSS_NT.
The value Q(somatic indel+ntype) is associated with the VCF key QSI_NT.

Input Data

The somatic caller requires data from both the tumor and the normal sample. The inputs for each sample are the same: a sorted BAM file containing the sequencing reads and, optionally, externally generated indel candidates.

Output Data

Strelka generates output files in VCF 4.1 format that contain metadata to describe per-record and per-sample data in the file, and list the filters applied to the records.

Large Indel and Structural Variant Calls

The large indel and structural variant caller uses the series of modules described here, and then generates output files in VCF 4.1 format.

Before ReadBroker
• StatsGenerator: Computes summary statistics on insert sizes, read orientation, and alignment scores for each input BAM file.
• AnomalousReadFinder: Grouper processes chromosomes in chunks. This method enables parallel execution and, therefore, faster performance. AnomalousReadFinder examines all alignments in a block and classifies reads and read pairs as follows:
  • Classifies reads as either shadow (unaligned) or semialigned (partial or clipped alignment).
  • Classifies read pairs as either InsertionPair, DeletionPair, InversionPair, TandemDuplicationPair, or ChimericPair, according to the type of structural variant with which an anomalously mapped read pair is associated.
• ClusterFinder: Clusters reads based on their type and the position of their alignment. Only reads of the same type are clustered together at this stage, except shadow and semialigned reads, which can be clustered together.
• ClusterMerger: Associates clusters of various anomalous read types with the shadow/semialigned read clusters that the same breakpoint can cause. A breakpoint is a pair of bases that are adjacent in the sample genome but not in the reference. Two clusters are merged if they share a read or if they agree on the position and length of the structural variant. This information is inferred from read alignment orientation and distance.

ReadBroker
• Interchromosomal translocations yield chimeric read pairs where 1 read aligns to one chromosome and its partner aligns to another. Because Grouper examines each chromosome individually, the ReadBroker step is performed to join the information from chimeric read pairs across chromosomes.

After ReadBroker
• SmallAssembler: Assembles reads in clusters into contigs using a de Bruijn method and iteratively assembles reads into contigs until all reads in the cluster are assembled. It also produces a file containing the reads that were used to assemble the contig, with a realignment to the contig sequence.
• SpanContigs: Uses the presence of nearby anomalous read pairs to determine whether to extend the search range used by the subsequent AlignContig step from its default.
• AlignContig: Computes a dynamic programming alignment of a contig to a region of the reference genome; merges full or partial duplicate calls of the same event into a single call.
• VariantFilter: Removes all structural variants that overlap with gaps identified in UCSC gaps. The UCSC gaps file defines regions of the genome that have not been sequenced.
• DeletionGenotyper: Assigns a genotype to all deletions.
• SomaticGenotyper: Assigns a quality score (Q-score) to all structural variants. Higher Q-scores indicate a higher probability that this structural variant is somatic.

Somatic Genotyping

The Somatic Genotyping module simultaneously analyzes 2 BAM files with read alignments, 1 for the normal sample and 1 for the tumor sample. Mantra assumes that each BAM file contains reads from only one sample and tracks the sample membership of each read through the workflow. The Somatic Genotyping module attempts to estimate the probability of a variant being somatic given the read coverage in normal and tumor samples.
To do so, it gathers a pool of breakpoint-associated reads from each sample and realigns them to the reference and putative somatic allele to estimate allele support from each sample.

Pre-Scoring Alignment

As an initial step in the process, candidate variants smaller than minSomaticCallSize are filtered out. Remaining candidate variants that meet any of the following criteria are then excluded from somatic scoring, because they are deemed likely false positives from the outset:
1 Candidate variants where the variant contig is constructed from an equal or greater number of normal sample reads compared to tumor sample reads (following coverage-based normalization).
2 Candidate variants having fewer than 2 tumor sample reads used to construct the variant contig (parameter "minAnomReadSuppCancer").
3 Candidate variants having more than 10 normal sample reads used to construct the variant contig (parameter "maxAnomReadSuppNormal").

After this preprocessing filter, a pool of breakpoint-associated reads is found for each sample. This pool is used for realignment to both the reference allele and the putative somatic allele to find evidence for support of each allele in each sample. The breakpoint-associated read pool is built from two sources: the first source is the reads used to assemble the variant contig; the second source comprises all reads (from the BAM files) that align within 10 bases of the predicted breakpoints. Reads that are flagged as PCR duplicates, unmapped, or anomalous, or that have a MAPQ score < 20, are not included.

The reads in the breakpoint-associated read pools are then realigned to the reference and putative somatic alleles using a Smith-Waterman alignment. To have sufficient sequence context, the reference and intrachromosomal somatic alleles are extended to have at least 120 bp on either side of the variant breakpoints. Read alignments that have mismatches in more than 10% of their sequence or have more than 3 gaps are not eligible to count as support for an allele.
In addition, the read alignment score must favor the reference or somatic allele by at least (match score - mismatch score) × 10 to be included as a supporting allele count, and the realigned read must cross one of the predicted breakpoints by at least 8 bases.

Somatic Quality Scores

After the prescoring steps, Mantra uses a probabilistic model to estimate Q-scores. Specifically, given counts of anomalous and normal reads in tumor and normal BAM files, Mantra uses the Fisher Exact Test to estimate the probability of observing the given number, or a more extreme number, of anomalous reads in the tumor sample. This estimate is written as a Phred-scaled score: Q-score = -10 log10(P), where P is the estimated probability that the variant is not somatic, so that higher scores indicate stronger evidence for a somatic variant. Inspection of somatic variant calls in both real and simulated data suggests that a Q-score threshold of 30 provides a good trade-off between sensitivity and specificity. The maximum achievable score is 60; the lowest is 0.

Somatic Genotyping VCF

The Somatic Genotyping module generates a modified VCF file with records having a SOMATICSCORE key in the INFO field and a modified READSOURCES key. This is shown in the following example:

    chr1 5618276 chr1uMantrauDELu0u3420 TGAGTGAATGAATGAGGGAATGAATGAATGAGG T 101 PASS SOMATICSCORE=31;NS=1;SVTYPE=DEL;SVLEN=32;READSOURCES=(0:17:24,1:1:22);UPSTREAM=AATGAATGAGTGAATGCATGAATGAATGAGTGAATGAATGAATGAGTGAATGCATGAATGAGTGAATAAGT;DOWNSTREAM=GAATGAATGAATGAGTGAATAATTGAGTGAGTGAATGAATGA;CONTIG=AATGAATGAGTGAATGCATGAATGAATGAGTGAATGAATGAATGAGTGAATGCATGAATGAGTGAATAAGTGAATGAATGAATGAGTGAATAATTGAGTGAGTGAATGAATGA;CONTIG_NUM=9130;CLUSTER_NUM=7630 GT 1/.

Values for READSOURCES (0:17:24,1:1:22) can be interpreted as follows:
• 0: First/tumor sample
• 17: Number of anomalous reads in tumor sample
• 24: Number of normal reads in tumor sample
• 1: Second/normal sample
• 1: Number of anomalous reads in normal sample
• 22: Number of normal reads in normal sample

Mantra tags some somatic structural variants as low confidence. The following VCF filter flags are used to indicate low confidence:
• MaxSVLEN: The somatic variant is larger than the default maximum length of 10 kb for confident SV calls.
• LowSomaticScore: Somatic variants with a Q-score of less than 30 are still listed in the output file. They are marked using this flag to indicate that they are of lower confidence.
• LowMAPQ: The fraction of reads that are non-anomalous and above the minimum MAPQ threshold in the normal sample is less than 0.8.
• MaxDepth: Normal sample site depth is greater than 3x the mean chromosome depth.
• NormalSupport: At least one read in the normal sample strongly supports the putative somatic allele.
• MinSampleCount: Fewer than 8 reads support any allele in one of the samples.

Copy Number Aberrations (SENECA)

The copy number aberrations module is also referred to as SENECA (SEnsitive detection of copy NumbErs in CAncer). It identifies copy number aberrations (CNAs) in heterogeneous tumor samples that exhibit contamination with normal tissues, aneuploidy, and loss of heterozygosity (LOH), all of which can confound correct copy assignment and lead to erroneous CNA calls. The algorithm workflow comprises 2 distinct steps:
• Segmentation of data into regions with putatively distinct copy numbers.
• Calculation of ploidy and purity with a final copy number assignment.

As input, SENECA uses aligned sequences from tumor and matched normal samples and annotation information about the location of known variants in dbSNP, regional alignability, and the location of gaps in the reference genome.

Segmentation

SENECA is a count-based method to assign copy number state. It compares coverage between tumor and normal samples. Specifically, it bins read coverage using nonoverlapping 1 kb windows to derive counts in tumor and normal samples, and it then takes the ratio of the 2 counts.
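The binning step can be sketched as follows. This is a minimal illustration, not SENECA itself: counting reads by their start position, skipping bins with no normal reads, and omitting any depth normalization between the two samples are all assumptions made for the example. Only the nonoverlapping 1 kb windows and the tumor-to-normal count ratio come from the text.

```python
from collections import Counter

BIN_SIZE = 1000  # nonoverlapping 1 kb windows, as stated in the guide


def bin_counts(read_starts, bin_size=BIN_SIZE):
    """Count reads per bin, assigning each read by its start position."""
    return Counter(pos // bin_size for pos in read_starts)


def coverage_ratios(tumor_starts, normal_starts, bin_size=BIN_SIZE):
    """Return {bin index: tumor count / normal count} for usable bins."""
    tumor = bin_counts(tumor_starts, bin_size)
    normal = bin_counts(normal_starts, bin_size)
    ratios = {}
    for bin_index in sorted(set(tumor) | set(normal)):
        if normal[bin_index] == 0:
            # Assumption: the ratio is undefined here, so skip the bin.
            continue
        ratios[bin_index] = tumor[bin_index] / normal[bin_index]
    return ratios


# Hypothetical read start positions on one chromosome.
tumor_starts = [10, 20, 1500, 1600, 1700, 2500]
normal_starts = [5, 900, 1200, 1300]
print(coverage_ratios(tumor_starts, normal_starts))
```

A real pipeline would also need to account for differences in total sequencing depth between the two samples before the ratios are compared across bins.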
Bins are skipped during segmentation when more than 20% of their size overlaps low alignability regions. Independently, SENECA calculates B allele ratios at dbSNP positions from the tumor BAM file, and it keeps only SNVs that are heterozygous in the corresponding normal sample. Segmentation is carried out independently for copy number and B allele ratios.

Ploidy and Purity Calculation

Following segmentation, SENECA performs ploidy and purity calculations. These calculations are based on the principle that for each value of ploidy and purity and a selected copy number, the values of B allele and read count ratios can be inferred. For example, for copy number state 1 (1 deleted allele of a diploid genome), the B allele ratio is always near 0 because only 1 allele is present. However, if a tumor sample has only 70% purity because of the presence of the normal genome as background, the B allele ratio increases due to the presence of a heterozygous normal allele. The low percentage of purity results in a final B allele ratio of 0.15.

SENECA fits a multivariate Gaussian distribution to copy number data and B allele ratio data on a two-dimensional grid of varying ploidy and purity. On the grid, each state encodes ploidy and purity values. In addition, SENECA uses a separate state encoding copy neutral LOH and copy gain LOH to identify loss-of-heterozygosity events. Ploidy and purity associated with the model that has the highest log-likelihood are then used to assign a copy number state to each segment.

When both segments and copy numbers are estimated, a quality score for copy number assignment is computed using a likelihood ratio test. This test compares the likelihood of the current copy number assignment to the likelihood of assigning 1 more or 1 less copy.
Results of the likelihood ratio test are then reported as a Q-score field in the VCF file using the following transformation: 2 × log(s1/s2), where s1 is the sum of squares for the selected model and s2 is the sum of squares for the next nearest model. A Q-score threshold of 1.5 provides a good trade-off between sensitivity and specificity.

Analysis Output

To view the results, navigate to BaseSpace, click the Projects tab, then the project name, and then the analysis.

Figure 4 Tumor Normal Output Navigation Bar

After analysis is complete, access the output through the left navigation bar.
• Analysis Info: Information about the analysis session, including log files.
• Inputs: Overview of input settings.
• Output Files: Output files for the sample.
• Analysis Reports: A list of reports for a single sample pair.

Analysis Info

The Analysis Info page displays the analysis settings and execution details.

Row Heading       Definition
Name              Name of the analysis session.
Application       App that generated this analysis.
Date Started      Date and time the analysis session started.
Date Completed    Date and time the analysis session completed.
Duration          Duration of the analysis.
Session Type      Multi-Node or Single-Node
Status            Status of the analysis session. The status shows either Running or Complete and the number of nodes used.

Log Files

File Name               Description
CompletedJobInfo.xml    Contains information about the completed analysis session.
Logging.zip             Contains all detailed log files for each step of the workflow.
SampleSheet.csv         Sample sheet.
SampleSheetUsed.csv
WorkflowError.txt       Contains error messages created when running the workflow.
WorkflowLog.txt         Contains details about workflow steps, command line calls with parameters, timing, and progress.

Output Files

The Output Files page provides access to the output files for each sample analysis.
• BAM Files
• VCF Files
• Genome VCF Files

BAM File Format

A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb. SAM and BAM formats are described in detail at https://samtools.github.io/hts-specs/SAMv1.pdf.

BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed for the run.

BAM files contain a header section and an alignment section:
• Header: Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section.
• Alignments: Contains read name, read sequence, read quality, alignment information, and custom tags. The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor string.

The alignments section includes the following information for each read or read pair:
• RG: Read group, which indicates the number of reads for a specific sample.
• BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
• SM: Single-end alignment quality.
• AS: Paired-end alignment quality.
• NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
• XN: Amplicon name tag, which records the amplicon tile ID associated with the read.

BAM index files (*.bam.bai) provide an index of the corresponding BAM file.

VCF File Format

Variant Call Format (VCF) is a widely used file format developed by the genomics scientific community that contains information about variants found at specific positions in a reference genome.

• VCF File Header: Includes the VCF file format version and the variant caller version. The header lists the annotations used in the remainder of the file. If MARS is listed, the Illumina internal annotation algorithm annotated the VCF file.
The VCF header includes the reference genome file and BAM file. The last line in the header contains the column headings for the data lines.

VCF File Data Lines—Each data line contains information about a single variant.

VCF File Headings

Heading | Description
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS | The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for insertions or deletions, this position is the reference base immediately before the variant.
ID | The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF | The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T.
ALT | The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, which are high in relation to the error rate observed.

VCF files use the file naming format SampleName_S#.vcf, where # is the sample number determined by the order that samples are listed for the run.

VCF File Annotations

Heading | Description
FILTER | If all filters are passed, PASS is written in the filter column.
• LowDP—Applied to sites with depth of coverage below a cutoff.
• LowGQ—The genotyping quality (GQ) is below a cutoff.
• LowQual—The variant quality (QUAL) is below a cutoff.
• LowVariantFreq—The variant frequency is less than the given threshold.
• R8—For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.
• SB—The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
INFO | Possible entries in the INFO column include:
• AC—Allele count in genotypes for each ALT allele, in the same order as listed.
• AF—Allele Frequency for each ALT allele, in the same order as listed.
• AN—The total number of alleles in called genotypes.
• CD—A flag indicating that the SNP occurs within the coding region of at least 1 RefGene entry.
• DP—The depth (number of base calls aligned to a position and used in variant calling).
• Exon—A comma-separated list of exon regions read from RefGene.
• FC—Functional Consequence.
• GI—A comma-separated list of gene IDs read from RefGene.
• QD—Variant Confidence/Quality by Depth.
• TI—A comma-separated list of transcript IDs read from RefGene.
FORMAT | The format column lists fields separated by colons. For example, GT:GQ. The list of fields provided depends on the variant caller used. Available fields include:
• AD—Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls.
• DP—Approximate read depth; reads with MQ=255 or with bad mates are filtered.
• GQ—Genotype quality.
• GQX—Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality.
• GT—Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
• NL—Noise level; an estimate of base calling noise at this position.
• PL—Normalized, Phred-scaled likelihoods for genotypes.
• SB—Strand bias at this position. Larger negative values indicate less bias; values near 0 indicate more bias. Used with the Somatic Variant Caller and GATK.
• VF—Variant frequency; the percentage of reads supporting the alternate allele.
SAMPLE | The sample column gives the values specified in the FORMAT column.

Genome VCF Files

Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for representing all sites within the genome in a reasonably compact format. The gVCF files include all sites within the region of interest in a single file for each sample.

The gVCF file shows no-calls at positions with low coverage, or where a low-frequency variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the reference. A genotype (GT) tag of ./. indicates a no-call.

For more information, see sites.google.com/site/gvcftools/home/about-gvcf.

Sample Summary Reports

The Tumor Normal App provides an overview of statistics per sample in the Analysis Reports sample pages. To download the statistics, click PDF Summary Report.
• Somatic Analysis Summary
• Normal Sample Summary
• Tumor Sample Summary

NOTE
For more information about the summary report, see Molecular Characterization of Tumors Using Next-Generation Sequencing.

Somatic Analysis Summary

The Tumor Normal App provides a summary of the somatic analysis statistics on the tumor and normal samples. To download the statistics, click PDF Summary Report.

Sample Information

Table 1 Sample Information
Statistic | Definition
Gigabases Passing Filter | Number of gigabases passing filter for this sample.
% Bases ≥ Q30 | The percentage of bases with a quality score of 30 or higher.
Purity | Estimates the amount of signal for the tumor sample from normal cells as background. For more details, see Ploidy and Purity Calculation on page 16.
Ploidy | Estimates the copy number variations for the entire tumor genome.
For more details, see Ploidy and Purity Calculation on page 16.

Variants Summary

Table 2 Somatic Small Variants Summary
Statistic | Definition
Total | The total number of variants present in the data set that pass the quality filters.
Number in Genes | The number of variants in a gene.
Number in Exons | The number of variants in an exon.
Number in Coding Regions | The number of variants in a coding region.
Splice Site Region | The number of variants in a splice site region.
Stop Gained | The number of variants that cause an additional stop codon.
Stop Lost | The number of variants that cause the loss of a stop codon.
Frameshift | The number of variants that cause a frameshift.
Non-synonymous | The number of variants that cause an amino acid change in a coding region.
Synonymous | The number of variants that are within a coding region but do not cause an amino acid change.
Mature miRNA | The number of variants in a mature miRNA.
UTR Region | The number of variants in an untranslated region (UTR).
dbSNP | The number of variants present in dbSNP.

Table 3 Somatic Small Variants
Statistics | Definition
Chr | Name of reference chromosome.
Pos | The reference position within chromosome.
Depth | The number of reads aligned at this position.
Ref | The reference allele.
Alt | The alternative allele.
Alt Freq | The proportion of the alternative allele among all alleles being considered.
Type | The type of small variant (SNV, Insertion, Deletion).
Consequence | Predicted transcript consequence.
dbSNP | The numeric identifier developed by the National Center for Biotechnology Information (NCBI) for the Single Nucleotide Polymorphism Database (dbSNP).
COSMIC | The numeric identifier for the variant in the Catalogue of Somatic Mutations in Cancer (COSMIC) database.
ClinVar | A public archive of reports of the relationships among human variations and phenotypes.
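
The REF/ALT conventions and the Phred-scaled QUAL score described under VCF File Headings, together with the Type column in Table 3, can be illustrated with a short Python sketch. This is only an illustration, not the app's implementation. The names SmallVariant, classify_small_variant, phred_to_error_probability, and parse_vcf_data_line are hypothetical, and the sketch assumes tab-separated VCF data lines with at least the first seven columns present and examines only the first ALT allele.

```python
from typing import NamedTuple, Optional


class SmallVariant(NamedTuple):
    """Hypothetical container for the fixed VCF columns used in this sketch."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: Optional[float]
    filter_status: str
    variant_type: str


def classify_small_variant(ref: str, alt: str) -> str:
    """Return SNV, Insertion, or Deletion based on REF/ALT allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "Insertion"  # e.g. REF A, ALT AT
    if len(alt) < len(ref):
        return "Deletion"   # e.g. REF TT, ALT T
    # Equal-length multi-base changes are not one of the three listed types.
    return "Other"


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def parse_vcf_data_line(line: str) -> SmallVariant:
    """Parse CHROM, POS, ID, REF, ALT, QUAL, and FILTER from one data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt, qual, filter_status = fields[:7]
    first_alt = alt.split(",")[0]  # assumption: only the first ALT allele is examined
    return SmallVariant(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alt=first_alt,
        qual=None if qual == "." else float(qual),  # '.' is the missing value marker
        filter_status=filter_status,
        variant_type=classify_small_variant(ref, first_alt),
    )


if __name__ == "__main__":
    # Made-up data line for illustration only.
    example = "chr1\t12345\t.\tTT\tT\t30\tPASS\tDP=40\tGT:GQ\t0/1:30"
    variant = parse_vcf_data_line(example)
    print(variant.variant_type, phred_to_error_probability(variant.qual))
```

The sketch keeps classification purely length-based because that is all the REF and ALT examples in the headings table rely on. A real pipeline would normally use an established VCF parsing library and handle multiple ALT alleles.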
Table 4 Somatic Structural Variants Summary
Variant Class | Notes
CNV | The method to determine copy number variations is described in Copy Number Aberrations (SENECA) on page 16.
Deletions | For more information regarding the criteria used to determine these structural variants, see Large Indel and Structural Variant Calls on page 13.
Tandem duplications |
Insertions, inversions |
Translocation breakends |
Duplications (DUP) |

Table 5 Somatic Structural Variants
Statistics | Definition
Chr | Name of reference chromosome.
Pos | Position within reference chromosome.
Len | Length difference between reference allele and alternative allele.
Type | Structural variant type.
Qual | Structural variant quality score.

Table 6 Somatic Copy Number Variants
Statistics | Definition
Chr | Name of reference chromosome.
Pos | The reference position within chromosome.
Len | The estimated length of the copy number variant.
Qual | The quality score of the copy number variant.
Copy Number | The number of copies of the copy number variant.
LOH | Loss of heterozygosity of the copy number variant.

Table 7 Somatic Translocation Variants
Statistics | Definition
Chr | Name of reference chromosome.
Pos | Position within reference chromosome.
Ref | The reference allele.
Alt | The alt allele.
Qual | Structural variant quality score.

Circos Plot of Somatic Variations

The circos plot provides visualization of somatic small variation, ploidy, and structural variations reported in the somatic variation files (VCF). The circos plot displays somatic variation data in tracks with chromosomes circularly arranged.

Following is an example legend. Labels are described from inside the circle to the outside.

Legend | Label (From Inner Circle to Outer Circle) | Description
A | Somatic structural variants | The somatic structural variants detailed in somatic.SVs.vcf are plotted in the center of the plot. Green links—Segmental duplications (at the center of the circle). Green boxes—Inversions (the first inner track).
Purple boxes—Deletions (the second track). The width of the boxes indicates the length of SVs. Purple bars—Insertion breakpoints (the third track). Red links—Translocations. The end of the links indicates the 2 breakpoints of SVs.
B | Number of somatic indels per Mb | The density of PASS somatic indels reported in somatic.indels.vcf.gz in 1 Mb windows. The scale of Y-axis in the histogram indicates the counts.
C | Number of somatic SNVs per Mb | The density of PASS somatic SNVs reported in somatic.snvs.vcf in 1 Mb windows, arbitrarily scaled in a histogram with Y-axis pointing inward.
D | Copy-neutral loss of heterozygosity (LOH) | The LOH regions with SNP calls in the normal genome but a homozygous reference call in the tumor genome, in CNVs.vcf.
E | B-allele frequency | The B-allele ratios calculated by SENECA that will be used in the ploidy and purity estimation.
F | Called level | The copy number aberrations from CNVs.vcf file. The scale of Y-axis in the histogram indicates the called level.
G | Karyotype | The standard Circos ideogram defining the chromosome position, identity, and color of cytogenetic bands.
H | Chromosome position | The reference coordinates along the chromosome (in megabases).
I | Chromosome number | Chromosome number: 1, 2,…,22, X, Y.
J | HGNC symbols for genes harboring variants | HGNC genes impacted by somatic SNVs. Genes containing SNVs in the coding region with an HGNC symbol are labeled.
K | Genes of nonsynonymous variants | Genes identified in (J) resulting in non-synonymous changes in the coding region are highlighted in red.

Depth/B-Allele Plot

The top plot provides an overview of the depth of coverage by chromosomal position. Aberrant values indicate copy number variations. Copy number ratios are classified as gains (red), losses (green), or copy number unchanged (black). Estimated purity and ploidy values are listed at the top of the plot.
The bottom of the plot displays the B-allele ratio by chromosomal position. The B-allele ratio is the ratio of the 2 alleles A and B. The B-allele ratios that are around 0.5 are filtered out for diploid regions.

Normal Sample

The Tumor Normal App provides an overview of statistics on the normal sample. To download the statistics, click PDF Summary Report.

Alignment Summary

Table 8 Alignment Summary
Statistic | Definition
Number of Reads | Total number of reads passing filter for this sample.
Coverage | Total number of aligned bases divided by the genome size.
Percent Duplicate Paired Reads | Percentage of paired reads that have duplicates.
Fragment Length Median | Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Fragment Length Standard Deviation | Standard deviation of the sequenced fragment length.

Table 9 Read Statistics
Statistic | Definition
Percent Aligned | The percentage of reads passing filter that aligned to the reference genome.
Percent Q30 | The percentage of bases with a quality score of 30 or higher.
Mismatch Rate | The average percentage of mismatches across both reads 1 and 2 over all cycles.

Variants Summary

Table 10 Small Variants Summary
Statistic | Definition
Total Passing | Total number of variants present in the data set that pass the quality filters.
Percent Found in dbSNP | 100*(number of variants in dbSNP/number of variants).
Het/Hom Ratio | Number of heterozygous variants/number of homozygous variants.
Ts/Tv Ratio | Transition rate of SNVs that pass the quality filters divided by transversion rate of SNVs that pass the quality filters.

Table 11 Variants by Sequence Context
Statistic | Definition
Number in Genes | The number of variants in a gene.
Number in Exons | The number of variants in an exon.
Number in Coding Regions | The number of variants in a coding region.
Number in UTR Region | The number of variants in an untranslated region (UTR).
Number in Mature miRNA | The number of variants in a mature miRNA.
Splice Site Region | The number of variants in a splice site region.

Table 12 Variants by Consequence
Statistic | Definition
Frameshift | The number of variants that cause a frameshift.
Non-synonymous | The number of variants that cause an amino acid change in a coding region.
Synonymous | The number of variants that are within a coding region but do not cause an amino acid change.
Stop Gained | The number of variants that cause an additional stop codon.
Stop Lost | The number of variants that cause the loss of a stop codon.

Coverage Histogram

The coverage histogram shows the number of reference bases plotted against the depth of coverage (read depth). It has the following features:
• The drop-down menu lets you view the overall frequency or a specific chromosome.
• The Fix Y Scale checkbox lets you keep the y-axis scale the same when comparing multiple chromosomes.
• The Export TSV button lets you export the coverage data in a tab-separated text file.

Tumor Sample

The Tumor Normal App provides an overview of statistics on the tumor sample. To download the statistics, click PDF Summary Report.

Alignment Summary

Table 13 Alignment Summary
Statistic | Definition
Number of Reads | Total number of reads passing filter for this sample.
Coverage | Total number of aligned bases divided by the genome size.
Percent Duplicate Paired Reads | Percentage of paired reads that have duplicates.
Fragment Length Median | Median length of the sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Fragment Length Standard Deviation | Standard deviation of the sequenced fragment length.
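
The alignment summary definitions above are simple ratios and order statistics, which the following Python sketch illustrates. It is not how the app computes its report. The function names are hypothetical, the input numbers are made up, and the use of the sample standard deviation is an assumption, because the guide does not say which form of standard deviation is reported.

```python
import statistics
from typing import Sequence, Tuple


def mean_coverage(aligned_bases: int, genome_size: int) -> float:
    """Coverage: total number of aligned bases divided by the genome size."""
    return aligned_bases / genome_size


def percent_duplicate_pairs(duplicate_pairs: int, total_pairs: int) -> float:
    """Percentage of paired reads that have duplicates."""
    return 100.0 * duplicate_pairs / total_pairs


def fragment_length_summary(lengths: Sequence[int]) -> Tuple[float, float]:
    """Return (median, standard deviation) of fragment lengths.

    Assumption: sample standard deviation (statistics.stdev) is used here;
    the guide does not specify sample versus population.
    """
    return statistics.median(lengths), statistics.stdev(lengths)


if __name__ == "__main__":
    # Illustrative values only: 120 gigabases aligned to a 3-gigabase genome.
    print(mean_coverage(120_000_000_000, 3_000_000_000))
    print(percent_duplicate_pairs(5, 200))
    print(fragment_length_summary([300, 310, 290, 305, 295]))
```

In practice the fragment lengths would come from the read pair alignment positions in the BAM file, as the table notes. Reading them is left out here to keep the sketch free of external dependencies.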
Table 14 Read Statistics
Statistic | Definition
Percent Aligned | The percentage of reads passing filter that aligned to the reference genome.
Percent Q30 | The percentage of bases with a quality score of 30 or higher.
Mismatch Rate | The average percentage of mismatches across both reads 1 and 2 over all cycles.

Coverage Histogram

The coverage histogram shows the number of reference bases plotted against the depth of coverage (read depth). It has the following features:
• The drop-down menu lets you view the overall frequency or a specific chromosome.
• The Fix Y Scale checkbox lets you keep the y-axis scale the same when comparing multiple chromosomes.
• The Export TSV button lets you export the coverage data in a tab-separated text file.

Revision History

Document | Date | Description of Change
Document # 15050950 v01 | January 2016 | Supports Tumor Normal v2.0.

Technical Assistance

For technical assistance, contact Illumina Technical Support.

Table 15 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]

Table 16 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Australia | 1.800.775.688
Austria | 0800.296575
Belgium | 0800.81102
China | 400.635.9898
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Hong Kong | 800960230
Ireland | 1.800.812949
Italy | 800.874909
Japan | 0800.111.5011
Netherlands | 0800.0223859
New Zealand | 0800.451.650
Norway | 800.16836
Singapore | 1.800.579.2745
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
Taiwan | 00806651752
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000

Safety data sheets (SDSs)—Available on the Illumina website at support.illumina.com/sds.html.

Product documentation—Available for download in PDF from the Illumina website. Go to support.illumina.com, select a product, then select Documentation & Literature.
Illumina
5200 Illumina Way
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com