Tumor Normal v2.0
BaseSpace App Guide
For Research Use Only. Not for use in diagnostic procedures.
Introduction
Workflow Diagram
Set Analysis Parameters
Analysis Methods
Analysis Output
Revision History
Technical Assistance
ILLUMINA PROPRIETARY
Document # 15050950 v01
January 2016
This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the
contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This
document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed,
or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license
under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.
The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order
to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read
and understood prior to using such product(s).
FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN
MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND
DAMAGE TO OTHER PROPERTY.
ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S)
DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).
© 2016 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio,
Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect,
MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA,
TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color,
and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All
other names, logos, and other trademarks are the property of their respective owners.
Introduction
The Tumor Normal v2.0 App detects somatic variants from a matched pair of tumor and
normal samples. The Isaac Genome Alignment Software aligns both tumor and normal
samples to a reference. The Isaac Variant Caller calls germline small variants, structural
variants, and copy number abnormalities (CNAs). Also, Strelka calls somatic small
variants in the tumor matched samples, structural variants, and somatic copy number
variants (CNVs).
Compatible Libraries
See the BaseSpace support page for a list of library types that are compatible with the
Tumor Normal App.
Workflow Requirements
} This app supports the following:
  } Human genomes.
  } Samples with paired-end reads.
  } BAM output files from Isaac Whole Genome Sequencing v4 AppResults as inputs.
  } Read lengths that are greater than or equal to 32 bp.
} Recommended read lengths are between 100 and 150 bases. Reads shorter than 50 bases
  result in a warning, while reads shorter than 32 bases cause an error in the app.
} Minimum normal sample data set size is 150 gigabases.
  } Approximately 375 million reads assuming 2 × 100.
  } Approximately 250 million reads assuming 2 × 150.
} Minimum tumor sample data set size is 300 gigabases.
  } Approximately 750 million reads assuming 2 × 100.
  } Approximately 500 million reads assuming 2 × 150.
} Maximum combined (normal+tumor) data set size is 650 gigabases.
  } Approximately 3.25 billion reads assuming 2 × 100.
  } Approximately 2.17 billion reads assuming 2 × 150.
} This app does not support mate pair samples or other paired-end sequencing styles
  that are not forward-reverse.
Versions
The following components are used in the Tumor Normal App.
Software                     Version
Canvas (CNV Caller)          1.1.0.5
Isaac (Aligner)              iSAACSAAC00776.15.01.27
Isaac Variant Caller         starka-2.1.4.2
Isis (Analysis Software)     2.5.55.16
IONA (Annotation Service)    1.0.10.37
Manta (SV Caller)            0.23.1
SAMtools                     0.1.19-isis-1.0.3
Reference Genomes
} Human, UCSC hg19
The human reference genome is PAR-Masked, which means that the Y chromosome
sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid
mismapping of reads in the duplicate regions of sex chromosomes.
Workflow Diagram
Figure 1 Tumor Normal App Workflow
Set Analysis Parameters
1 Navigate to BaseSpace, and then click the Apps tab.
2 Click Tumor Normal.
3 From the drop-down list, select version 2.0.0, and then click Launch to open the app.
4 In the Analysis Name field, enter the analysis name.
  By default, the analysis name includes the app name, followed by the date and time
  that the analysis session starts.
5 From the Save Results To field, select the project that stores the app results.
6 From the Reference Genome field, select the reference genome that you want to align to.
  The default is Human (UCSC hg19 PAR-Masked).
7 From the Annotation field, select either RefSeq or Ensembl as the gene and transcript
  annotation reference database. The default is RefSeq.
8 From the Start from BAM (AppResult) field:
  a Select Yes to use BAM output files from the Isaac Whole Genome Sequencing
    v4.0 App. This option gives you a quicker turnaround time.
  b Select No to use sample pairs. The default is No.
9 From the Sample Pairs field, click Select Sample Pairs to open the Select Sample
  Pairs screen, and then select the normal and tumor samples you want to analyze.
  Click Confirm.
10 From the AppResult Pairs field, click Select AppResult Pairs to open the Select
  AppResults Pairs screen, and then select the normal and tumor app results you want
  to analyze. Click Confirm.
11 Click Continue.
  The Tumor Normal App begins analysis of the samples.
  When analysis is complete, the app updates the status of the app session and sends
  a notification email to you.
Analysis Methods
The Tumor Normal App uses these methods to analyze the sequencing data.
Isaac Aligner
The Isaac Aligner aligns DNA sequencing data, single or paired-end, with read lengths
32–150 bp and low error rates using the following steps:
} Candidate mapping positions—Identifies the complete set of relevant candidate
mapping positions using a 32-mer seed-based search.
} Mapping selection—Selects the best mapping among all candidates.
} Alignment score—Determines alignment scores for the selected candidates based on
a Bayesian model.
} Alignment output—Generates final output in a sorted duplicate-marked BAM file,
and summary file.
Come Raczy, Roman Petrovski, Christopher T. Saunders, Ilya Chorny, Semyon Kruglyak,
Elliott H. Margulies, Han-Yu Chuang, Morten Källberg, Swathi A. Kumar, Arnold Liao,
Kristina M. Little, Michael P. Strömberg and Stephen W. Tanner (2013) Isaac: Ultra-fast whole
genome secondary analysis on Illumina sequencing platforms. Bioinformatics 29(16):2041-3
bioinformatics.oxfordjournals.org/content/29/16/2041
Candidate Mapping
To align reads, the Isaac Aligner first identifies a small but complete set of relevant
candidate mapping positions. The Isaac Aligner begins with a seed-based search using
32-mers from the extremities of the read as seeds. Isaac Aligner performs another search
using different seeds for only those reads that were not mapped unambiguously with the
first pass seeds.
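The seeding idea described above can be sketched as follows. This is a minimal illustration only, not the Isaac Aligner implementation: the function names, the dictionary-based exact-match index, and the random test reference are all hypothetical, and the second pass with different seeds is not shown.

```python
import random

SEED_LENGTH = 32  # seed size stated in this guide


def extremity_seeds(read, seed_length=SEED_LENGTH):
    """Return (offset, seed) pairs taken from the two ends of a read."""
    if len(read) < seed_length:
        return []  # too short to hold a single seed
    seeds = [(0, read[:seed_length])]
    last_offset = len(read) - seed_length
    if last_offset > 0:
        seeds.append((last_offset, read[last_offset:]))
    return seeds


def build_index(reference, seed_length=SEED_LENGTH):
    """Toy exact-match index (hypothetical): k-mer -> reference positions."""
    index = {}
    for pos in range(len(reference) - seed_length + 1):
        index.setdefault(reference[pos:pos + seed_length], []).append(pos)
    return index


def candidate_positions(read, index):
    """Read start positions implied by exact hits of the extremity seeds."""
    candidates = set()
    for offset, seed in extremity_seeds(read):
        for hit in index.get(seed, []):
            candidates.add(hit - offset)
    return sorted(candidates)


# Usage with a made-up random reference (fixed seed for repeatability).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(500))
index = build_index(reference)
read = reference[100:200]  # an error-free 100 bp read taken from position 100
print(candidate_positions(read, index))
```

Because both seeds of an error-free read point back to the same start position, the candidate set stays small, which is the property the first pass relies on.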
Mapping Selection
Following a seed-based search, the Isaac Aligner selects the best mapping among all the
candidates. For paired-end data sets, all mappings where only one end is aligned (called
orphan mappings) trigger a local search to find additional mapping candidates. These
candidates (called shadow mappings) are defined through the expected minimum and
maximum insert size. After optional trimming of low quality 3' ends and adapter
sequences, the possible mapping positions of each fragment are compared. This step
takes into account paired-end information (when available), possible gaps using a banded
Smith-Waterman gap aligner, and possible shadows. The selection is based on the
Smith-Waterman score and on the log-probability of each mapping.
Alignment Scores
The alignment scores of each read pair are based on a Bayesian model, where the
probability of each mapping is inferred from the base qualities and the positions of the
mismatches. The final mapping quality (MAPQ) is the alignment score, truncated to 60
for scores above 60, and corrected based on known ambiguities in the reference flagged
during candidate mapping. Following alignment, reads are sorted. Further analysis is
performed to identify duplicates and optionally to realign indels.
Alignment Output
After sorting the reads, the Isaac Aligner generates compressed binary alignment
output files, called BAM (*.bam) files, using the following process:
} Marking duplicates—Detection of duplicates is based on the location and observed
  length of each fragment. The Isaac Aligner identifies and marks duplicates even
  when they appear on oversized fragments or chimeric fragments.
} Realigning indels—The Isaac Aligner tracks previously detected indels, over a
  window large enough for the current read length, and applies the known indels to
  all reads with mismatches.
} Generating BAM files—The first step in BAM file generation is creation of the BAM
  record, which contains all required information except the name of the read. The
  Isaac Aligner reads data from base call (BCL) files that were written during base
  calling on the sequencer to generate the read names. Data are then compressed into
  blocks of 64 kb or less to create the BAM file.
Isaac Somatic Variant Caller
The Isaac Somatic Variant Caller detects somatic SNVs and indels in sequencing data
from a tumor and matched normal sample, based on the following assumptions:
} The normal sample is a mixture of diploid germline variation and noise.
} The tumor sample is a combination of the normal sample and somatic variation. It
is assumed that the somatic variation and the normal noise can occur at any allele
frequency ratio.
For SNVs, but not for indels, the normal noise component is further modeled as a
combination of single-strand and double-strand noise.
Figure 2 Isaac Somatic Variant Caller Method
NOTE
For a detailed overview of Isaac Somatic Variant Caller methods, go to
www.ncbi.nlm.nih.gov/pubmed/22581179.
Candidate Indel Search
Strelka scans through the genome using sequence alignments from the normal sample
and tumor sample together to find a joint set of candidate indels. The information in
sequence alignments is supplemented with externally generated candidate indels
discovered by Manta. Manta provides external candidate indels to Strelka for indels of
size 50 and below.
Candidate indels are used for realignment of reads, during which each candidate indel is
evaluated as a potential somatic indel. Any other types of indels are considered noise
indels. If a better alignment is not found, these indels are allowed to remain in the read
alignments; otherwise, they are not used.
The candidate indel thresholds are designed so that the joint candidate indel set is at
least the combined set found if the Small Variant Caller (Starling) is run on the
individual samples. Specifically, where a minimum number of nominating reads is
required for candidacy in Starling, Strelka requires the same minimum number of
nominating reads from the combined input. Strelka requires that at least 1 sample
contains a minimum fraction of supporting reads among the sample reads for
candidacy.
Realignment
For every read that intersects a candidate indel, Strelka attempts to find the most
probable alignments including the candidate indel and excluding the candidate indel.
Typically, the alignment excluding the candidate indel aligns to the reference, but
occasionally an alternate indel that overlaps or interferes with the candidate is found to
be more likely. The indel caller uses the probabilities of both alignments as part of the
indel quality score calculation, whereas only a single alignment (usually the most
probable) is preserved for SNV calling.
Somatic Caller
Strelka uses a Bayesian probability model similar to the one used for germline variant
calling in the Starling Small Variant Caller or in external tools such as GATK. Using this
model, our objective is to compute the posterior probability P(θ | D), which is the
probability of the model state θ conditioned on the observed sequencing data D.
In a germline variant caller, the state space of the model is conventionally a discrete set
of diploid genotypes. For SNVs, the set of possible states is:
G = {AA, CC, GG, TT, AC, AG, AT, CG, CT, GT}
The Strelka model instead approximates continuous allele frequencies for each allele:
f={f_A, f_C, f_G, f_T}
The allele frequencies are restricted to allow a maximum of 2 nonzero frequencies. Any
additional alleles observed in the data are treated as noise.
Another departure from typical germline calling methods is that the state space of the
model is the allele frequency of both the tumor and the normal sample. In the following
equation, f_t and f_n represent the allele frequencies of the tumor and normal samples,
respectively.
θ=(f_t, f_n)
The final somatic variant quality value reported by the model is computed from the
probability that the allele frequencies are unequal (that is, f_t ≠ f_n) given the observed
sequence data.
Post-Call Filtration
Heuristic filters remove several types of improbable calls resulting from data artifacts
that cannot be easily represented in the somatic probability model. These filters act as a
final step to separate out the final set of somatic calls reported by Isaac Somatic Variant
Caller.
Input Data Filtration
Isaac Somatic Variant Caller uses 2 tiers of input data filtration during somatic small
variant calling:
} Tier 1—A more stringent filtering to ensure high quality calls
} Tier 2—A lower filtration stringency
Initially, candidates are called using a subset of the data with more stringent tier 1
filtering. If the method produces a nonzero quality score for any SNV or indel, the
potential somatic variant is called again using data with a lower tier 2 stringency. The
lower quality from the 2 tiers is selected for output. However, if the tier 2 quality is 0, the
call is eliminated.
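The two-tier evaluation described above can be sketched as follows. This is a simplified illustration under stated assumptions: `score_fn` is a hypothetical stand-in for the somatic quality calculation, and the two data arguments stand for the same site filtered at tier 1 and tier 2 stringency.

```python
def two_tier_quality(score_fn, tier1_data, tier2_data):
    """Return (quality, tier) for a candidate, or None if it is eliminated.

    score_fn is a hypothetical callable that returns a somatic quality
    score for the data it is given.
    """
    q_tier1 = score_fn(tier1_data)
    if q_tier1 == 0:
        return None  # no candidate from the stringent tier 1 data
    # Only candidates with a nonzero tier 1 quality are called again at tier 2.
    q_tier2 = score_fn(tier2_data)
    if q_tier2 == 0:
        return None  # a tier 2 quality of 0 eliminates the call
    # The lower of the 2 qualities is selected for output.
    if q_tier1 <= q_tier2:
        return q_tier1, 1
    return q_tier2, 2


# Usage with made-up scores keyed by tier label.
fake_scores = {"t1": 40, "t2": 25}
print(two_tier_quality(fake_scores.get, "t1", "t2"))
```

Returning the tier together with the quality mirrors the guide's statement that the tier used for each quality value is provided in the output record.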
The tier used for each quality value is provided in the Isaac Somatic Variant Caller
output record for each somatic variant. If the most likely normal genotype is not the
same at tier 1 and tier 2, then the normal genotype is reported as a conflict in the output.
Using 2 data tiers enables an initial somatic call based on high-quality data. Given a
potential call, using 2 data tiers removes support for the putative somatic allele in the
normal sample from lower quality data. The following table lists the primary data
filtration levels that are changed between tier 1 and tier 2.
Parameter                                               Tier 1 Value   Tier 2 Value
Min paired-end alignment score                          20             0
Min single-end alignment score                          10             0
Single-end score rescue?                                No             Yes
Include unanchored pairs?                               No             Yes
Include anomalous pairs?                                No             Yes
Include singleton pairs?                                No             Yes
Mismatch density filter—Maximum mismatches in window    3              10
Additional Filtration
After the somatic filter is finished, more filters are applied. A single candidate somatic
call can be annotated with several filters.
For somatic SNVs and indels, Isaac Somatic Variant Caller produces 2 quality scores: a
general somatic quality score, Q(ssnv) or Q(somatic indel), which indicates the
probability of the somatic variant, and a joint quality score, Q(ssnv+ntype) or
Q(somatic indel+ntype), which indicates the joint probability of the somatic variant and
a specific normal genotype. The 2 tier evaluation is applied to each of these qualities
separately, as follows:
} Q(ssnv) = min(Q(ssnv|tier1), Q(ssnv|tier2))
} Q(ssnv+ntype) = min(Q(ssnv+ntype|tier1), Q(ssnv+ntype|tier2))
Figure 3 Additional Filtration
Quality Filtration Levels
Only somatic calls originating from homozygous reference alleles in the normal sample
are reviewed for validation and included in the output.
} Somatic SNVs are reported if the normal genotype is equal to the reference
and Q(ssnv+ntype) ≥ 15.
} Somatic indels are reported if the normal genotype is equal to the reference
and Q(somatic indel+ntype) ≥ 30.
NOTE
The value Q(ssnv+ntype) is associated with the VCF key QSS_NT.
The value Q(somatic indel+ntype) is associated with the VCF key QSI_NT.
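As a small illustration of the reporting thresholds above, the check can be written as a single predicate. The function and argument names are hypothetical; only the thresholds (15 for QSS_NT, 30 for QSI_NT) and the reference-genotype requirement come from this guide.

```python
QSS_NT_MIN = 15  # minimum Q(ssnv+ntype) for somatic SNVs
QSI_NT_MIN = 30  # minimum Q(somatic indel+ntype) for somatic indels


def is_reported(variant_type, normal_genotype_is_ref, joint_quality):
    """Return True if a somatic call meets the reporting criteria.

    variant_type is "snv" or "indel"; joint_quality is the QSS_NT or
    QSI_NT value of the call.
    """
    if not normal_genotype_is_ref:
        return False  # normal genotype must equal the reference
    threshold = {"snv": QSS_NT_MIN, "indel": QSI_NT_MIN}[variant_type]
    return joint_quality >= threshold


print(is_reported("snv", True, 15))    # True
print(is_reported("indel", True, 29))  # False
```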
Input Data
The somatic caller requires data from both the tumor and the normal sample. The inputs
for each sample are the same, which are a sorted BAM file containing the sequencing
reads and (optional) externally generated indel candidates.
Output Data
Strelka generates output files in VCF 4.1 format that contain metadata to describe
per-record and per-sample data in the file, and list filters applied to the records.
Large Indel and Structural Variant Calls
The large indel and structural variant caller uses the series of modules described here,
and then generates output files in VCF 4.1 format.
Before ReadBroker
} StatsGenerator—Computes summary statistics on insert sizes, read orientation, and
  alignment scores for each input BAM file.
} AnomalousReadFinder—Grouper processes chromosomes in chunks. This method
  enables parallel execution and, therefore, faster performance. AnomalousReadFinder
  examines all alignments in a block and classifies reads and read pairs as follows:
  } Classifies reads as either shadow (unaligned) or semialigned (partial or clipped
    alignment).
  } Classifies read pairs as either InsertionPair, DeletionPair, InversionPair,
    TandemDuplicationPair, or ChimericPair, according to the type of structural
    variant with which an anomalously mapped read pair is associated.
} ClusterFinder—Clusters reads based on their type and the position of their
  alignment. Only reads of the same type are clustered together at this stage, except
  shadow and semialigned reads, which can be clustered together.
} ClusterMerger—Associates clusters of various anomalous read types with the
  shadow and semialigned read clusters that the same breakpoint can cause. A
  breakpoint is a pair of bases that are adjacent in the sample genome but not in the
  reference. Two clusters are merged if they share a read or if they agree on the
  position and length of the structural variant. This information is inferred from read
  alignment orientation and distance.
ReadBroker
} Interchromosomal translocations yield chimeric read pairs where 1 read aligns to
one chromosome and its partner aligns to another. Because Grouper examines each
chromosome individually, the ReadBroker step is performed to join the information
from chimeric read pairs across chromosomes.
After ReadBroker
} SmallAssembler—Assembles reads in clusters into contigs using a de Bruijn method
  and iteratively assembles reads into contigs until all reads in the cluster are
  assembled. It also produces a file containing the reads that were used to assemble
  the contig, with a realignment to the contig sequence.
} SpanContigs—Uses the presence of nearby anomalous read pairs to determine
  whether to extend the search range used by the subsequent AlignContig step from its
  default.
} AlignContig—Computes a dynamic programming alignment of a contig to a region
  of the reference genome; merges full or partial duplicate calls of the same event into
  a single call.
} VariantFilter—Removes all structural variants that overlap with gaps identified in
  UCSC gaps. The UCSC gaps file defines regions of the genome that have not been
  sequenced.
} DeletionGenotyper—Assigns a genotype to all deletions.
} SomaticGenotyper—Assigns a quality score (Q-score) to all structural variants.
  Higher Q-scores indicate a higher probability that this structural variant is somatic.
Somatic Genotyping
The Somatic Genotyping module simultaneously analyzes 2 BAM files with read
alignments, 1 for a normal sample and 1 for the tumor. Mantra assumes that each BAM
file contains reads from only one sample and tracks the sample membership of each
read through the workflow.
The Somatic Genotyping module attempts to estimate the probability of a variant being
somatic given the read coverage in normal and tumor samples. To do so, it gathers a
pool of breakpoint-associated reads from each sample and realigns them to the reference
and putative somatic allele to estimate allele support from each sample.
Pre-Scoring Alignment
As an initial step in the process, candidate variants smaller than minSomaticCallSize are
filtered out. Remaining candidate variants that meet the following criteria are then
excluded from somatic scoring (as they are deemed to be likely false positives from the
onset):
1 Candidate variants where the variant contig is constructed from an equal or
  greater number of normal sample reads compared to tumor sample reads
  (following coverage-based normalization).
2 Candidate variants having fewer than 2 tumor sample reads used to construct the
  variant contig (parameter “minAnomReadSuppCancer”).
3 Candidate variants having more than 10 normal sample reads used to construct
  the variant contig (parameter “maxAnomReadSuppNormal”).
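The pre-scoring exclusions listed above can be sketched as one predicate. This is an illustration under assumptions, not the module's code: the guide does not give a value for minSomaticCallSize, so it is a required argument here, and the coverage-based normalization is assumed to be a simple scaling of the normal read count by a hypothetical `coverage_ratio` (tumor coverage divided by normal coverage). Whether criterion 3 uses raw or normalized counts is also an assumption (raw counts are used).

```python
def passes_prescoring(variant_size, tumor_reads, normal_reads,
                      min_somatic_call_size, coverage_ratio=1.0,
                      min_anom_read_supp_cancer=2,
                      max_anom_read_supp_normal=10):
    """Return True if a candidate variant goes on to somatic scoring.

    tumor_reads and normal_reads are the numbers of reads from each
    sample that were used to construct the variant contig.
    """
    if variant_size < min_somatic_call_size:
        return False  # too small to be scored
    # Criterion 1: normal support (after normalization) >= tumor support.
    if normal_reads * coverage_ratio >= tumor_reads:
        return False
    # Criterion 2: fewer than minAnomReadSuppCancer tumor reads.
    if tumor_reads < min_anom_read_supp_cancer:
        return False
    # Criterion 3: more than maxAnomReadSuppNormal normal reads.
    if normal_reads > max_anom_read_supp_normal:
        return False
    return True


# Usage with made-up counts and a made-up minimum call size of 50.
print(passes_prescoring(100, tumor_reads=17, normal_reads=1,
                        min_somatic_call_size=50))
```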
After this pre-processing filter, a pool of breakpoint-associated reads is found for each
sample. This will be used for realignment to both the reference allele and putative
somatic allele to find evidence for support of each allele in each sample. The
breakpoint-associated read pool is built from two sources: the first source is the reads used to
assemble the variant contig; the second source comprises all reads (from the BAM files)
that align within 10 bases of the predicted breakpoints. Reads that are flagged as PCR
duplicates, unmapped, anomalous, or with a MAPQ score < 20 are not included.
The reads in the breakpoint-associated reads pools are then realigned to the reference
and putative somatic alleles using a Smith-Waterman alignment. To have sufficient
sequence context, the reference and intra-chromosomal somatic alleles are extended to
have at least 120 bp on either side of the variant breakpoints. Read alignments that have
mismatches in more than 10% of their sequence or have more than 3 gaps are not
eligible to count as support for an allele. In addition, the read alignment score must favor
the reference or somatic allele by at least (matchscore-mismatch score)*10 to be included
as a supporting allele count; the realigned read must cross one of the predicted
breakpoints by at least 8 bases.
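The eligibility rules in the preceding paragraph can be sketched as follows. The `Realignment` record, the function names, and the decision to apply the mismatch, gap, and breakpoint checks to the favored alignment are assumptions made for illustration. The guide does not state the match and mismatch scores, so they are passed in as arguments.

```python
from dataclasses import dataclass


@dataclass
class Realignment:
    """Hypothetical summary of one read realigned to one allele."""
    score: int        # Smith-Waterman alignment score
    mismatches: int   # number of mismatched bases
    gaps: int         # number of gaps in the alignment
    overhang: int     # bases by which the read crosses a predicted breakpoint


def eligible(aln, read_length):
    """Apply the mismatch, gap, and breakpoint-crossing limits."""
    return (aln.mismatches <= 0.10 * read_length
            and aln.gaps <= 3
            and aln.overhang >= 8)


def supported_allele(ref_aln, alt_aln, read_length, match_score, mismatch_score):
    """Return "somatic", "reference", or None if the read supports neither."""
    margin = (match_score - mismatch_score) * 10
    difference = alt_aln.score - ref_aln.score
    if difference >= margin and eligible(alt_aln, read_length):
        return "somatic"
    if -difference >= margin and eligible(ref_aln, read_length):
        return "reference"
    return None


# Usage with made-up scores (match = 2, mismatch = -2, so the margin is 40).
ref = Realignment(score=150, mismatches=9, gaps=0, overhang=20)
alt = Realignment(score=200, mismatches=2, gaps=1, overhang=20)
print(supported_allele(ref, alt, read_length=100, match_score=2, mismatch_score=-2))
```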
Somatic Quality Scores
After the prescoring steps, Mantra uses a probabilistic model to estimate Q-scores.
Specifically, given counts of anomalous and normal reads in tumor and normal BAM
files, Mantra estimates the probability of observing a given or more extreme number of
anomalous reads in a tumor sample using the Fisher Exact Test. This estimate is written
as a Phred-scaled score: Q-score = -10 log10(p), where p is the probability from the
Fisher Exact Test. A smaller p therefore gives a higher Q-score, which indicates a higher
probability that the variant is somatic.
Inspection of somatic variant calls in both real and simulated data suggests that a Q-score
threshold of 30 provides a good trade-off between sensitivity and specificity. The
maximum achievable score is 60; the lowest is 0.
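A Phred-scaled Fisher Exact Test score of this kind can be sketched with the standard library alone. The sketch assumes a one-sided test (the tail with at least the observed number of anomalous tumor reads) and clamps the result to the stated range of 0 to 60. The function names are hypothetical, and the exact test variant used by the module is not stated in this guide, so the values are not expected to reproduce reported scores exactly.

```python
from math import comb, log10

MAX_Q_SCORE = 60  # maximum achievable score stated in this guide


def fisher_upper_tail(tumor_anom, tumor_norm, normal_anom, normal_norm):
    """One-sided Fisher exact probability of >= tumor_anom anomalous tumor reads."""
    tumor_total = tumor_anom + tumor_norm
    anom_total = tumor_anom + normal_anom
    total = tumor_total + normal_anom + normal_norm
    numerator = 0
    for k in range(tumor_anom, min(anom_total, tumor_total) + 1):
        # Hypergeometric term: k anomalous reads fall in the tumor sample.
        numerator += comb(anom_total, k) * comb(total - anom_total, tumor_total - k)
    return numerator / comb(total, tumor_total)


def somatic_q_score(tumor_anom, tumor_norm, normal_anom, normal_norm):
    """Phred-scale the tail probability and clamp it to [0, MAX_Q_SCORE]."""
    p = fisher_upper_tail(tumor_anom, tumor_norm, normal_anom, normal_norm)
    if p <= 0.0:
        return float(MAX_Q_SCORE)
    return min(float(MAX_Q_SCORE), max(0.0, -10.0 * log10(p)))


# Usage with made-up read counts.
print(round(somatic_q_score(17, 24, 1, 22), 1))
```

Using exact integer binomial coefficients avoids floating-point underflow for deep coverage; the division is performed only once at the end.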
Somatic Genotyping VCF
The Somatic Genotyping module generates a modified VCF file with records having a
SOMATICSCORE key in the INFO field and a modified READSOURCES.
This is shown in the following example:
chr1 5618276 chr1uMantrauDELu0u3420
TGAGTGAATGAATGAGGGAATGAATGAATGAGG T 101 PASS
SOMATICSCORE=31;NS=1;SVTYPE=DEL;SVLEN=32;READSOURCES=(0:17:24,1:1:22);
UPSTREAM=AATGAATGAGTGAATGCATGAATGAATGAGTGAATGAATGAATGAGTGAATGCATGAATGAGTGAATAAGT;
DOWNSTREAM=GAATGAATGAATGAGTGAATAATTGAGTGAGTGAATGAATGA;
CONTIG=AATGAATGAGTGAATGCATGAATGAATGAGTGAATGAATGAATGAGTGAATGCATGAATGAGTGAATAAGTGAATGAATGAATGAGTGAATAATTGAGTGAGTGAATGAATGA;
CONTIG_NUM=9130;CLUSTER_NUM=7630
GT 1/.
Values for READSOURCES (0:17:24,1:1:22) can be interpreted as follows:
} 0—First/tumor sample
} 17—Number of anomalous reads in tumor sample
} 24—Number of normal reads in tumor sample
} 1—Second/normal sample
} 1—Number of anomalous reads in normal sample
} 22—Number of normal reads in normal sample
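The READSOURCES layout explained above (sample index, anomalous read count, normal read count, one group per sample) can be parsed with a few lines of Python. The `SampleReads` type and `parse_readsources` function are hypothetical helper names, not part of any Illumina tool.

```python
from typing import NamedTuple


class SampleReads(NamedTuple):
    sample_index: int  # 0 = first/tumor sample, 1 = second/normal sample
    anomalous: int     # number of anomalous reads in the sample
    normal: int        # number of normal reads in the sample


def parse_readsources(value):
    """Parse a READSOURCES value such as "(0:17:24,1:1:22)"."""
    groups = value.strip().strip("()").split(",")
    result = []
    for group in groups:
        index, anomalous, normal = (int(field) for field in group.split(":"))
        result.append(SampleReads(index, anomalous, normal))
    return result


for sample in parse_readsources("(0:17:24,1:1:22)"):
    print(sample)
```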
Mantra tags some somatic structural variants as having low confidence. The following
VCF filter flags are used to indicate low confidence:
} MaxSVLEN—The somatic variant is larger than the default maximum length of 10 kb for
confident SV calls.
} LowSomaticScore—Somatic variants with a Q-score of less than 30 are still listed in
the output file. They are marked using this flag to indicate that they are of lower
confidence.
} LowMAPQ—The fraction of reads that are non-anomalous and above the minimum
MAPQ threshold in the normal sample is less than 0.8.
} MaxDepth—Normal sample site depth is greater than 3x the mean chromosome
depth.
} NormalSupport—At least one read in the normal sample strongly supports the
putative somatic allele.
} MinSampleCount—Fewer than 8 reads support any allele in one of the samples.
Copy Number Aberrations (SENECA)
The copy number aberrations module is also referred to as SENECA (SEnsitive detection
of copy NumbErs in CAncer). It identifies copy number aberrations (CNAs) in
heterogeneous tumor samples that exhibit contamination with normal tissues,
aneuploidy, and loss of heterozygosity (LOH) that can confound correct copy assignment
and lead to erroneous CNA calls.
The algorithm workflow comprises 2 distinct steps:
} Segmentation of data into regions with putatively distinct copy numbers.
} Calculation of ploidy and purity with a final copy number assignment.
As input, SENECA uses aligned sequences from tumor and matched normal samples
and annotation information about the location of known variants in dbSNP, regional
alignability, and the location of gaps in dbSNP.
Segmentation
SENECA is a count-based method to assign copy number state. It compares coverage
between tumor and normal samples. Specifically, it bins read coverage using
nonoverlapping 1 kb windows to derive counts in tumor and normal samples, and it
then takes the ratio of the 2 counts. Bins are skipped during segmentation when they
overlap low alignability regions in more than 20% of their size.
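The binning step described above can be sketched as follows. This is a simplified illustration, not the SENECA implementation: reads are assigned to a bin by their start position, bins with no normal reads are skipped to avoid division by zero, and the per-bin count of low-alignability bases is supplied by the caller. All three choices, and the function names, are assumptions.

```python
BIN_SIZE = 1000              # nonoverlapping 1 kb windows
MAX_LOW_ALIGNABILITY = 0.20  # skip bins with more than 20% low-alignability bases


def bin_counts(read_starts, n_bins, bin_size=BIN_SIZE):
    """Count reads per bin, assigning each read by its start position."""
    counts = [0] * n_bins
    for start in read_starts:
        bin_index = start // bin_size
        if 0 <= bin_index < n_bins:
            counts[bin_index] += 1
    return counts


def coverage_ratios(tumor_starts, normal_starts, low_alignability_bases,
                    bin_size=BIN_SIZE):
    """Return the tumor/normal count ratio per bin, or None for skipped bins."""
    n_bins = len(low_alignability_bases)
    tumor = bin_counts(tumor_starts, n_bins, bin_size)
    normal = bin_counts(normal_starts, n_bins, bin_size)
    ratios = []
    for b in range(n_bins):
        skipped = low_alignability_bases[b] > MAX_LOW_ALIGNABILITY * bin_size
        if skipped or normal[b] == 0:
            ratios.append(None)
        else:
            ratios.append(tumor[b] / normal[b])
    return ratios


# Usage with made-up read start positions covering 3 bins.
tumor_starts = [10, 20, 1500, 1600, 1700, 2100]
normal_starts = [5, 15, 1200, 1300, 2500]
print(coverage_ratios(tumor_starts, normal_starts, [0, 100, 300]))
```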
Independently, SENECA calculates B allele ratios at dbSNP positions from a tumor BAM
file, and it keeps only SNVs that are heterozygous in the corresponding normal sample.
Segmentation is carried out independently for copy number and B allele ratios.
Ploidy and Purity Calculation
Following segmentation, SENECA performs ploidy and purity calculations. These
calculations are based on the principle that for each value of ploidy and purity and a
selected copy number, the values of B allele and read count ratios are inferred.
For example, for copy number state 1 (1 deleted allele of a diploid genome), the B allele
ratio is always near 0 because only 1 allele is present. However, if a tumor sample has
only 70% purity because of the presence of the normal genome as background,
the B allele ratio increases due to the presence of a heterozygous normal allele. The low
percentage of purity results in a final B allele ratio of 0.15.
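The 0.15 figure in this example is consistent with a simple purity-weighted average of the B allele ratios of the two components: 0 for the tumor (only 1 allele present) and 0.5 for the heterozygous normal background. The sketch below assumes that simple linear mixture. It ignores that the tumor and normal components contribute different numbers of allele copies, and the function name is hypothetical.

```python
def expected_b_allele_ratio(purity, tumor_ratio, normal_ratio=0.5):
    """Purity-weighted average of tumor and normal B allele ratios.

    purity is the tumor fraction of the sample (0 to 1). This linear
    mixture is an illustrative assumption, not the SENECA model.
    """
    return purity * tumor_ratio + (1.0 - purity) * normal_ratio


# Copy number state 1 (B allele ratio 0 in the tumor) at 70% purity.
print(round(expected_b_allele_ratio(0.70, 0.0), 2))  # 0.15
```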
SENECA fits a multivariate Gaussian distribution to copy data and B allele ratio data on
a two-dimensional grid of varying ploidy and purity. On the grid, each state encodes
ploidy and purity values. In addition, SENECA uses a separate state encoding copy
neutral LOH and copy gain LOH to identify loss-of-heterozygosity events.
Ploidy and purity associated with the model that has the highest log-likelihood are then
used to assign a copy number state to each segment. When both segments and copy
numbers are estimated, a quality score for copy number assignment is computed using a
likelihood ratio test. This test compares the likelihood of a current copy number
assignment to a likelihood of assigning 1 more or 1 less copy. Results of the likelihood
ratio test are then reported as a Q-score field in the VCF file using the following
transformation: 2*log (s1/s2), where s1 is a sum of squares for the selected model and s2 is a
sum of squares for the next nearest model. A Q-score threshold of 1.5 provides a good
trade-off between sensitivity and specificity.
Analysis Output
To view the results, navigate to BaseSpace, click the Projects tab, then the project name,
and then the analysis.
Figure 4 Tumor Normal Output Navigation Bar
After analysis is complete, access the output through the left navigation bar.
} Analysis Info—Information about the analysis session, including log files.
} Inputs—Overview of input settings.
} Output Files—Output files for the sample.
} Analysis Reports—A list of reports for a single sample pair.
Analysis Info
The Analysis Info page displays the analysis settings and execution details.
Row Heading       Definition
Name              Name of the analysis session.
Application       App that generated this analysis.
Date Started      Date and time the analysis session started.
Date Completed    Date and time the analysis session completed.
Duration          Duration of the analysis.
Session Type      Multi-Node or Single-Node
Status            Status of the analysis session. The status shows either Running
                  or Complete and the number of nodes used.
Log Files
File Name               Description
CompletedJobInfo.xml    Contains information about the completed analysis session.
Logging.zip             Contains all detailed log files for each step of the workflow.
SampleSheet.csv         Sample sheet.
SampleSheetUsed.csv
WorkflowError.txt       Contains error messages created when running the workflow.
WorkflowLog.txt         Contains details about workflow steps, command line calls
                        with parameters, timing, and progress.
Output Files
The Output Files page provides access to the output files for each sample analysis.
} BAM Files
} VCF Files
} Genome VCF Files
BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to
represent aligned sequences up to 128 Mb. SAM and BAM formats are described in
detail at https://samtools.github.io/hts-specs/SAMv1.pdf.
BAM files use the file naming format of SampleName_S#.bam, where # is the sample
number determined by the order that samples are listed for the run.
BAM files contain a header section and an alignment section:
} Header—Contains information about the entire file, such as sample name, sample
length, and alignment method. Alignments in the alignments section are associated
with specific information in the header section.
} Alignments—Contains read name, read sequence, read quality, alignment
information, and custom tags. The read name includes the chromosome, start
coordinate, alignment quality, and the match descriptor string.
The alignments section includes the following information for each read or read pair:
} RG: Read group, which indicates the number of reads for a specific sample.
} BC: Barcode tag, which indicates the demultiplexed sample ID associated with the
read.
} SM: Single-end alignment quality.
} AS: Paired-end alignment quality.
} NM: Edit distance tag, which records the Levenshtein distance between the read and
the reference.
} XN: Amplicon name tag, which records the amplicon tile ID associated with the
read.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.
VCF File Format
Variant Call Format (VCF) is a widely used file format developed by the genomics
scientific community that contains information about variants found at specific positions
in a reference genome.
VCF File Header—Includes the VCF file format version and the variant caller version.
The header lists the annotations used in the remainder of the file. If MARS is listed, the
Illumina internal annotation algorithm annotated the VCF file. The VCF header includes
the reference genome file and BAM file. The last line in the header contains the column
headings for the data lines.
VCF File Data Lines—Each data line contains information about a single variant.
VCF File Headings
Heading
Description
CHROM
The chromosome of the reference genome. Chromosomes appear in
the same order as the reference FASTA file.
POS
The single-base position of the variant in the reference chromosome.
For SNPs, this position is the reference base with the variant; for
insertions or deletions, this position is the reference base immediately
before the variant.
ID
The rs number for the SNP obtained from dbSNP.txt, if applicable.
If there are multiple rs numbers at this location, the list is semicolon
delimited. If no dbSNP entry exists at this position, a missing value
marker ('.') is used.
REF
The reference genotype. For example, a deletion of a single T is
represented as reference TT and alternate T. An A to T single nucleotide
variant is represented as reference A and alternate T.
ALT
The alleles that differ from the reference read.
For example, an insertion of a single T is represented as reference A and
alternate AT. An A to T single nucleotide variant is represented as
reference A and alternate T.
QUAL
A Phred-scaled quality score assigned by the variant caller.
Higher scores indicate higher confidence in the variant and lower
probability of errors. For a quality score of Q, the estimated probability
of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error
rate. Many variant callers assign quality scores based on their statistical
models, and these scores are high in relation to the error rate observed.
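The relationship between a Phred-scaled score Q and the estimated error probability can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from the app.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability of an error for a Phred-scaled score q: 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Inverse conversion: Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_probability(q):.4%}")
```

Each step of 10 in Q lowers the estimated error probability by a factor of 10.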
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number
determined by the order that samples are listed for the run.
VCF File Annotations
Heading
Description
FILTER
If all filters are passed, PASS is written in the filter column.
• LowDP—Applied to sites with depth of coverage below a cutoff.
• LowGQ—The genotyping quality (GQ) is below a cutoff.
• LowQual—The variant quality (QUAL) is below a cutoff.
• LowVariantFreq—The variant frequency is less than the given
threshold.
• R8—For an indel, the number of adjacent repeats (1-base or 2-base)
in the reference is greater than 8.
• SB—The strand bias is more than the given threshold. Used with the
Somatic Variant Caller and GATK.
INFO
Possible entries in the INFO column include:
• AC—Allele count in genotypes for each ALT allele, in the same order
as listed.
• AF—Allele Frequency for each ALT allele, in the same order as listed.
• AN—The total number of alleles in called genotypes.
• CD—A flag indicating that the SNP occurs within the coding region
of at least 1 RefGene entry.
• DP—The depth (number of base calls aligned to a position and used
in variant calling).
• Exon—A comma-separated list of exon regions read from RefGene.
• FC—Functional Consequence.
• GI—A comma-separated list of gene IDs read from RefGene.
• QD—Variant Confidence/Quality by Depth.
• TI—A comma-separated list of transcript IDs read from RefGene.
FORMAT
The format column lists fields separated by colons. For example,
GT:GQ. The list of fields provided depends on the variant caller used.
Available fields include:
• AD—Entry of the form X,Y, where X is the number of reference calls,
and Y is the number of alternate calls.
• DP—Approximate read depth; reads with MQ=255 or with bad mates
are filtered.
• GQ—Genotype quality.
• GQX—Genotype quality. GQX is the minimum of the GQ value and
the QUAL column. In general, these values are similar; taking the
minimum makes GQX the more conservative measure of genotype
quality.
• GT—Genotype. 0 corresponds to the reference base, 1 corresponds
to the first entry in the ALT column, and so on. The forward slash (/)
indicates that no phasing information is available.
• NL—Noise level; an estimate of base calling noise at this position.
• PL—Normalized, Phred-scaled likelihoods for genotypes.
• SB—Strand bias at this position. Larger negative values indicate less
bias; values near 0 indicate more bias. Used with the Somatic Variant
Caller and GATK.
• VF—Variant frequency; the percentage of reads supporting the
alternate allele.
SAMPLE
The sample column gives the values specified in the FORMAT column.
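The column layout described in the tables above can be illustrated by splitting one data line into its fields. The following is a minimal sketch that assumes a single-sample, tab-separated data line; the function names and the example values are hypothetical. The gqx helper applies the rule stated above, taking the minimum of the GQ value and the QUAL column.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT", "SAMPLE"]


def parse_vcf_line(line):
    """Split one single-sample VCF data line into a dictionary keyed by column heading."""
    record = dict(zip(VCF_COLUMNS, line.rstrip("\n").split("\t")))
    record["POS"] = int(record["POS"])

    # INFO entries are semicolon separated; an entry is either KEY=VALUE or a flag.
    info = {}
    for entry in record["INFO"].split(";"):
        if entry and entry != ".":
            key, _, value = entry.partition("=")
            info[key] = value if value else True
    record["INFO"] = info

    # FORMAT lists the field names; SAMPLE gives the values in the same order.
    names = record["FORMAT"].split(":")
    values = record["SAMPLE"].split(":")
    record["SAMPLE"] = dict(zip(names, values))
    return record


def gqx(record):
    """Minimum of the GQ value and the QUAL column, as described for GQX."""
    return min(float(record["SAMPLE"]["GQ"]), float(record["QUAL"]))


if __name__ == "__main__":
    # Hypothetical data line, for illustration only.
    line = "chr1\t12345\t.\tA\tT\t48.0\tPASS\tDP=120;CD\tGT:GQ:AD\t0/1:35:90,30"
    rec = parse_vcf_line(line)
    print(rec["CHROM"], rec["POS"], rec["SAMPLE"]["GT"], gqx(rec))
```

Header lines (those starting with #) are not handled here and would need to be skipped before calling parse_vcf_line.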
Genome VCF Files
Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for
representing all sites within the genome in a reasonably compact format. The gVCF files
include all sites within the region of interest in a single file for each sample.
The gVCF file shows no-calls at positions with low coverage, or where a low-frequency
variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the
reference. A genotype (GT) tag of ./. indicates a no-call.
For more information, see sites.google.com/site/gvcftools/home/about-gvcf.
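Because a GT tag of ./. marks a no-call, no-call positions can be counted from the genotype strings alone. The following is a minimal sketch; the function names are hypothetical, and treating a phased or single-allele missing genotype (such as .|. or .) as a no-call is an assumption that goes beyond the ./. case described above.

```python
def is_no_call(gt):
    """Return True when every allele in the GT string is the missing value '.'."""
    # Assumption: '|' (phased) is treated the same way as '/' (unphased).
    alleles = gt.replace("|", "/").split("/")
    return all(allele == "." for allele in alleles)


def count_no_calls(genotypes):
    """Count the no-call entries in an iterable of GT strings."""
    return sum(1 for gt in genotypes if is_no_call(gt))


if __name__ == "__main__":
    # Hypothetical genotype values, for illustration only.
    print(count_no_calls(["0/0", "./.", "0/1", "./."]))
```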
Sample Summary Reports
The Tumor Normal App provides an overview of statistics per sample in the Analysis
Reports sample pages. To download the statistics, click PDF Summary Report.
} Somatic Analysis Summary
} Normal Sample Summary
} Tumor Sample Summary
NOTE
For more information about the summary report, see Molecular Characterization of
Tumors Using Next-Generation Sequencing.
Somatic Analysis Summary
The Tumor Normal App provides a summary of the somatic analysis statistics on the
tumor and normal samples. To download the statistics, click PDF Summary Report.
Sample Information
Table 1 Sample Information
Statistic
Definition
Gigabases Passing
Filter
Number of gigabases passing filter for this sample.
% Bases ≥ Q30
The percentage of bases with a quality score of 30 or higher.
Purity
Estimates the amount of signal for the tumor sample from
normal cells as background.
For more details, see Ploidy and Purity Calculation on page 16.
Ploidy
Estimates the copy number variations for the entire tumor
genome.
For more details, see Ploidy and Purity Calculation on page 16.
Variants Summary
Table 2 Somatic Small Variants Summary
Statistic
Definition
Total
The total number of variants present in the data set that pass
the quality filters.
Number in Genes
The number of variants in a gene.
Number in Exons
The number of variants in an exon.
Number in Coding
Regions
The number of variants in a coding region.
Splice Site Region
The number of variants in a splice site region.
Stop Gained
The number of variants that cause an additional stop codon.
Stop Lost
The number of variants that cause the loss of a stop codon.
Frameshift
The number of variants that cause a frameshift.
Non-synonymous
The number of variants that cause an amino acid change in a
coding region.
Synonymous
The number of variants that are within a coding region but do not
cause an amino acid change.
Mature miRNA
The number of variants in a mature miRNA.
UTR Region
The number of variants in an untranslated region (UTR).
dbSNP
The number of variants present in dbSNP.
Table 3 Somatic Small Variants
Statistics
Definition
Chr
Name of reference chromosome.
Pos
The reference position within the chromosome.
Depth
The number of reads aligned at this position.
Ref
The reference allele.
Alt
The alternative allele.
Alt Freq
The proportion of the alternative allele among all alleles being
considered.
Type
The type of small variant (SNV, Insertion, Deletion).
Consequence
Predicted transcript consequence.
dbSNP
The numeric identifier developed by the National Center for
Biotechnology Information (NCBI) for the Single Nucleotide
Polymorphism Database (dbSNP).
COSMIC
The numeric identifier for the variant in the Catalogue of
Somatic Mutations in Cancer (COSMIC) database.
ClinVar
A public archive of reports of the relationships among human
variations and phenotypes.
Table 4 Somatic Structural Variants Summary
Variant Class
Notes
CNV
The method to determine copy number variations is described in Copy
Number Aberrations (SENECA) on page 16.
Deletions, tandem duplications, insertions, inversions, translocation
breakends, duplications (DUP)
For more information regarding the criteria used to determine these
structural variants, see Large Indel and Structural Variant Calls on page 13.
Table 5 Somatic Structural Variants
Statistics
Definition
Chr
Name of reference chromosome.
Pos
Position within reference chromosome.
Len
Length difference between reference allele and alternative
allele.
Type
Structural variant type.
Qual
Structural variant quality score.
Table 6 Somatic Copy Number Variants
Statistics
Definition
Chr
Name of reference chromosome.
Pos
The reference position within the chromosome.
Len
The estimated length of the copy number variant.
Qual
The quality score of the copy number variant.
Copy Number
The number of copies of the copy number variant.
LOH
Loss of heterozygosity of the copy number variant.
Table 7 Somatic Translocation Variants
Statistics
Definition
Chr
Name of reference chromosome.
Pos
Position within reference chromosome.
Ref
The reference allele.
Alt
The alt allele.
Qual
Structural variant quality score.
Circos Plot of Somatic Variations
The circos plot provides visualization of somatic small variation, ploidy, and structural
variations reported in the somatic variation files (VCF). The circos plot displays somatic
variation data in tracks with chromosomes circularly arranged. Following is an example
legend. Labels are described from inside the circle to the outside.
Legend
Label (From Inner Circle to Outer Circle)
Description
A
Somatic structural variants
The somatic structural variants detailed in somatic.SVs.vcf are
plotted in the center of the plot.
Green links: Segmental duplications (at the center of the circle).
Green boxes: Inversions (the first inner track).
Purple boxes: Deletions (the second track). The width of the
boxes indicates the length of SVs.
Purple bars: Insertion breakpoints (the third track).
Red links: Translocations. The ends of the links indicate the 2
breakpoints of SVs.
B
Number of somatic indels per Mb
The density of PASS somatic indels reported in
somatic.indels.vcf.gz in 1 Mb windows.
The scale of the Y-axis in the histogram indicates the counts.
C
Number of somatic SNVs per Mb
The density of PASS somatic SNVs reported in somatic.snvs.vcf
in 1 Mb windows, arbitrarily scaled in a histogram with the Y-axis
pointing inward.
D
Copy-neutral loss of heterozygosity (LOH)
The LOH regions with SNP calls in the normal genome but a
homozygous reference call in the tumor genome, in CNVs.vcf.
E
B-allele frequency
The B-allele ratios calculated by SENECA that will be used in the
ploidy and purity estimation.
F
Called level
The copy number aberrations from the CNVs.vcf file.
The scale of the Y-axis in the histogram indicates the called level.
G
Karyotype
The standard Circos ideogram defining the chromosome
position, identity, and color of cytogenetic bands.
H
Chromosome position
The reference coordinates along the chromosome (in
megabases).
I
Chromosome number
Chromosome number: 1, 2,…,22, X, Y.
J
HGNC symbols for genes harboring variants
HGNC genes impacted by somatic SNVs.
Genes containing SNVs in the coding region with an HGNC
symbol are labeled.
K
Genes of nonsynonymous variants
Genes identified in (J) resulting in non-synonymous changes in
the coding region are highlighted in red.
Depth/B-Allele Plot
The top plot provides an overview of the depth of coverage by chromosomal position.
Aberrant values indicate copy number variations. Copy number ratios are classified as
gains (red), losses (green), or copy number unchanged (black). Estimated purity
and ploidy values are listed at the top of the plot.
The bottom of the plot displays B-allele ratio by chromosomal position. The B-allele ratio
is the ratio of the 2 alleles A and B. The B-allele ratios that are around 0.5 are filtered out
for diploid regions.
Normal Sample
The Tumor Normal App provides an overview of statistics on the normal sample. To
download the statistics, click PDF Summary Report.
Alignment Summary
Table 8 Alignment Summary
Statistic
Definition
Number of Reads
Total number of reads passing filter for this sample.
Coverage
Total number of aligned bases divided by the genome size.
Percent Duplicate
Paired Reads
Percentage of paired reads that have duplicates.
Fragment Length
Median
Median length of the sequenced fragment. The fragment
length is calculated based on the locations at which a read pair
aligns to the reference. The read mapping information is
parsed from the BAM files.
Fragment Length
Standard Deviation
Standard deviation of the sequenced fragment length.
Table 9 Read Statistics
Statistic
Definition
Percent Aligned
The percentage of reads passing filter that aligned to the
reference genome.
Percent Q30
The percentage of bases with a quality score of 30 or higher.
Mismatch Rate
The average percentage of mismatches across both reads 1
and 2 over all cycles.
Variants Summary
Table 10 Small Variants Summary
Statistic
Definition
Total Passing
Total number of variants present in the data set that pass the
quality filters.
Percent Found in
dbSNP
100*(number of variants in dbSNP/number of variants).
Het/Hom Ratio
Number of heterozygous/number of homozygous variants.
Ts/Tv Ratio
Transition rate of SNVs that pass the quality filters divided by the
transversion rate of SNVs that pass the quality filters.
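The three ratios in Table 10 can be reproduced from a list of passing variants. The following is a minimal sketch under stated assumptions: the input record layout (ref, alt, genotype, in_dbsnp) and the function name are hypothetical, only single-base substitutions are counted for Ts/Tv, and the Ts/Tv ratio is computed from counts, which equals the ratio of rates when both rates are taken over the same set of SNVs. It is not the app's implementation.

```python
# A<->G and C<->T substitutions are transitions; all other SNVs are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def _ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def small_variant_summary(variants):
    """Summarize passing variants given as dicts with the hypothetical keys
    'ref', 'alt', 'genotype' ('het' or 'hom'), and 'in_dbsnp' (bool)."""
    total = len(variants)
    in_dbsnp = sum(1 for v in variants if v["in_dbsnp"])
    het = sum(1 for v in variants if v["genotype"] == "het")
    hom = sum(1 for v in variants if v["genotype"] == "hom")

    snvs = [v for v in variants if len(v["ref"]) == 1 and len(v["alt"]) == 1]
    ts = sum(1 for v in snvs if (v["ref"], v["alt"]) in TRANSITIONS)
    tv = len(snvs) - ts

    percent = _ratio(in_dbsnp, total)
    return {
        "total_passing": total,
        "percent_in_dbsnp": None if percent is None else 100 * percent,
        "het_hom_ratio": _ratio(het, hom),
        "ts_tv_ratio": _ratio(ts, tv),
    }


if __name__ == "__main__":
    # Hypothetical variants, for illustration only.
    example = [
        {"ref": "A", "alt": "G", "genotype": "het", "in_dbsnp": True},
        {"ref": "C", "alt": "T", "genotype": "hom", "in_dbsnp": True},
        {"ref": "A", "alt": "C", "genotype": "het", "in_dbsnp": False},
        {"ref": "G", "alt": "T", "genotype": "het", "in_dbsnp": True},
        {"ref": "A", "alt": "AT", "genotype": "het", "in_dbsnp": False},
    ]
    print(small_variant_summary(example))
```

Ratios with a zero denominator are returned as None instead of raising an error.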
Table 11 Variants by Sequence Context
Statistic
Definition
Number in Genes
The number of variants in a gene.
Number in Exons
The number of variants in an exon.
Number in Coding
Regions
The number of variants in a coding region.
Number in UTR
Region
The number of variants in an untranslated region (UTR).
Number in Mature
miRNA
The number of variants in a mature miRNA.
Splice Site Region
The number of variants in a splice site region.
Table 12 Variants by Consequence
Statistic
Definition
Frameshift
The number of variants that cause a frameshift.
Non-synonymous
The number of variants that cause an amino acid change in a
coding region.
Synonymous
The number of variants that are within a coding region but do not
cause an amino acid change.
Stop Gained
The number of variants that cause an additional stop codon.
Stop Lost
The number of variants that cause the loss of a stop codon.
Coverage Histogram
The coverage histogram shows the number of reference bases plotted against the depth of
coverage (read depth). It has the following features:
} The drop-down menu lets you view the overall frequency or a specific chromosome.
} The Fix Y Scale checkbox lets you keep the y-axis scale the same when comparing
multiple chromosomes.
} The Export TSV button lets you export the coverage data in a tab-separated text file.
Tumor Sample
The Tumor Normal App provides an overview of statistics on the tumor sample. To
download the statistics, click PDF Summary Report.
Alignment Summary
Table 13 Alignment Summary
Statistic
Definition
Number of Reads
Total number of reads passing filter for this sample.
Coverage
Total number of aligned bases divided by the genome size.
Percent Duplicate
Paired Reads
Percentage of paired reads that have duplicates.
Fragment Length
Median
Median length of the sequenced fragment. The fragment
length is calculated based on the locations at which a read pair
aligns to the reference. The read mapping information is
parsed from the BAM files.
Fragment Length
Standard Deviation
Standard deviation of the sequenced fragment length.
Table 14 Read Statistics
Statistic
Definition
Percent Aligned
The percentage of reads passing filter that aligned to the
reference genome.
Percent Q30
The percentage of bases with a quality score of 30 or higher.
Mismatch Rate
The average percentage of mismatches across both reads 1
and 2 over all cycles.
Coverage Histogram
The coverage histogram shows the number of reference bases plotted against the depth of
coverage (read depth). It has the following features:
} The drop-down menu lets you view the overall frequency or a specific chromosome.
} The Fix Y Scale checkbox lets you keep the y-axis scale the same when comparing
multiple chromosomes.
} The Export TSV button lets you export the coverage data in a tab-separated text file.
Revision History
Document
Date
Description of Change
Document # 15050950 v01
January 2016
Supports Tumor Normal v2.0.
Technical Assistance
For technical assistance, contact Illumina Technical Support.
Table 15 Illumina General Contact Information
Website
www.illumina.com
Email
[email protected]
Table 16 Illumina Customer Support Telephone Numbers
Region
Contact Number
North America
1.800.809.4566
Australia
1.800.775.688
Austria
0800.296575
Belgium
0800.81102
China
400.635.9898
Denmark
80882346
Finland
0800.918363
France
0800.911850
Germany
0800.180.8994
Hong Kong
800960230
Ireland
1.800.812949
Italy
800.874909
Japan
0800.111.5011
Netherlands
0800.0223859
New Zealand
0800.451.650
Norway
800.16836
Singapore
1.800.579.2745
Spain
900.812168
Sweden
020790181
Switzerland
0800.563118
Taiwan
00806651752
United Kingdom
0800.917.0041
Other countries
+44.1799.534000
Safety data sheets (SDSs)—Available on the Illumina website at
support.illumina.com/sds.html.
Product documentation—Available for download in PDF from the Illumina website. Go
to support.illumina.com, select a product, then select Documentation & Literature.
Illumina
5200 Illumina Way
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com