RNA-Seq Alignment v1.0 BaseSpace App Guide

For Research Use Only. Not for use in diagnostic procedures.

Introduction
Workflow
Workflow Diagram
Set Analysis Parameters
Analysis Methods
Analysis Output
Technical Assistance

ILLUMINA PROPRIETARY
1000000006111 v00
February 2016

This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document.

The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s).

FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS OR OTHERS, AND DAMAGE TO OTHER PROPERTY.

ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE).

© 2016 Illumina, Inc. All rights reserved.
Illumina, 24sure, BaseSpace, BeadArray, BlueFish, BlueFuse, BlueGnome, cBot, CSPro, CytoChip, DesignStudio, Epicentre, ForenSeq, Genetic Energy, GenomeStudio, GoldenGate, HiScan, HiSeq, HiSeq X, Infinium, iScan, iSelect, MiniSeq, MiSeq, MiSeqDx, MiSeq FGx, NeoPrep, NextBio, Nextera, NextSeq, Powered by Illumina, SureMDA, TruGenome, TruSeq, TruSight, Understand Your Genome, UYG, VeraCode, verifi, VeriSeq, the pumpkin orange color, and the streaming bases design are trademarks of Illumina, Inc. and/or its affiliate(s) in the U.S. and/or other countries. All other names, logos, and other trademarks are the property of their respective owners.

Introduction

The BaseSpace® RNA-Seq Alignment v1.0 App supports RNA sequencing read alignment, variant calling, gene fusion detection, and novel transcript assembly with Cufflinks. The app can run STAR and tools from the Tuxedo Suite (Bowtie, TopHat, Cufflinks) to produce aligned reads, variant calls, and FPKM abundance estimates of reference genes and transcripts. The app can also perform fusion calling using the Manta-fusion or TopHat-fusion tools.

The Cufflinks Assembly & DE v2.0 App uses the analysis results from the RNA-Seq Alignment App to perform novel transcript merging and differential expression.

Compatible Libraries
See the BaseSpace support page for a list of library types that are compatible with the RNA-Seq Alignment App.

Workflow Requirements
• Supports 100,000–400,000,000 reads per sample.
• Supports 40 billion reads across all samples in a single analysis.
• Supports 35–500 bp read lengths.
• Requires paired-end reads for fusion detection.

Versions
The following components are used in the RNA-Seq Alignment App.
Software | Version
Isis (Analysis Software) | 2.6.25.11
TopHat (Aligner) | 2.1.0
STAR (Aligner) | 2.5.0a
Bowtie (Aligner) | 0.12.9
Bowtie2 (Aligner) | 2.2.6
Isaac Variant Caller | 2.3.13-31-g3c98c29
IONA (Annotation Service) | 1.0.10.37
BEDTools | 2.17.0
Cufflinks | 2.2.1
BLAST | 2.2.26+

Reference Genomes
The following reference genomes are available for alignment:
• Homo sapiens UCSC hg19 (RefSeq & Gencode gene annotations)
• Homo sapiens UCSC hg38 (RefSeq & Gencode gene annotations)
  The human reference genome is PAR-Masked, which means that the Y chromosome sequence has the Pseudo Autosomal Regions (PAR) masked (set to N) to avoid mismapping of reads in the duplicate regions of sex chromosomes.
• Arabidopsis thaliana Ensembl TAIR10 (Ensembl gene annotation)
• Bos taurus UCSC bosTau6 (RefSeq gene annotation)
• Caenorhabditis elegans UCSC ce10 (RefSeq gene annotation)
• Danio rerio UCSC danRer7 (RefSeq gene annotation)
• Drosophila melanogaster UCSC dm3 (RefSeq gene annotation)
• Gallus gallus UCSC galGal4 (RefSeq gene annotation)
• Mus musculus UCSC mm9 (RefSeq gene annotation)
• Mus musculus UCSC mm10 (RefSeq gene annotation)
• Oryza sativa japonica Ensembl IRGSP-1.0 (Ensembl gene annotation)
• Rattus norvegicus UCSC rn5 (RefSeq gene annotation)
• Saccharomyces cerevisiae Ensembl R64-1-1 (Ensembl gene annotation)
• Sus scrofa UCSC susScr3 (RefSeq gene annotation)
• Zea mays Ensembl AGPv3 (Ensembl gene annotation)

Workflow

• Filtering: Bowtie filters the input reads against abundant sequences, such as mitochondrial or ribosomal sequences, as defined by iGenomes at support.illumina.com/sequencing/sequencing_software/igenome.html.
  • Only sequences that do not align against abundant sequences are passed through to the next phase of the analysis. Bowtie filters read pairs when at least 1 read aligns to an abundant sequence.
  • Also, Bowtie trims 2 bases from the 5' end of each read because these 2 bases have a high mismatch rate in RNA-Seq libraries. See Bowtie under Analysis Methods.
• Alignment: The STAR or TopHat2 aligner performs a spliced alignment of the filtered reads against the genome. Based on the user-specified genome, STAR or TopHat aligns reads against known transcripts and splice junctions. See STAR under Analysis Methods.
• Alignment to ERCC: If selected, STAR aligns all reads to the ERCC RNA spike-in sequences, independent of alignment to the transcriptome. The aligner counts reads that align to each spike-in sequence, calculates FPKMs, and computes the correlation between FPKMs and the expected spike-in concentrations.
• Fusion Calling: If selected, the STAR aligner supports Manta-fusion and the TopHat aligner supports TopHat-fusion. First, TopHat2 is used to detect fused alignments. Then, a post-alignment analysis script identifies candidate fusion genes from the fused alignments. See Manta under Analysis Methods.
• Variant Calling: The Isaac Variant Caller performs variant calling, which produces gVCF output files. For stranded library preps, the strand bias filter is disabled.
  • Also, the Isaac variant caller uses the -bsnp-diploid-het-bias parameter to expand the range for the heterozygous variant call, in order to account for allele-specific expression.
  • The Isaac tool uses an RNA-specific random-forest-based variant scoring model, which was built using Platinum Genomes data as a reference. See Isaac Variant Caller under Analysis Methods.
• Quantification: Cufflinks quantifies reference genes and transcripts. RnaReadCounter counts the number of aligned reads matching each annotated gene. See Cufflinks and RnaReadCounter under Analysis Methods.
• Novel Transcript Assembly: If selected, transcripts are assembled and quantified independently for each sample.
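The ERCC and quantification steps above both report FPKM values. As a rough illustration only, the following Python sketch applies the textbook FPKM formula (fragments per kilobase of feature per million mapped fragments). The function name and arguments are hypothetical, and Cufflinks itself uses a statistical model with effective-length and optional bias corrections, so its values will generally differ from this simple ratio.

```python
def fpkm(fragment_count, feature_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of feature per million mapped fragments.

    Illustrative helper, not the app's implementation. Which fragments count
    toward the denominator depends on the Hits Normalization option.
    """
    if feature_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("feature length and total mapped fragments must be positive")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return fragment_count * 1e9 / (feature_length_bp * total_mapped_fragments)


# 100 fragments on a 1,000 bp feature, with 1 million mapped fragments in total
example_value = fpkm(100, 1000, 1_000_000)
```

With these inputs, the feature is 1 kb long and the library has 1 million mapped fragments, so the FPKM equals the raw fragment count.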
Workflow Diagram

Figure 1 RNA-Seq Alignment Workflow

Set Analysis Parameters

1 Navigate to BaseSpace, and then click the Apps tab.
2 In Categories, click RNA-Seq, and then click RNA-Seq Alignment.
3 From the drop-down list, select version 1.0.0, and then click Launch to open the app.
4 In the App Session Name field, enter the analysis name. By default, the analysis name includes the app name, followed by the date and time that the analysis session starts.
5 From the Save Results To field, select the project that stores the app results.
6 From the Samples field, browse to the sample you want to analyze and select the checkbox. You can analyze multiple samples. Select the checkbox to identify samples prepared with the TruSeq Stranded Total RNA and TruSeq Stranded mRNA library prep kits for the first strand and the ScriptSeq v2.0 RNA-Seq library prep kit for the second strand.
7 From the Reference Genome field, select the reference genome to be used for alignment. The default is Homo sapiens (PAR-masked)/hg19 (RefSeq).
8 From the Panel field, select from the following:
  • None (default)
  • TruSight RNA Pan-Cancer
  The TruSight RNA Pan-Cancer library prep kit only supports the Human, UCSC hg19 (RefSeq & Gencode) reference genome.
9 From the Aligner field, select from the following methods:
  • STAR (default)
  • TopHat (Bowtie)
  • TopHat (Bowtie2)
10 [Optional] Select the QC Mode checkbox to analyze only a subset of the read pairs for each sample. If selected, enter the number of read pairs for each sample.
11 [Optional] Select the Novel Transcript Assembly checkbox for Cufflinks to assemble novel transcripts. If selected, select the Adjust Transcript Assembly for Samples Without PolyA Selection checkbox if the samples are prepared without PolyA selection (TruSeq Total RNA kit).
  NOTE
  By default, the Call Fusions checkbox is checked if you selected a panel and an aligner that supports fusion calling.
  Paired-end reads are required. The TopHat (Bowtie) aligner supports TopHat-fusion. The STAR aligner supports Manta-fusion. The TopHat (Bowtie2) aligner does not support fusion calling. For best results, use the STAR aligner with the TruSight RNA Pan-Cancer panel.
12 From the ERCC Spike-In Controls field, select from the following options:
  • None (default)
  • Mix 1
  • Mix 2
13 [Optional] Select the Trim TruSeq Adapters checkbox. This option trims adapter sequences from the FASTQ file. Use this option if adapter trimming was not performed in the demultiplexing.
14 [Optional] Select the Set Advanced Options checkbox to enable the advanced options and then specify the values for the appropriate options.
15 Click Continue.
The RNA-Seq Alignment App begins analysis. When analysis is complete, the app updates the status of the session and sends an email to notify you.

Advanced Options
[Optional] Specify the values for the advanced options.

Table 1 TopHat Options Table
Option | Description
Read Mismatches | Enter a number between 0 and 5. The default is 2. Alignments with more than this number of mismatches are discarded.
Read Gap Length | Enter a number between 0 and 5. The default is 2. Alignments with a total gap length greater than this value are discarded.
Read Edit Distance | Enter a number between 0 and 10. The default is 2. Alignments with more than the selected edit distance are discarded.
Mate Inner Distance | Enter a number between 0 and 300. The default is 50. The expected (mean) inner distance between mate pairs. For paired-end runs with fragments selected at 300 bp, where each end is 50 bp, set this value to 200.
Mate Standard Deviation | Enter a number between 1 and 100. The default is 20. The standard deviation of the distribution of inner distances between mate pairs.
Minimum Intron Length | Enter a number between 10 and the maximum intron length. The default is 70. TopHat ignores donor/acceptor pairs closer than the specified number of bases.
Maximum Intron Length | Enter a number between the minimum intron length and 1,000,000. The default is 500,000. When searching for junctions, TopHat ignores donor/acceptor pairs farther apart than the specified number of bases, except when a pair is supported by a split segment alignment of a long read.
Maximum Insertion Length | Enter a number between 0 and 5. The default is 3.
Maximum Deletion Length | Enter a number between 0 and 5. The default is 3.

Table 2 STAR Options Table
Option | Description
Score Difference to Filter Multimapping Alignments | Enter a number between 1 and 5. The default is 1. When a read aligns to multiple loci, the alignment is reported if its score is in the range of (s - value, s], where "s" is the highest alignment score and "value" is the number that you entered.
Maximum Mismatches | Enter a number between 1 and 21. The default is 10. The output includes alignments that have fewer mismatches than this value.
Maximum Mismatches Over Read Length | Enter a number between 0 and 0.5. The default is 0.3. The output includes alignments that have a ratio of mismatches to mapped length that is less than this value.
Minimum Score Over Read Length | Enter a number between 0.33 and 1. The default is 0.66. The output includes alignments that have a score higher than this value. The score is normalized by read length, which is the sum of the mate lengths for paired-end reads.
Minimum Matches Over Read Length | Enter a number between 0.33 and 1. The default is 0.66. The output includes alignments that have a number of matched bases higher than this value. The number is normalized by read length, which is the sum of the mate lengths for paired-end reads.
Maximum Seed Search Step | Enter a number between 30 and 1000. The default is 50. The seed search starts at position 1 and the step length determines the next start position. The step length cannot be longer than this value.
Minimum Intron Length | Enter a number between 10 and the maximum intron size. The default is 21. A genomic gap is considered an intron when its length is greater than or equal to this value. Otherwise, the gap is considered a deletion.
Maximum Intron Length | Enter 0 or a number between the minimum intron size and 1,000,000. The default is 0. If the value is set to 0, STAR calculates the maximum intron size.
Use Annotation | Select the checkbox to use splice junction information in the annotation. By default, the checkbox is selected.
Two-Pass Mode | Select the STAR 2-pass alignment. The default is Basic.

Table 3 Cuffnorm Options Table
Field | Description
Hits Normalization | Compatible: Cuffnorm counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. This option is the default.
  Total: Cuffnorm counts all fragments, including those not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator.

Table 4 Cufflinks Options Table
Option | Description
Hits Normalization | Compatible: Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator.
  Total: Cufflinks counts all fragments, including the fragments not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator. This option is the default.
Minimum Isoform Fraction | Enter a number between 0.05 and 1. The default is 0.1. After calculating isoform abundance for a gene, Cufflinks filters out transcripts that are low abundance. Isoforms that are expressed at low levels often cannot be reliably assembled. The isoforms can be artifacts of incompletely spliced precursors of processed transcripts. This parameter also filters out introns that have fewer supporting spliced alignments.
Pre-mRNA Fraction | Enter a number between 0 and 1. The default is 0.15. Some RNA-seq protocols produce a significant number of reads that come from incompletely spliced transcripts. These reads can confound the assembly of fully spliced mRNAs. Cufflinks uses this parameter to filter out alignments that are within the intronic intervals. The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads. If the result is lower than the value in this parameter, the intronic alignments are ignored.
Minimum Intron Length | Enter a number between 10 and the maximum intron length. The default is 50.
Maximum Intron Length | Enter a number between the minimum intron length and 600,000. The default is 300,000. Cufflinks does not report transcripts with introns longer than this value and ignores SAM alignments with REF_SKIP CIGAR operations longer than this value.
Minimum Fragments per Transfrag | Enter a number between 5 and 100. The default is 10. Assembled transcript fragments supported by fewer than this number of aligned RNA-seq fragments are not reported.

Table 5 Cuffquant/Cufflinks Options Table
Option | Description
Fragment Bias Correction | Cuffquant/Cufflinks runs a bias detection and correction algorithm, which can improve the accuracy of transcript abundance estimates.
Multi-read Correction | Cuffquant/Cufflinks runs an initial estimation procedure to weight reads mapping to multiple locations in the genome more accurately.
No Effective Length Correction | Cuffquant/Cufflinks disables effective length normalization of transcript FPKM.

Analysis Methods

The RNA-Seq Alignment App uses the following methods to analyze the sequencing data.
• STAR
• Manta
• Tuxedo Suite, which includes Bowtie, Bowtie2, TopHat, and Cufflinks
• Isaac Variant Caller

STAR
Spliced Transcripts Alignment to a Reference (STAR) is a fast RNA-seq read mapper, with support for splice-junction and fusion read detection.
STAR aligns reads by finding the Maximal Mappable Prefix (MMP) hits between reads (or read pairs) and the genome, using a Suffix Array index. Different parts of a read can be mapped to different genomic positions, corresponding to splicing or RNA-fusions. The genome index includes known splice-junctions from annotated gene models, allowing for sensitive detection of spliced reads. STAR performs local alignment, automatically soft clipping ends of reads with high mismatches.
We recommend using STAR because it can quickly align more reads than other aligner methods.
For more information, see https://github.com/alexdobin/STAR.

Manta
Manta calls structural variants (SVs) from mapped paired-end sequencing reads. Manta discovers candidate SVs from discordant pair and split-read alignments, followed by local assembly and realignment to refine candidates. In combination with STAR, the app uses Manta on RNA-seq data to detect gene fusions, which appear as translocations in the RNA alignments. The Manta workflow is followed by RNA-specific filtering and scoring, which is based on the following:
• Read counts across the fusion and alignment qualities.
• Genome-wide realignment of fusion contigs to filter candidates that can be explained by a local alignment elsewhere in the genome.
• Length of coverage around the breakpoints, indicating presence of stable fusion transcripts.
For more information, see https://github.com/Illumina/manta.

Tuxedo Suite
The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualizations, and quality control metrics.

Bowtie
Bowtie is an ultrafast, memory-efficient aligner designed to quickly align large sets of short reads to large genomes.
Bowtie indexes the genome to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB for single-read alignment or 2.9 GB for paired-end alignment. Multiple processors can be used simultaneously to achieve greater alignment speed.
Bowtie forms the basis for other tools like TopHat, a fast splice junction mapper for RNA-seq reads, and Cufflinks, a tool for transcriptome assembly and isoform quantitation from RNA-seq reads.
For more information, see http://bowtie-bio.sourceforge.net/index.shtml.

Bowtie 2
Bowtie 2 can quickly align large sets of short DNA sequences to large genomes. You can use Bowtie 2 to align reads of about 50 to 100s or 1,000s of characters. For the human genome, the memory footprint is approximately 3.2 GB.
Bowtie 2 forms the basis for other tools like TopHat, a fast splice junction mapper for RNA-seq reads, and Cufflinks, a tool for transcriptome assembly and isoform quantitation from RNA-seq reads.
For more information, see http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.

TopHat
TopHat is a fast splice junction mapper for RNA-seq reads that can only be used with Bowtie or Bowtie2. TopHat uses Bowtie or Bowtie2 to map RNA-seq reads, and then it analyzes the mapping results to identify splice junctions between exons.
For more information, see http://ccb.jhu.edu/software/tophat/index.shtml.

Cufflinks
Cufflinks assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation of the transcriptome.
For more information, see cole-trapnell-lab.github.io/cufflinks/.

Isaac Variant Caller
The Isaac variant caller identifies single nucleotide variants (SNVs) and small indels using the following steps:
• Read filtering: Filters reads failing quality checks.
• Indel calling: Identifies a set of possible indel candidates and realigns all reads overlapping the candidates using a multiple sequence aligner.
• SNV calling: Computes the probability of each possible genotype given the aligned read data and a prior distribution of variation in the genome.
• Indel genotypes: Calls indel genotypes and assigns probabilities.
• Variant call output: Generates output in a VCF and a compressed genome variant call (gVCF) file.
For more information, see https://github.com/sequencing/isaac_variant_caller

Read Filtering
Input reads are filtered under the following conditions:
• Reads that failed base calling quality checks.
• Reads marked as PCR duplicates.
• Paired-end reads not marked as a proper pair.
• Reads with a mapping quality < 20.

Indel Calling
The variant caller proceeds with candidate indel discovery and generates alternate read alignments based on the candidate indels. As part of the realignment process, the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNV caller.

SNV Calling
The variant caller runs a series of filters on the set of filtered and realigned reads for SNV calling without affecting indel calls. First, any contiguous trailing sequence of N base calls is trimmed from the ends of reads. Using a mismatch density filter, reads having an unexpectedly high number of disagreements with the reference are masked, as follows:
• The variant caller treats each insertion or deletion as a single mismatch.
• Base calls with more than 2 mismatches to the reference sequence within 20 bases of the call are ignored.
• If the call occurs within the first or last 20 bases of a read, the mismatch limit is applied to a 41-base window at the corresponding end of the read.
• The mismatch limit is applied to the entire read when the read length is 41 or shorter.
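The mismatch density rules above can be read as a sliding-window test. The following Python sketch is one interpretation of those rules, not the Isaac Variant Caller's actual code. It assumes the read has already been reduced to a per-position list of mismatch flags (with each indel recorded as a single mismatch), and that the window count includes the base call itself. The function name and parameters are hypothetical.

```python
def mismatch_density_mask(mismatch_flags, max_mismatches=2, flank=20):
    """Return a per-position list of booleans: True means the base call is masked.

    mismatch_flags: list of booleans, True where the read disagrees with the
    reference (an insertion or deletion is assumed to be flagged once).
    """
    read_length = len(mismatch_flags)
    window = 2 * flank + 1  # 41 bases with the default flank of 20

    # Prefix sums let us count mismatches in any window in constant time.
    prefix = [0]
    for flag in mismatch_flags:
        prefix.append(prefix[-1] + (1 if flag else 0))

    masked = []
    for position in range(read_length):
        if read_length <= window:
            start, end = 0, read_length               # whole read
        elif position < flank:
            start, end = 0, window                    # window pinned to the read start
        elif position >= read_length - flank:
            start, end = read_length - window, read_length  # pinned to the read end
        else:
            start, end = position - flank, position + flank + 1
        masked.append(prefix[end] - prefix[start] > max_mismatches)
    return masked


# A 50-base read with three mismatches clustered at its start.
example_read = [False] * 50
example_read[0] = example_read[1] = example_read[2] = True
example_mask = mismatch_density_mask(example_read)
```

In this example, every base call whose window still contains all three clustered mismatches is masked, while calls far enough downstream are kept.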
Indel Genotypes
The variant caller filters all bases marked by the mismatch density filter and any N base calls that remain after the end-trimming step. These filtered base calls are not used for site-genotyping but appear in the filtered base call counts in the variant caller output for each site.
All remaining base calls are used for site-genotyping. The genotyping method heuristically adjusts the joint error probability that is calculated from multiple observations of the same allele on each strand of the genome. This correction accounts for the possibility of error dependencies.
This method treats the highest-quality base call from each allele and strand as an independent observation and leaves the associated base call quality scores unmodified. Quality scores for subsequent base calls for each allele and strand are then adjusted. This adjustment increases the joint error probability of the given allele above the error expected from independent base call observations.

Variant Call Output
After the SNV and indel genotyping methods are complete, the variant caller applies a final set of heuristic filters to produce the final set of calls in the output.
The output in the genome variant call (gVCF) file captures the genotype at each position and the probability that the consensus call differs from reference. This score is expressed as a Phred-scaled quality score.

RnaReadCounter
The RnaReadCounter, an internal tool, counts the number of aligned reads in an RNA-Seq sample that match each annotated gene. The RnaReadCounter method is similar to "htseq-count" in "union-mode", using a "chromsweep" algorithm.
Read counts are based on the overlapping of both reads in a pair with exons of a single gene and do not consider individual transcripts separately. Reads are not counted if they map to more than 1 genomic position or to a position with overlapping exons from more than one gene.
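The counting rules above can be illustrated with a small Python sketch. This is not the RnaReadCounter implementation: it uses a naive scan over all exons instead of a chromsweep, all names and the input layout are hypothetical, and it assumes that the aligned blocks of both mates are pooled before deciding which genes a pair overlaps (in the spirit of union mode).

```python
from collections import Counter


def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals [start, end) share at least one base."""
    return start_a < end_b and start_b < end_a


def count_read_pairs(read_pairs, exons):
    """Count read pairs per gene.

    read_pairs: iterable of (chrom, blocks, num_alignments), where blocks is a
        list of (start, end) aligned blocks pooled from both mates.
    exons: iterable of (chrom, start, end, gene).
    """
    counts = Counter()
    for chrom, blocks, num_alignments in read_pairs:
        if num_alignments != 1:
            continue  # maps to more than 1 genomic position: not counted
        genes = {
            gene
            for exon_chrom, exon_start, exon_end, gene in exons
            if exon_chrom == chrom
            for start, end in blocks
            if overlaps(start, end, exon_start, exon_end)
        }
        if len(genes) == 1:  # exons of exactly one gene: counted
            counts[genes.pop()] += 1
    return counts


EXONS = [
    ("chr1", 100, 200, "A"),
    ("chr1", 300, 400, "A"),
    ("chr1", 350, 450, "B"),
    ("chr2", 0, 100, "C"),
]
READ_PAIRS = [
    ("chr1", [(110, 160), (310, 340)], 1),  # both mates in exons of gene A
    ("chr1", [(360, 390), (120, 150)], 1),  # touches exons of A and B: ambiguous
    ("chr2", [(10, 60), (40, 90)], 2),      # multi-mapped: skipped
    ("chr2", [(10, 60), (40, 90)], 1),      # unique, gene C
    ("chr1", [(500, 550), (600, 650)], 1),  # no annotated exon
]
example_counts = count_read_pairs(READ_PAIRS, EXONS)
```

A production tool would sort reads and exons by coordinate and sweep them together, which is what the chromsweep approach avoids recomputing; the naive scan here only shows the decision rules.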
Analysis Output

1 Navigate to the BaseSpace site.
2 To view the results, click the Projects tab, then the project name, and then the analysis.

Figure 2 RNA-Seq Alignment Output Navigation Bar

Use the left navigation bar to access the following analysis output:
• Analysis Info: Information about the analysis session, including log files.
• Inputs: Overview of input settings.
• Output Files: Output files for the samples.
• Analysis Reports
  • Summary: Analysis metrics for the aggregate results.
  • Sample Analysis: Analysis reports for each sample.

Analysis Info
The Analysis Info page displays the analysis settings and execution details.

Row Heading | Definition
Name | Name of the analysis session.
Application | App that generated this analysis.
Date Started | Date and time the analysis session started.
Date Completed | Date and time the analysis session completed.
Duration | Duration of the analysis.
Session Type | Multi-Node or Single-Node
Status | Status of the analysis session. The status shows either Running or Complete and the number of nodes used.

Log Files
File Name | Description
CompletedJobInfo.xml | Contains information about the completed analysis session.
Logging.zip | Contains all detailed log files for each step of the workflow.
SampleSheet.csv | Sample sheet.
SampleSheetUsed.csv | A copy of the sample sheet.
WorkflowError.txt | Contains error messages created when running the workflow.

Output Files
The Output Files page provides access to the output files for each sample analysis.
• BAM Files
• VCF Files
• gVCF Files
• FPKM Files
• Coverage.bedGraph.gz Files
• Coverage.bw Files
• Junctions.bed Files

BAM File Format
A BAM file (*.bam) is the compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb. SAM and BAM formats are described in detail at https://samtools.github.io/hts-specs/SAMv1.pdf.
BAM files use the file naming format of SampleName_S#.bam, where # is the sample number determined by the order that samples are listed for the run. In multi-node mode, the S# is set to S1, regardless of the order of the sample.
BAM files contain a header section and an alignment section:
• Header: Contains information about the entire file, such as sample name, sample length, and alignment method. Alignments in the alignments section are associated with specific information in the header section.
• Alignments: Contains read name, read sequence, read quality, alignment information, and custom tags. The read name includes the chromosome, start coordinate, alignment quality, and the match descriptor string.
The alignments section includes the following information for each read or read pair:
• RG: Read group, which indicates the number of reads for a specific sample.
• BC: Barcode tag, which indicates the demultiplexed sample ID associated with the read.
• SM: Single-end alignment quality.
• AS: Paired-end alignment quality.
• NM: Edit distance tag, which records the Levenshtein distance between the read and the reference.
• XN: Amplicon name tag, which records the amplicon tile ID associated with the read.
BAM index files (*.bam.bai) provide an index of the corresponding BAM file.

VCF File Format
Variant Call Format (VCF) is a widely used file format developed by the genomics scientific community that contains information about variants found at specific positions in a reference genome.
VCF files use the file naming format SampleName_S#.vcf, where # is the sample number determined by the order that samples are listed for the run.
VCF File Header: Includes the VCF file format version and the variant caller version. The header lists the annotations used in the remainder of the file. If MARS is listed, the Illumina internal annotation algorithm annotated the VCF file. The VCF header includes the reference genome file and BAM file.
The last line in the header contains the column headings for the data lines.
VCF File Data Lines: Each data line contains information about a single variant.

VCF File Headings
Heading | Description
CHROM | The chromosome of the reference genome. Chromosomes appear in the same order as the reference FASTA file.
POS | The single-base position of the variant in the reference chromosome. For SNPs, this position is the reference base with the variant; for indels or deletions, this position is the reference base immediately before the variant.
ID | The rs number for the SNP obtained from dbSNP.txt, if applicable. If there are multiple rs numbers at this location, the list is semicolon delimited. If no dbSNP entry exists at this position, a missing value marker ('.') is used.
REF | The reference genotype. For example, a deletion of a single T is represented as reference TT and alternate T. An A to T single nucleotide variant is represented as reference A and alternate T.
ALT | The alleles that differ from the reference read. For example, an insertion of a single T is represented as reference A and alternate AT. An A to T single nucleotide variant is represented as reference A and alternate T.
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant and lower probability of errors. For a quality score of Q, the estimated probability of an error is 10^(-Q/10). For example, the set of Q30 calls has a 0.1% error rate. Many variant callers assign quality scores based on their statistical models, and these scores are high in relation to the error rate observed.

VCF File Annotations
Heading | Description
FILTER | If all filters are passed, PASS is written in the filter column.
  • LowDP: Applied to sites with depth of coverage below a cutoff.
  • LowGQ: The genotyping quality (GQ) is below a cutoff.
  • LowQual: The variant quality (QUAL) is below a cutoff.
  • LowVariantFreq: The variant frequency is less than the given threshold.
  • R8: For an indel, the number of adjacent repeats (1-base or 2-base) in the reference is greater than 8.
  • SB: The strand bias is more than the given threshold. Used with the Somatic Variant Caller and GATK.
INFO | Possible entries in the INFO column include:
  • AC: Allele count in genotypes for each ALT allele, in the same order as listed.
  • AF: Allele Frequency for each ALT allele, in the same order as listed.
  • AN: The total number of alleles in called genotypes.
  • CD: A flag indicating that the SNP occurs within the coding region of at least 1 RefGene entry.
  • DP: The depth (number of base calls aligned to a position and used in variant calling).
  • Exon: A comma-separated list of exon regions read from RefGene.
  • FC: Functional Consequence.
  • GI: A comma-separated list of gene IDs read from RefGene.
  • QD: Variant Confidence/Quality by Depth.
  • TI: A comma-separated list of transcript IDs read from RefGene.
FORMAT | The format column lists fields separated by colons. For example, GT:GQ. The list of fields provided depends on the variant caller used. Available fields include:
  • AD: Entry of the form X,Y, where X is the number of reference calls, and Y is the number of alternate calls.
  • DP: Approximate read depth; reads with MQ=255 or with bad mates are filtered.
  • GQ: Genotype quality.
  • GQX: Genotype quality. GQX is the minimum of the GQ value and the QUAL column. In general, these values are similar; taking the minimum makes GQX the more conservative measure of genotype quality.
  • GT: Genotype. 0 corresponds to the reference base, 1 corresponds to the first entry in the ALT column, and so on. The forward slash (/) indicates that no phasing information is available.
  • NL: Noise level; an estimate of base calling noise at this position.
  • PL: Normalized, Phred-scaled likelihoods for genotypes.
  • SB: Strand bias at this position. Larger negative values indicate less bias; values near 0 indicate more bias. Used with the Somatic Variant Caller and GATK.
  • VF: Variant frequency; the percentage of reads supporting the alternate allele.
SAMPLE | The sample column gives the values specified in the FORMAT column.

Genome VCF Files
Genome VCF (gVCF) files are VCF v4.1 files that follow a set of conventions for representing all sites within the genome in a reasonably compact format. The gVCF files include all sites within the region of interest in a single file for each sample.
The gVCF file shows no-calls at positions with low coverage, or where a low-frequency variant (< 3%) occurs often enough (> 1%) that the position cannot be called to the reference. A genotype (GT) tag of ./. indicates a no-call.
For more information, see sites.google.com/site/gvcftools/home/about-gvcf.

FPKM Files
Fragments Per Kilobase of sequence per Million mapped reads (FPKM) normalizes the number of aligned reads by the size of the sequence feature and the total number of mapped reads.
In each output directory, this app creates the following output files:
• genes.fpkm_tracking: Quantifies the expression of genes specified in the GTF annotation file.
• isoforms.fpkm_tracking: Quantifies the expression of transcripts specified in the GTF annotation file.

Analysis Reports
The RNA-Seq Alignment App provides an aggregate summary for all the samples and a summary of statistics per sample.

Summary Analysis Report
The RNA-Seq Alignment App provides an aggregate summary for all the samples.

Table 6 Summary Table
Statistic | Definition
Sample ID | The sample ID.
Read Length | The length of reads.
Number of Reads | The total number of reads passing filter for this sample.
% Total Aligned | Percentage of reads passing filter that aligned to the reference, including abundant reads.
% Abundant | Percentage of reads that align to abundant transcripts, such as mitochondrial and ribosomal sequences.
% Unaligned | Percentage of reads that do not align to the reference.
Median CV Coverage Uniformity The median coefficient of variation of coverage of the 1000 most highly expressed transcript, as reported by the CollectRnaSeqMetrics utility from Picard tools. Ideal value = 0. % Stranded Percentage of reads that are stranded. Summary Plots } To save plot as scalable vector graphics (SVG), click Save Plot as SVG. RNA-Seq Alignment v1.0 App Guide 19 Analysis Output Genome VCF Files } To export data from plot as comma-separated values (CSV), click Export Data as CSV. Table 7 Summary Plots Table Plot Name Description Insert Length Distribution The plot summarizes the insert length distribution for pairedend reads. The 3 vertical lines of the box represent the quartiles and the whiskers represent the 5th and 95th percentiles. The insert length for this box plot is capped at 600 bp. Alignment Distribution If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the percentage for the color-coded genomic regions, which include coding, UTR, intron, and non-targeted or intergenic. The app reports "non-targeted" if a panel is selected. Transcript Coverage If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the transcript coverage position as reported by the CollectRnaSeqMetrics utility from Picard tools. A vertical bar shows the relative coverage at the position in each row. The numbers between 0 and 100 represents the normalized position along a transcript. ERCC Spike-Ins The ERCC spike-ins analysis summary is available when an ERCC mix is selected. Table 8 ERCC Spike-Ins Table Statistic Definition Total Spike-In Reads (% Reads) Percentage of reads passing filter that aligned to the ERCC sequences. Pearson Correlation Pearson correlation of log RNA FPKM and log spike-in molar concentration. Spearman Correlation Spearman correlation of log RNA FPKM and log spike-in molar concentration. 
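The two correlation statistics in the ERCC Spike-Ins table can be approximated outside the app with a short script. The following is a minimal sketch, not the app's implementation. The function names, the base-2 logarithm, the choice to skip spike-ins with zero FPKM or zero concentration, and the sample values are all assumptions made for illustration. Both correlations are unaffected by the choice of logarithm base.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def ercc_correlations(fpkm, concentration):
    """Return (Pearson, Spearman) of log FPKM versus log concentration.

    Assumption: pairs with a non-positive FPKM or concentration are
    skipped, because their logarithm is undefined.
    """
    pairs = [
        (math.log2(f), math.log2(c))
        for f, c in zip(fpkm, concentration)
        if f > 0 and c > 0
    ]
    log_fpkm = [p[0] for p in pairs]
    log_conc = [p[1] for p in pairs]
    return pearson(log_fpkm, log_conc), spearman(log_fpkm, log_conc)


# Hypothetical spike-in values, for illustration only.
example_concentration = [0.5, 2.0, 8.0, 32.0, 128.0]
example_fpkm = [1.0, 4.0, 16.0, 64.0, 256.0]
r, rho = ercc_correlations(example_fpkm, example_concentration)
print(f"Pearson: {r:.3f}  Spearman: {rho:.3f}")
```

Spearman is computed here as the Pearson correlation of average ranks, which handles tied FPKM values without a separate formula. In practice, the FPKM values would come from the app's FPKM output and the concentrations from the documentation of the selected ERCC mix.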
Sample Analysis Reports

The RNA-Seq Alignment App provides an overview of statistics per sample in the Analysis Reports section. To download the statistics, click PDF Summary Report.

Primary Analysis Information

Table 9 Primary Analysis Information Table
Statistic | Definition
Read Length | Length of reads.
Number of Reads | Total number of reads passing filter for this sample.
Bases (GB) | The total number of bases for this sample.
Q30 Bases (GB) | The total number of bases with a quality score of 30 or higher.

Insert Information

Table 10 Insert Information Table
Statistic | Definition
Insert Length Median | Median length of a sequenced fragment. The fragment length is calculated based on the locations at which a read pair aligns to the reference. The read mapping information is parsed from the BAM files.
Insert Length S.D. | Standard deviation of the sequenced fragment length.
Duplicates (% Reads) | Percentage of paired reads that have duplicates, from a subsampled set of 4 million reads or from the total number of reads when there are fewer than 4 million reads.

Alignment Summary

Table 11 Alignment Quality Table
Statistic | Definition
Total Aligned Reads (% Reads) | Percentage of reads passing filter that aligned to the reference.
Abundant Reads (% Reads) | Percentage of reads that align to abundant transcripts, such as mitochondrial and ribosomal sequences.
Unaligned Reads (% Reads) | Percentage of reads passing filter that are not aligned to the reference.
Reads with spliced alignment (% Aligned Reads) | Percentage of aligned reads with a spliced alignment.
Reads aligned at multiple loci (% Aligned Reads) | Percentage of aligned reads that align to multiple loci.

If a panel is selected, the app calculates the alignment information based on the on-target genes.

Table 12 Alignment Information Table
Statistic | Definition
Coding | Metrics based on coding bases.
UTR | Metrics based on bases in untranslated regions (UTR).
Intron | Metrics based on bases in introns.
Intergenic or Nontargeted | Metrics based on bases in intergenic or non-targeted regions. The app reports "non-targeted" if a panel is selected.

Coverage Summary

If a panel is selected, the app calculates the coverage information based on the on-target genes.

Table 13 Coverage Uniformity Information Table
Statistic | Definition
Median CV | The median coefficient of variation of coverage of the 1000 most highly expressed transcripts, as reported by the CollectRnaSeqMetrics utility from Picard tools. Ideal value = 0.
Median 3' | The median uniformity of coverage of the 1000 most highly expressed transcripts at the 3' end, as reported by the CollectRnaSeqMetrics utility from Picard tools. 3' bias is calculated per transcript as the mean coverage of the 3' most 100 bases divided by the mean coverage of the whole transcript.
Median 5' | The median uniformity of coverage of the 1000 most highly expressed transcripts at the 5' end, as reported by the CollectRnaSeqMetrics utility from Picard tools. 5' bias is calculated per transcript as the mean coverage of the 5' most 100 bases divided by the mean coverage of the whole transcript.
Reads aligned to correct strand | Percentage of reads that align to the correct strand, as reported by the CollectRnaSeqMetrics utility from Picard tools.

If a panel is selected, the app calculates the gene-level coverage information based on the on-target genes.

Table 14 Gene-Level Coverage Table
Coverage | Number of Genes
1X, 10X, 30X, 100X | Number of genes covered at the mean base coverage level or deeper.

Variants Summary

Table 15 Variant Calls Table
Statistic | Definition
Homozygous reference | Number of homozygous reference calls.
Heterozygous | Number of heterozygous variant calls.
Homozygous variant | Number of homozygous variant calls.
SNV | Total number of single nucleotide variants (SNVs) detected for the sample.
Indel | The number of indels detected for the sample.
Tn/Tv | The number of transition SNVs that pass the quality filters divided by the number of transversion SNVs that pass the quality filters. Transitions are interchanges of purines (A, G) or of pyrimidines (C, T). Transversions are interchanges of purine and pyrimidine bases (for example, A to T).

Sample Plots

• To save a plot as scalable vector graphics (SVG), click Save Plot as SVG.
• To export data from a plot as comma-separated values (CSV), click Export Data as CSV.

Table 16 Sample Plots Table
Plot Name | Description
Insert Length Distribution | The diagram summarizes the insert length distribution for paired-end reads. The insert length for this diagram is capped at 600 bp.
Alignment Distribution | If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the percentage for the color-coded genomic regions, which include coding, UTR, intron, and non-targeted or intergenic. The app reports "non-targeted" when a panel is selected.
Transcript Coverage | If a panel is selected, the app calculates the alignment information based on the on-target genes. The plot shows the transcript coverage position as reported by the CollectRnaSeqMetrics utility from Picard tools. A vertical bar shows the relative coverage at the position in each row. The numbers between 0 and 100 represent the normalized position along a transcript.
ERCC Spike-Ins | The plot shows the log RNA FPKM versus log spike-in molar concentration. Each dot corresponds to a gene. The least squares method calculates the fitted red line.

Small Variants Summary

The small variants analysis summary is available when you select a panel.

Table 17 Small Variants Summary Table
Statistic | Definition
Gene | The genes harboring the variant.
Chr | The chromosome name.
Position | The position on the chromosome.
Depth | The read depth at this position.
Ref | The reference allele.
Alt | The alternative allele.
Alt Freq | The observed alternate allele frequency.
Variant Type | The specific variant type. For more information, see http://uswest.ensembl.org/info/genome/variation/predicted_data.html#consequences.
dbSNP | The dbSNP ID.
COSMIC | Catalog of somatic mutations in cancer. Because of SNP filtering, some entries are not available on the COSMIC website. For more information, see http://cancer.sanger.ac.uk/cosmic/analyses.
ClinVar | ClinVar annotation.

Fusion Calls

The fusion calls analysis summary is available when fusion calling is turned on. The app highlights the on-target genes when a panel is selected.

Table 18 Fusion Calls Table
Statistic | Definition
Gene1 | The gene on the 5' end. The gene is highlighted when it is on target.
Chr1 | The chromosome of gene 1.
Pos1 | The position of gene 1.
Str1 | The strand of gene 1.
Gene2 | The gene on the 3' end. The gene is highlighted when it is on target.
Chr2 | The chromosome of gene 2.
Pos2 | The position of gene 2.
Str2 | The strand of gene 2.
Paired Read | The number of read pairs in which one read aligns to the left gene and the other read aligns to the right gene.
Split Read | The number of read pairs in which one of the reads spans the junction.

Technical Assistance

For technical assistance, contact Illumina Technical Support.
Table 19 Illumina General Contact Information
Website | www.illumina.com
Email | [email protected]

Table 20 Illumina Customer Support Telephone Numbers
Region | Contact Number
North America | 1.800.809.4566
Australia | 1.800.775.688
Austria | 0800.296575
Belgium | 0800.81102
China | 400.635.9898
Denmark | 80882346
Finland | 0800.918363
France | 0800.911850
Germany | 0800.180.8994
Hong Kong | 800960230
Ireland | 1.800.812949
Italy | 800.874909
Japan | 0800.111.5011
Netherlands | 0800.0223859
New Zealand | 0800.451.650
Norway | 800.16836
Singapore | 1.800.579.2745
Spain | 900.812168
Sweden | 020790181
Switzerland | 0800.563118
Taiwan | 00806651752
United Kingdom | 0800.917.0041
Other countries | +44.1799.534000

Safety data sheets (SDSs): Available on the Illumina website at support.illumina.com/sds.html.
Product documentation: Available for download in PDF from the Illumina website. Go to support.illumina.com, select a product, then select Documentation & Literature.

Illumina
5200 Illumina Way
San Diego, California 92122 U.S.A.
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]
www.illumina.com