Download Documentation - Broad Institute
RC454 USERS GUIDE

General Description

RC454 is a tool to clean 454 reads based on their alignment to a reference consensus assembly. The correction process is aggressive, and as such requires an assembly that is highly representative of the population represented by the read data. It is highly recommended that the user use a de novo assembly generated from the read set being corrected. The reference should also not contain homopolymer indels or other frame-shifts (unless these are truly present in the majority of the analyzed biological population).

This package includes the following scripts. A more detailed description of each, along with its options, follows below:

1) RC454. RC454 is a program that takes a set of 454 read and quality files as well as a consensus assembly for those reads and corrects for known 454 error modes such as homopolymer indels and carry forward/incomplete extension (CAFIE). It will also correct any indel that breaks the reading frame, unless it occurs in more than 25% of the reads (this option can be turned off). If an optional list of sequencing primer positions is supplied, it will trim all reads that start within these positions so that they start after the primer. RC454 uses Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik) to align the corrected reads between each step, so Mosaik is required to run the script.

2) runMosaik2 and the .qlx format (including samToQlx.pl and qlxToSam.pl). runMosaik2 is a Perl utility script that helps run Mosaik with the various options required by RC454 and converts the output to the .qlx format, which is used by V-Phaser and V-Profiler. The .qlx file format is described in more detail in that section. RC454 will also output its results in .sam format.

3) convertPos. This is a Perl utility script that helps convert base positions (such as the start and end of genes) in a particular consensus sequence to their positions in another.

4) configPaths.
This script is meant to modify the paths in the other scripts to make sure that everything can be found on your computer.

Citing RC454

Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, et al. (2012) Whole Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor Variants Upon Immune Recognition During Acute Infection. PLoS Pathogens 8(3): e1002529.

Required External Software

1) MUSCLE v3.8. The MUSCLE aligner (http://www.drive5.com/muscle/) is used by the other scripts when global alignments are required. Using a different version of MUSCLE might require modifying command lines in the scripts, but this should not be a major endeavor.

2) Mosaik. Mosaik is a reference-guided assembler and short-read aligner using a Smith-Waterman based algorithm. The read alignment process is used by RC454. It can be found at http://bioinformatics.bc.edu/marthlab/Mosaik. The version used and tested extensively with these tools is 2.1.33. If you still want to use the runMosaik.pl script from the previous release, the version used with it was 1.1.0013.

Suggested External Software

1) AV454. AV454 is a module of the Arachne assembler for assembling viral genomes. It is currently one of the best assemblers available to handle diverse viral genomes with 454 data. The Arachne assembler can be obtained at http://www.broadinstitute.org/crd/wiki/index.php/Arachne_Main_Page. All the documentation on how to install and run the assembler can be found there too.

RC454 is designed to run on an assembly of the reads, not an external/reference sequence. Another assembler of your choice could be used; no script in the package directly requires AV454 to function. AV454 is the assembler that gave us the best results while developing the software. It is preferable to check whether the assembly requires manual curation before running the rest of the tool-kit.
Script Usage

1) RC454:

RC454 is a program that takes a set of 454 read and quality files, a consensus assembly for those reads, and their alignment to the assembly in .qlx format (see the runMosaik2.pl section), and corrects for known 454 error modes such as homopolymer indels and carry forward/incomplete extension (CAFIE). It will also correct any indel that breaks the reading frame, unless it occurs in more than 25% of the reads (this option can be turned off). If you supply a list of primers and their positions, it will also trim each read that starts within these positions so that it starts after the primer. RC454 uses Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik) to align the corrected reads between each step, so Mosaik is required to run the script.

RC454 corrects reads aggressively. It is designed to be conservative (not allowing suspicious variants in) rather than to capture the maximum number of variants possible. As such, it is important to align your reads against an assembly built from those reads instead of a separate reference genome, because misalignments can have a significant effect on the result.

NOTE: The previous version of rc454.pl, which uses runMosaik.pl and version 1.1 of Mosaik, has been renamed rc454_mosaik1.pl.

Command Line:

The basic command line for running RC454 is:

rc454.pl <inputalign.qlx> <reads.fasta> <reads.qual> <assembly.fasta> <output>

Options:

Several switches can be used (shown with their default values):

minhomosize => 2
    Minimum size of a homopolymer to be considered as such in the homopolymer correction step. Default is 2, which means that a case like:
        CAAT
        CA-T
    would get corrected.

nqvalue => 1
    Q-Score given in the quality file to an N that gets added by the script. Default is 1. The important part here is that all bases that have the nqvalue will be ignored when looking at the neighborhood of a base in the NQS filter.
    This is to prevent an inserted N from 'flagging down' all adjacent bases. The score 1 is chosen by default because usually no base gets assigned a score this low.

nocafie => 0
    If the -nocafie flag is turned on, the CAFIE correction step is skipped and CAFIE errors will not be corrected.

nqsmainqual => 20
    Minimum quality required for a base to pass the NQS filter.

nqsareaqual => 15
    Minimum quality required for the neighborhood bases to pass NQS.

nqssize => 5
    Size of the neighborhood on each side of the central base that is considered for the NQS filter.

details => 0
    If the -details switch is on, all intermediate files are kept. Mostly valuable for debugging purposes or if you want to see which variants were corrected by each step. The details switch also creates a data dump of all changes that the script made to the reads.

slicesize => 5
    Parameter used as part of the algorithm to determine how many times a variant is seen. It is the number of bases on each side of the variant that are examined on the read to see whether the sequence remains the same.

gap3window => 10
    Size of the window used to determine whether a gap can form a multiple of 3 when joined with others. The window is the number of aligned bases (including gaps) that will be looked at on each side of the central gap being analyzed.

primers => ''
    A file can be given containing the primer positions (in the same format as a gene list). If this option is activated, every read that starts inside a primer will be trimmed to remove the primer sequence. This is to prevent bias that could be caused by sequencing the primer sequence.

primerbuffer => 2
    When specifying a primer file, the primer buffer is added to the positions of the primer so that any read starting within the primer position OR the primer position +/- the primer buffer will be trimmed. This compensates for occasional misalignments or homopolymer miscalls at the end of the primer sequence that could modify its position.
minpctvar => 0.25
    Minimum fraction of the reads (0.25 = 25%) that must have a variant (including indels) in order for this particular variant to be ignored when correcting for non-homopolymer errors. The number of times you see the variant must exceed both the minpctvar and minnbvar values in order to pass correction (if you want to ignore one or the other, simply set its value to 0).

minnbvar => 5
    Minimum number of reads that must have a variant (including indels) in order for this particular variant to be ignored when correcting for non-homopolymer errors. The number of times you see the variant must exceed both the minpctvar and minnbvar values in order to pass correction (if you want to ignore one or the other, simply set its value to 0).

genelist => ''
    File that contains the list of genes in the current genome. If you supply a gene list, the corrections that prevent gaps from breaking the reading frame will only be applied within the genes listed and will ignore the rest of the genome.

noorf => ''
    Setting the noorf flag turns off the third step of the read clean-up entirely, which means that only clear CAFIE and homopolymer mistakes will be corrected; other gaps breaking the reading frame will be ignored.

bam => ''
    Setting the bam flag will return .sam and .bam output files for the read alignment; otherwise those will be removed at the end of the run.

noclean => ''
    Setting the noclean flag will delete the final cleaned reads and quals files upon termination of the program, only keeping the cleaned read alignment.

Output Files:

RC454 has 3 main output files:

a) <output>_final.qlx: This is the final .qlx file with all the cleaned reads aligned.
b) <output>_cleaned.fasta: This is the final cleaned read set in fasta format.

c) <output>_cleaned.qual: This is the final cleaned qual set matching the reads.

If you use the -details option, you will also get a list of all rejected reads and quals, and a set of fasta, qual and qlx files for each step, named _ie.qlx/fasta/qual after the CAFIE correction step and _homo.qlx/fasta/qual after the homopolymer correction step. A file named _changes.txt will also be created, containing a list of all corrections made to each read over the course of the program.

2) runMosaik2 and the .qlx file format:

runMosaik2.pl is a utility script that serves 2 purposes. The first is to act as a wrapper script for MosaikBuild and MosaikAligner. The second, if the -qlx parameter is set, is to call the samToQlx.pl script to generate, from the .sam file output by Mosaik, the .qlx file format that is used by RC454, V-Phaser and V-Profiler.

The .qlx file format:

The .qlx file format is a read alignment format based on .axt that includes quality information for each base as well as a quality flag to indicate whether a base passed the Neighborhood Quality Score (NQS) criteria, which is used by V-Phaser to call trusted variants. The NQS is calculated for each base of a read depending on its quality and the quality of the bases adjacent to it. The base must have a quality score of at least q and its n adjacent bases on each side must have a quality score of at least q'. By default those values are q = 20, n = 5 and q' = 15.

The format is the following:

>Read [ReadID] [read start] [read stop] [read length] [strand] [reference name] [reference start] [reference stop]
Reference Sequence
Read Sequence
NQS String
Quality String (in ASCII format, the ASCII character being Quality Score + 33)

Note that the reference (assembly) here is whatever you aligned against. Each line above corresponds to one line in the .qlx file (i.e., reads contain no newline breaks within their sequence).
The fields of the header line are separated by single spaces.

Two scripts included in the package, samToQlx.pl and qlxToSam.pl, convert .qlx files to or from the .sam format (samtools can then be used to convert to/from bam). Note that converting a .qlx to sam and then back will result in a new .qlx identical to the original, but the same is not true for sam to .qlx and back to sam, as the sam format contains data not stored in the .qlx. You can use them with:

perl qlxToSam.pl <qlxinput.qlx> <reference.fasta> <samoutput.sam> -readfa <reads.fasta> -readq <reads.qual>
perl samToQlx.pl <saminput.sam> <reference.fasta> <qlxoutput.qlx>

Command Line:

runMosaik2 can take multiple input formats. It can work either with reads in fasta and qual formats (often returned by 454 sequencing), paired or not, or with reads in fastq format (often returned by Illumina sequencing), paired or not. In the command line, include either -fa/-qual, -fa/-fa2/-qual/-qual2, -fq or -fq/-fq2.

The main command line for runMosaik2.pl is:

perl runMosaik2.pl -fa <reads.fasta> -qual <reads.qual> -ref <assembly/reference.fasta> -o <output>

To use with 454 data and RC454, it is highly recommended to use the -param454 and -qlx parameters.
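To make the .qlx layout and the NQS rule described in this section concrete, here is a minimal Python sketch. It is not part of the RC454 package and is not how samToQlx.pl is implemented. The names `QlxRecord`, `parse_qlx` and `nqs_pass` are hypothetical. It assumes that each record is exactly five lines, that the header may carry a literal `Read` token before the read ID, and that bases near a read end are judged only on the neighbors that exist; the guide does not specify the last two points.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class QlxRecord:
    read_id: str
    read_start: int
    read_stop: int
    read_length: int
    strand: str
    ref_name: str
    ref_start: int
    ref_stop: int
    ref_seq: str
    read_seq: str
    nqs: str          # NQS string kept verbatim; its alphabet is not assumed here
    quals: List[int]  # decoded as ASCII code minus 33


def parse_qlx(lines: Iterable[str]) -> Iterator[QlxRecord]:
    """Yield one record per five-line block: header, reference, read, NQS, quality."""
    rows = [ln.rstrip("\r\n") for ln in lines]
    i = 0
    while i < len(rows):
        if not rows[i].startswith(">"):
            i += 1  # skip blank or unexpected lines between records
            continue
        if i + 4 >= len(rows):
            raise ValueError(f"truncated record at line {i + 1}")
        fields = rows[i][1:].split(" ")  # header fields are single-space separated
        # Assumption: a literal "Read" token may precede the eight documented fields.
        if len(fields) == 9 and fields[0] == "Read":
            fields = fields[1:]
        if len(fields) != 8:
            raise ValueError(f"unexpected header at line {i + 1}: {rows[i]!r}")
        read_id, r_start, r_stop, r_len, strand, ref_name, f_start, f_stop = fields
        yield QlxRecord(
            read_id, int(r_start), int(r_stop), int(r_len), strand,
            ref_name, int(f_start), int(f_stop),
            ref_seq=rows[i + 1], read_seq=rows[i + 2], nqs=rows[i + 3],
            quals=[ord(c) - 33 for c in rows[i + 4]],
        )
        i += 5  # jump past the whole record so quality lines are never read as headers


def nqs_pass(quals: List[int], q: int = 20, n: int = 5,
             q_adj: int = 15, nqvalue: int = 1) -> List[bool]:
    """Per-base NQS flag: base >= q and up to n neighbors per side >= q_adj.

    Neighbors whose score equals nqvalue (inserted N bases) are ignored,
    as the guide describes for the nqvalue option.
    """
    flags = []
    for i, score in enumerate(quals):
        neighbors = quals[max(0, i - n):i] + quals[i + 1:i + 1 + n]
        ok = score >= q and all(s >= q_adj for s in neighbors if s != nqvalue)
        flags.append(ok)
    return flags
```

A typical use would be `for rec in parse_qlx(open("sample.qlx")): ...`, where `sample.qlx` is a placeholder file name. Flags recomputed with `nqs_pass` may differ from the NQS string stored in a real file if the package treats read ends or gap positions differently from this sketch.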
Input File:

-fa <reads.fasta> : reads in fasta format
-qual <reads.qual> : read quality scores in space-delimited integer format

or

-fa <reads.fasta> : first mates of paired reads in fasta format
-qual <reads.qual> : quality scores of the first mates in space-delimited integer format
-fa2 <reads2.fasta> : second mates of paired reads in fasta format
-qual2 <reads2.qual> : quality scores of the second mates in space-delimited integer format

or

-fq <reads.fastq> : reads in fastq format

or

-fq <reads.fastq> : first mates of paired reads in fastq format
-fq2 <reads2.fastq> : second mates of paired reads in fastq format

-ref <assembly/reference.fasta> : assembly or reference that you want to align against, in fasta format

Options:

hs => 10
    Hash size for Mosaik alignment. See the Mosaik documentation for more details.

act => 15
    Alignment candidate threshold for Mosaik alignment. See the Mosaik documentation for more details.

mmp => 0.25
    Maximum fraction of the read length that can be errors.

minp => 0.25
    Minimum fraction of a read that has to be aligned to keep it.

ms => 10
    Match score of the Smith-Waterman algorithm.

mms => -9
    Mismatch penalty of the Smith-Waterman algorithm.

hgop => 20
    Penalty for opening a gap in a homopolymer for the SW algorithm.

gop => 40
    Gap opening penalty for the SW algorithm.

gep => 10
    Gap extension penalty for the SW algorithm.

nqsmq => 20
    Minimum quality required for a base to pass the NQS filter.

nqsaq => 15
    Minimum quality required for the neighborhood bases to pass NQS.

nqssize => 5
    Size of the neighborhood on each side of the central base that is considered for the NQS filter.

bw => 29
    Uses the banded Smith-Waterman algorithm. This greatly increases alignment speed, but seems to slightly reduce accuracy in highly diverse samples.

nqvalue => 1
    Q-Score given in the quality file to an N that gets added by the script. Default is 1.
    The important part here is that all bases that have the nqvalue will be ignored when looking at the neighborhood of a base in the NQS filter. This is to prevent an inserted N from 'flagging down' all adjacent bases. The score 1 is chosen by default because usually no base gets assigned a score this low.

m => 'unique'
    Only keeps uniquely aligned reads.

st => 'illumina'
    Sequencing technology. See the Mosaik manual for more details.

qlx => ''
    Generates the qlx format. Requires samToQlx.pl.

qlxonly => 0
    Only keeps the .qlx output and not the .bam or .sam.

fakequals => 0
    If set to a positive integer, will fake the quality of all nucleotides to the given value. This greatly speeds up the creation of the .qlx file if you do not care about the quality scores.

mfl => 600
    Median fragment length. Only necessary for paired reads.

param454 => ''
    Sets the following parameters to values that are suitable for 454: -gop <15> -hgop <4> -gep <6.66> -st <454>

paramillu => ''
    Sets the following parameters to values that are suitable for Illumina: -gop <40> -hgop <20> -gep <10>

annpe => ''
    Network file for paired-end reads. This file is included in the Mosaik distribution. It should be located in the networkFile folder. For Mosaik 2.1.26, this file is named: 2.1.26.pe.100.0065.ann

annse => ''
    Network file for single-end reads. This file is included in the Mosaik distribution. It should be located in the networkFile folder. For Mosaik 2.1.26, this file is named: 2.1.26.se.100.005.ann

Outputs:

<output>.qlx : .qlx alignment file
<output>.bam : bam format output file
<output>.sam : sam format output file

3) configPaths:

configPaths.pl is more of an installer than anything else. It will modify the hardcoded paths in all the scripts (vphaser.pl, vprofiler.pl, convertPos.pl, rc454.pl) so that they can find the required programs on your system.

Command Line:

perl configPaths.pl <configfile.txt>

Input File:

The input file contains the list of paths for each variable.
scriptpath = '<scriptpath>'
mosaikpath = '<mosaikpath>'
mosaik1path = '<mosaik1path>'
musclepath = '<musclepath>'
perlpath = '<perlpath>'
Rpath = '<Rpath>'
samtoolspath = '<samtoolspath>'
mosaiknetworkpath = '<mosaiknetworkpath>'

All the scripts in this package should be located in scriptpath.

mosaikpath is the path for Mosaik v2 for alignment with RC454 (not necessary if using version 1).

mosaik1path is the path for Mosaik v1 for alignment with RC454 (not necessary if using version 2).

musclepath is the path where you have muscle installed. If you are not sure that your version of muscle will be compatible with the scripts (they were developed on v3.8), you can download v3.8 from their website at http://www.drive5.com/muscle/downloads.htm

perlpath is the path where perl is located. Note that you may also need to change permissions on the scripts if you want to use #! execution instead of invoking perl explicitly.

Rpath is the path for R 2.9 or higher, used to create the heatmaps.

samtoolspath is the path for Samtools, used to manipulate the bam and sam file formats.

mosaiknetworkpath is the path for Mosaik's network files for alignments from contigMerger.pl and contigs2assembly.pl. This should be the networkFile folder of your Mosaik distribution.

Note that if the paths are already in your environment variables, so that you can run programs such as R or muscle from the command line without specifying any path, you do not need to specify a path here either. The only one that absolutely requires a path is scriptpath.

Output File:

There is no output file for configPaths. It will modify the paths in the various scripts in this package.

4) convertPos:

convertPos.pl is a script that takes gene (or any feature) positions from a particular genome assembly and returns what those positions are in another genome that gets aligned to the first one.
It is meant to be used to quickly generate files containing the gene or haplotype positions for V-Profiler or RC454 when you have already determined them for one genome.

Command Line:

perl convertPos.pl <genePositionsInput.txt> <reference.fasta> <targetgenome.fasta> <genePositionsOutput.txt>

Input File:

The <genePositionsInput.txt> input file contains the name of each gene (or feature) and its start and end positions in the genome, separated by tabs, such as:

Gene1 <tab> StartPos1 <tab> EndPos1
Gene2 <tab> StartPos2 <tab> EndPos2
...

The reference file is a fasta of the genome to which the positions in genePositionsInput.txt refer; the target genome is aligned against it. The targetgenome file is a fasta of the genome for which you want the gene positions. The genePositionsOutput.txt is the output file.

Output Files:

The script will generate a file identical to the input file except for the modified positions. Note that it is possible in the output to have positions ending with a +, or to have them replaced by 'BeforeStart' or 'AfterEnd'. A position ending with a + (for example 650+) means that the position lands in an indel when aligning the new genome to the reference assembly. Because of this, the script cannot determine the exact position with certainty, and it is better to check the alignment manually and see which position makes the most sense. BeforeStart or AfterEnd simply means that the position is outside the limits of your current genome when aligning to the reference.

The alignment of the new genome to the reference will also be kept and will be named: <targetgenome.fasta>.convafa

Example Data

The folder TestData included in the RC454_SoftwarePackage contains all files required to test the scripts in this package as well as the expected result files. You can follow the instructions in the RC454Test_Commandlines.txt file to go through the process.
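Because convertPos output can contain positions that need a manual check (a trailing +, BeforeStart or AfterEnd), a small helper can list the features to review. The following Python sketch is not part of the package. The names `Position`, `parse_position` and `features_needing_review` are hypothetical. It assumes the output file keeps the tab-separated name/start/end layout of the input file, as the convertPos section describes.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Position(NamedTuple):
    value: Optional[int]  # None when the position falls outside the target genome
    status: str           # "exact", "in_indel", "before_start" or "after_end"


def parse_position(token: str) -> Position:
    """Interpret one position field from a convertPos output file."""
    token = token.strip()
    if token == "BeforeStart":
        return Position(None, "before_start")
    if token == "AfterEnd":
        return Position(None, "after_end")
    if token.endswith("+"):
        # A trailing '+' marks a position that landed in an indel.
        return Position(int(token[:-1]), "in_indel")
    return Position(int(token), "exact")


def features_needing_review(
    lines: Iterable[str],
) -> List[Tuple[str, Position, Position]]:
    """Return (name, start, end) for every feature with a non-exact position."""
    flagged = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        name, start, end = line.split("\t")[:3]
        start_pos, end_pos = parse_position(start), parse_position(end)
        if start_pos.status != "exact" or end_pos.status != "exact":
            flagged.append((name, start_pos, end_pos))
    return flagged
```

For example, `features_needing_review(open("genePositionsOutput.txt"))` would return the features whose coordinates should be checked against the .convafa alignment before they are passed to RC454 or V-Profiler.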