Download Documentation - Broad Institute

RC454 USERS GUIDE
 General Description 
RC454 is a tool to clean 454 reads based on their alignment to a reference
consensus assembly. The correction process is aggressive, and as such requires
an assembly that is highly representative of the population represented by the
read data. It is highly recommended that the user use a de novo assembly
generated from the read set being corrected. The reference should also not
contain homopolymer indels or other frame-shifts (unless of course these are
what is truly present in majority in the analyzed biological population).
This package includes the following scripts. A more detailed description of each
along with their options will follow:
1) RC454. RC454 is a program that takes a set of 454 read and quality files as
well as a consensus assembly for those reads and corrects for known 454 error
modes such as homopolymer indels and carry forward/incomplete extension
(CAFIE). It will also correct for any indel that breaks the reading frames, unless it
occurs in more than 25% of the reads (this option can be turned off). If an
optional list of sequencing primer positions is supplied, it will trim all reads that
start within these positions to start after the primer. RC454 uses Mosaik
(http://bioinformatics.bc.edu/marthlab/Mosaik) to align the corrected reads
between each step, and as such it is required to run the script.
2) runMosaik2 and the .qlx format (including samToQlx.pl and
qlxToSam.pl). runMosaik2 is a perl utility script that helps run Mosaik with the
various options required by RC454 and converts the output into the .qlx format,
which is used by V-Phaser and V-Profiler. The .qlx file format is described
in more detail in the runMosaik2 section below. RC454 can also output its results
in .sam format (see the bam option).
3) convertPos. This is a perl utility script that helps convert base positions
(such as the start and end of genes) in a particular consensus sequence to their
positions in another.
4) configPaths. This script modifies the paths in the other scripts to
make sure that everything can be found on your computer.
 Citing RC454 
Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, et al. (2012) Whole
Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor
Variants Upon Immune Recognition During Acute Infection. PLoS Pathogens
8(3): e1002529.
 Required External Software 
1) MUSCLE v3.8. The MUSCLE aligner (http://www.drive5.com/muscle/), which
is used by the other scripts when global alignments are required. Using a
different version of MUSCLE might require modifying command lines in the
scripts, but should not be a major endeavor.
2) Mosaik. Mosaik is a reference-guided assembler and short-read aligner using
a Smith-Waterman based algorithm. The read alignment process is used by
RC454. It can be found at http://bioinformatics.bc.edu/marthlab/Mosaik. The
version used and tested extensively with these other tools is 2.1.33. If you still
want to use the runMosaik.pl script from the previous release, the Mosaik version
used with it was 1.1.0013.
 Suggested External Software 
1) AV454. AV454 is a module of the Arachne assembler to assemble viral
genomes. It is currently one of the best assemblers available to handle diverse
viral genomes with 454 data. The Arachne assembler can be obtained at
http://www.broadinstitute.org/crd/wiki/index.php/Arachne_Main_Page. All the
documentation on how to install and run the assembler can be found there too.
RC454 is designed to run on an assembly of the reads, not an external/reference
sequence. Another assembler of your choice could be used; no script in the
package directly requires AV454 to function. AV454 is the assembler that gave
us the best results while developing the software. Before running the rest of the
tool-kit, it is preferable to check whether the assembly needs manual curation.
 Script Usage 
1) RC454:
RC454 is a program that takes a set of 454 read and quality files, a consensus
assembly for those reads as well as their alignment to the assembly in .qlx format
(see runMosaik2.pl section) and corrects for known 454 error modes such as
homopolymer indels and carry forward/incomplete extension (CAFIE). It will also
correct for any indel that breaks the reading frames, unless it occurs in more than
25% of the reads (this option can be turned off). If you supply a list of primers
and their positions, it will also trim each read that starts in these positions to start
after the primer. RC454 uses Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik)
to align the corrected reads between each step, and it is therefore required to run
the script.
RC454 corrects reads aggressively. It is designed to be conservative (not
allowing suspicious variants in) rather than to capture the maximum number of
variants possible. It is therefore important to align your reads against an
assembly built from those same reads rather than against a separate reference
genome, because misalignments can have a significant effect on the result.
NOTE : The previous version of rc454.pl using runMosaik.pl and
version 1.1 from Mosaik has been renamed rc454_mosaik1.pl
Command Line :
The basic command line for running RC454 is:
rc454.pl <inputalign.qlx> <reads.fasta> <reads.qual> <assembly.fasta>
<output>
Options :
There are multiple switches that can be used (shown with their default values):
minhomosize => 2
Minimum size of a homopolymer to be considered as such in the homopolymer
correction step. Default is 2, which means that a case like:
CAAT
CA-T
would get corrected.
nqvalue => 1
Q-Score given to an N that gets added by the script in the quality file. Default is 1.
The important part here is that all bases that have the nqvalue will get ignored
when looking at the neighborhood of the base in the NQS filter. This is to prevent
an inserted N from ‘flagging down’ all adjacent bases. The score 1 is chosen by
default because usually no base gets assigned a score this low.
nocafie => 0
If the -nocafie flag is turned on, the CAFIE correction step is skipped and
CAFIE errors will not get corrected.
nqsmainqual => 20
Minimum quality required for a base to pass the NQS filter
nqsareaqual => 15
Minimum quality required for the neighborhood bases to pass NQS
nqssize => 5
Size of the neighborhood on each side of the central base that is considered for
the NQS filter
details => 0
If the -details switch is on, all intermediate files are kept. Mostly valuable for
debugging purposes or if you want to see which variants were corrected by each
step. The details switch also creates a data dump of all changes that were made
by the script in the reads.
slicesize => 5
Parameter used as part of the algorithm to determine how many times a variant
is seen. It is basically how many bases around the variant you watch on each
side on the read to see if the sequence remains the same.
gap3window => 10
Size of the window used to determine if a gap can form a multiple of 3 when joined
with others. The window is the number of aligned bases (including gaps) that will be
looked at on each side of the central gap being analyzed.
primers => ''
A file can be given containing the primer positions (in the same format as a gene
list). If this option is activated, every read that starts inside the primer will be
trimmed to remove the primer sequence. This is to prevent bias that could be
caused by sequencing the primer sequence.
primerbuffer => 2
When specifying a primer file, the primer buffer is added to the positions of
the primer so that any read starting within the primer position OR the primer
position +/- primer buffer will be trimmed. This compensates for occasional
misalignments or homopolymer miscalls at the end of the primer sequence that
could modify its position.
minpctvar => 0.25
Minimum percentage of the reads (expressed as a fraction, so 0.25 means 25%) that
must have a variant (including indels) in order for this particular variant to be
ignored when correcting for non-homopolymer errors. The number of times you see
the variant must exceed both the minpctvar and minnbvar values in order to pass
correction (if you want to ignore one or the other, simply set its value to 0).
minnbvar => 5
Minimum number of reads that must have a variant (including indels) in order for
this particular variant to be ignored when correcting for non-homopolymer errors.
The number of times you see the variant must exceed both the minpctvar and
minnbvar values in order to pass correction (if you want to ignore one or the
other, simply set its value to 0).
genelist => ''
File that contains the list of genes in the current genome. If you supply a gene
list, the corrections to prevent gaps from breaking the reading frame will only be
applied within the genes listed and will ignore the rest of the genome.
noorf => ''
Setting the noorf flag will turn off the third step of the read clean-up entirely, which
means that only clear CAFIE and homopolymer mistakes will be corrected; other
gaps breaking the reading frame will be ignored.
bam => ''
Setting the bam flag will return .sam and .bam output files for the read alignment,
otherwise those will be removed at the end of the run.
noclean => ''
Setting the noclean flag will delete the final cleaned reads and quals file upon
termination of the program, only keeping the cleaned read alignment.
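The way minpctvar and minnbvar combine can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not code taken from rc454.pl: the function name variant_is_kept and its arguments are hypothetical, and the sketch assumes that "exceed" means strictly greater than.

```python
def variant_is_kept(variant_reads, total_reads, minpctvar=0.25, minnbvar=5):
    """Return True if a variant is frequent enough to be left uncorrected.

    Hypothetical helper: both thresholds must be exceeded, and setting
    either one to 0 effectively disables it for any variant seen at least once.
    """
    if total_reads <= 0:
        return False
    fraction = variant_reads / total_reads
    # Assumption: "exceed" is read as a strict comparison.
    return fraction > minpctvar and variant_reads > minnbvar


# 30 of 100 reads: above 25% and above 5 reads, so the variant is kept.
print(variant_is_kept(30, 100))
# 4 of 8 reads: 50% of the reads, but only 4 reads, so it is corrected.
print(variant_is_kept(4, 8))
```

With the defaults, a variant in a low-coverage region can be corrected even when it is present in most reads, because the minnbvar threshold is not reached.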
Output Files:
RC454 has 3 main output files:
a) <output>_final.qlx:
This is the final .qlx file with all the cleaned reads aligned.
b) <output>_cleaned.fasta:
This is the final cleaned read set in fasta format
c) <output>_cleaned.qual:
This is the final cleaned qual set matching the reads
If you use the -details option, you will also get a list of all rejected reads and
quals, and a set of fasta, qual and qlx files for each step, named
_ie.qlx/fasta/qual after the CAFIE correction step and _homo.qlx/fasta/qual after
the homopolymer correction step. A file named _changes.txt will also be
created, containing a list of all corrections made to each read over the course of
the program.
2) runMosaik2 and the .qlx file format:
runMosaik2.pl is a utility script that serves two purposes. The first is to act as a
wrapper script for MosaikBuild and MosaikAligner. The second, when the -qlx
parameter is set, is to call the samToQlx.pl script to generate, from the .sam file
output by Mosaik, the .qlx file format that is used by RC454, V-Phaser and
V-Profiler.
The .qlx file format:
The .qlx file format is a read alignment format based on .axt that includes quality
information for each base as well as a quality flag to indicate if a base passed the
Neighborhood Quality Score (NQS) criteria or not, which is used by V-Phaser to
call trusted variants. The NQS is calculated for each base of a read depending
on its quality and the quality of the bases adjacent to it. The base must pass a
minimum quality score of q and its n adjacent bases on each side must pass a
quality score of q’. By default those values are q = 20, n = 5 and q’ = 15.
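The NQS rule described above can be sketched in Python as follows. This is an illustration only and not the code used by the scripts: the function name nqs_flags is hypothetical, the parameter names simply mirror the nqsmainqual, nqsareaqual, nqssize and nqvalue options, and two details are assumptions, namely that "pass" means greater than or equal to the threshold and that the neighborhood is truncated at the ends of the read.

```python
def nqs_flags(quals, nqsmainqual=20, nqsareaqual=15, nqssize=5, nqvalue=1):
    """Return one boolean per base: True if the base passes the NQS criteria.

    quals is a list of integer quality scores for one read.
    """
    flags = []
    for i, q in enumerate(quals):
        lo = max(0, i - nqssize)               # assumption: truncate at read start
        hi = min(len(quals), i + nqssize + 1)  # assumption: truncate at read end
        neighbours = [quals[j] for j in range(lo, hi) if j != i]
        # Bases carrying nqvalue (inserted Ns) are ignored in the neighborhood,
        # so that an inserted N does not flag down the adjacent bases.
        neighbours = [x for x in neighbours if x != nqvalue]
        flags.append(q >= nqsmainqual and all(x >= nqsareaqual for x in neighbours))
    return flags


# A read with one inserted N (quality 1) in the middle: only the N itself fails.
print(nqs_flags([30] * 5 + [1] + [30] * 5))
```

A single genuinely low-quality base (for example quality 10) in the same position would instead fail every base within nqssize positions of it.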
The format is the following:
>Read [ReadID] [read start] [read stop] [read length] [strand] [reference name] [reference start] [reference stop]
Reference Sequence
Read Sequence
NQS String
Quality String (in ASCII format, the ASCII character being Quality Score + 33)
Note that the reference here is whatever sequence you aligned against.
Each line above corresponds to one line in the .qlx file (i.e., reads contain no
newline breaks within their sequence). The fields of the header line are
separated by single spaces.
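As a rough illustration of the layout above, the following Python sketch parses one five-line .qlx record. It is not part of the package. The names QlxRecord and parse_qlx_record are hypothetical, and the sketch assumes that the header begins with the literal token ">Read" followed by the eight fields. How gap columns are represented in the NQS and quality strings is not described here, so every quality character is simply decoded as its ASCII code minus 33.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QlxRecord:
    read_id: str
    read_start: int
    read_stop: int
    read_length: int
    strand: str
    ref_name: str
    ref_start: int
    ref_stop: int
    ref_seq: str
    read_seq: str
    nqs: str
    quals: List[int]


def parse_qlx_record(lines):
    """Parse the five lines of one .qlx record into a QlxRecord."""
    if len(lines) < 5:
        raise ValueError("a .qlx record needs five lines")
    header, ref_seq, read_seq, nqs, qual_str = (l.rstrip("\r\n") for l in lines[:5])
    fields = header.split(" ")  # header fields are separated by single spaces
    # Assumption: the first token is the literal ">Read".
    if len(fields) != 9 or fields[0] != ">Read":
        raise ValueError("unexpected .qlx header: " + header)
    _, read_id, r_start, r_stop, r_len, strand, ref_name, f_start, f_stop = fields
    return QlxRecord(
        read_id, int(r_start), int(r_stop), int(r_len), strand,
        ref_name, int(f_start), int(f_stop),
        ref_seq, read_seq, nqs,
        [ord(c) - 33 for c in qual_str],  # ASCII character = quality score + 33
    )


# Made-up record for illustration; the NQS string content is a placeholder.
record = parse_qlx_record([
    ">Read r1 1 4 4 + contig1 10 13",
    "ACGT",
    "ACGT",
    "1111",
    "5II+",
])
print(record.quals)  # [20, 40, 40, 10]
```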
Two scripts included in the package, samToQlx.pl and qlxToSam.pl, allow one to
convert the .qlx files to or from the .sam format (using samtools can then allow
conversion to/from bam). Note that converting a .qlx to sam and then back will
result in a new .qlx identical to the original, but the same is not true for sam to
.qlx and back to sam, as the sam format contains data not stored in the .qlx.
You can use them with:
perl qlxToSam.pl <qlxinput.qlx> <reference.fasta> <samoutput.sam> -readfa <reads.fasta> -readq <reads.qual>
perl samToQlx.pl <saminput.sam> <reference.fasta> <qlxoutput.qlx>
Command Line :
runMosaik2 can take multiple input formats. It can work either with reads in
fasta and qual formats (often returned by 454 sequencing), paired or not, or with
reads in fastq format (often returned by Illumina sequencing), paired or not. In the
command line, include either -fa/-qual, -fa/-fa2/-qual/-qual2, -fq or -fq/-fq2.
The main command line for runMosaik2.pl is:
perl runMosaik2.pl -fa <reads.fasta> -qual <reads.qual> -ref <assembly/reference.fasta> -o <output>
To use with 454 data and RC454, it is highly recommended to use the
-param454 and -qlx parameters.
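When the alignment is launched from another program, the recommended 454 command line can be assembled as an argument list. The Python sketch below only builds and prints the command; the function name build_runmosaik2_command and the file names are hypothetical, and it assumes that -param454 and -qlx are bare flags that take no value.

```python
import shlex


def build_runmosaik2_command(reads_fasta, reads_qual, reference_fasta, output,
                             script="runMosaik2.pl"):
    """Build the argument list for an unpaired 454 run with .qlx output."""
    return [
        "perl", script,
        "-fa", reads_fasta,
        "-qual", reads_qual,
        "-ref", reference_fasta,
        "-o", output,
        "-param454",  # assumption: bare flag, no value
        "-qlx",       # assumption: bare flag, no value
    ]


cmd = build_runmosaik2_command("reads.fasta", "reads.qual", "assembly.fasta", "sample1")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems when file names contain spaces.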
Input File:
-fa <reads.fasta> : reads in fasta format
-qual <reads.qual> : read quality scores in space-delimited integer format
or
-fa <reads.fasta> : first mates of paired reads in fasta format
-qual <reads.qual> : quality scores of the first mates in space-delimited integer format
-fa2 <reads2.fasta> : second mates of paired reads in fasta format
-qual2 <reads2.qual> : quality scores of the second mates in space-delimited integer format
or
-fq <reads.fastq> : reads in fastq format
or
-fq <reads.fastq> : first mates in paired reads in fastq format
-fq2 <reads2.fastq> : second mates in paired reads in fastq format
-ref <assembly/reference.fasta> : assembly or reference that you want to align
against, in fasta format
Options:
hs => 10
Hash size for Mosaik alignment. See the documentation for more details.
act => 15
Alignment candidate threshold for Mosaik alignment. See the documentation for
more details.
mmp => 0.25
Maximum percentage of the read length that can be errors
minp => 0.25
Minimum percentage of a read that has to be aligned to keep it
ms => 10
Match score of the Smith-Waterman algorithm.
mms => -9
Mismatch penalty of the Smith-Waterman algorithm.
hgop => 20
Penalty for opening a gap in a homopolymer for the SW algorithm
gop => 40
Gap opening penalty for the SW algorithm
gep => 10
Gap extending penalty for the SW algorithm
nqsmq => 20
Minimum quality required for a base to pass the NQS filter
nqsaq => 15
Minimum quality required for the neighborhood bases to pass NQS
nqssize => 5
Size of the neighborhood on each side of the central base that is considered for
the NQS filter
bw => 29
Uses the banded Smith-Waterman algorithm. This greatly increases alignment
speed, but seems to slightly reduce accuracy in highly diverse samples.
nqvalue => 1
Q-Score given to an N that gets added by the script in the quality file. Default is 1.
The important part here is that all bases that have the nqvalue will get ignored
when looking at the neighborhood of the base in the NQS filter. This is to prevent
an inserted N from ‘flagging down’ all adjacent bases. The score 1 is chosen by
default because usually no base gets assigned a score this low.
-m => ‘unique’
Only keeps uniquely aligned reads
st => 'illumina'
Sequencing technology. See Mosaik manual for more details.
qlx => ‘’
Generates the qlx format. Requires samToQlx.pl
qlxonly => 0
Only keeps the .qlx output and not the .bam or .sam
fakequals => 0
If set to a positive integer number, will fake the quality of all nucleotides to the
given value. This speeds up the creation of the .qlx file greatly if you do not care
about the quality scores.
mfl => 600
Median fragment length. Only necessary for paired reads.
param454 => ‘’
Sets the following parameters to values that are suitable for 454:
-gop <15>
-hgop <4>
-gep <6.66>
-st <454>
paramillu => ‘’
Sets the following parameters to values that are suitable for Illumina:
-gop <40>
-hgop <20>
-gep <10>
annpe => ''
Network file for paired-end reads. This file is included in the Mosaik distribution
and should be located in the networkFile folder. For Mosaik 2.1.26, this file is named:
2.1.26.pe.100.0065.ann
annse => ''
Network file for single-end reads. This file is included in the Mosaik distribution
and should be located in the networkFile folder. For Mosaik 2.1.26, this file is named:
2.1.26.se.100.005.ann
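The effect of the param454 and paramillu presets on the gap penalties can be summarized with a small Python sketch. It only restates the values listed above and is not taken from runMosaik2.pl. The names DEFAULTS, PARAM454, PARAMILLU and effective_params are hypothetical, and because the behavior when both presets are given is not documented here, the sketch rejects that combination.

```python
# Default values as listed in the options above.
DEFAULTS = {"gop": 40, "hgop": 20, "gep": 10, "st": "illumina"}
# Values set by the two presets.
PARAM454 = {"gop": 15, "hgop": 4, "gep": 6.66, "st": "454"}
PARAMILLU = {"gop": 40, "hgop": 20, "gep": 10}


def effective_params(param454=False, paramillu=False):
    """Return the gap penalties and technology after applying a preset."""
    if param454 and paramillu:
        # Assumption: the precedence is undocumented, so refuse the combination.
        raise ValueError("choose at most one of param454 and paramillu")
    params = dict(DEFAULTS)
    if param454:
        params.update(PARAM454)
    if paramillu:
        params.update(PARAMILLU)
    return params


print(effective_params(param454=True))
```

Note that the Illumina preset matches the listed defaults for gop, hgop and gep, so only the 454 preset changes the alignment penalties relative to a run without presets.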
Outputs:
<output>.qlx : .qlx alignment file
<output>.bam : bam format output file
<output>.sam : sam format output file
3) configPaths:
configPaths.pl is more of an installer than anything else. It modifies the hard-coded
paths in all the scripts (vphaser.pl, vprofiler.pl, convertPos.pl, rc454.pl) so that they
can find the required programs on your system.
Command Line :
perl configPaths.pl <configfile.txt>
Input File:
The input file contains the list of paths for each variable.
scriptpath = '<scriptpath>'
mosaikpath = '<mosaikpath>'
mosaik1path = '<mosaik1path>'
musclepath = '<musclepath>'
perlpath = '<perlpath>'
Rpath = '<Rpath>'
samtoolspath = '<samtoolspath>'
mosaiknetworkpath = '<mosaiknetworkpath>'
All the scripts in this package should be located in scriptpath.
mosaikpath is the path for Mosaik v2 for alignment with RC454 (not necessary if
using version 1)
mosaik1path is the path for Mosaik v1 for alignment with RC454 (not necessary if
using version 2)
musclepath is the path where you have muscle installed. If you are not sure that
your version of muscle will be compatible with the scripts (they were developed
on v3.8), you can download v3.8 from their website at
http://www.drive5.com/muscle/downloads.htm
perlpath is the path where perl is located. Note that you may also need to change
permissions on the scripts if you want to use #! execution instead of invoking perl
explicitly.
Rpath is the path for R 2.9 or higher to create the Heatmaps
samtoolspath is the path for Samtools to manipulate bam and sam file formats
mosaiknetworkpath is the path for Mosaik’s network files for alignments from
contigMerger.pl and contigs2assembly.pl. This should be the networkFile folder
of your Mosaik distribution
Note that if you have the paths in your environment variables and that you can
run the different programs like R or muscle without specifying any path when
typing on a command line, you do not need to specify a path here either. The
only one that absolutely requires a path is scriptpath.
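A configuration file in the format shown above can be generated with a short Python sketch such as the one below. It is an illustration only: the function name render_config and the example path are hypothetical, and writing an empty value for every path that is not supplied is an assumption about how configPaths.pl treats unspecified entries.

```python
CONFIG_KEYS = [
    "scriptpath", "mosaikpath", "mosaik1path", "musclepath",
    "perlpath", "Rpath", "samtoolspath", "mosaiknetworkpath",
]


def render_config(paths):
    """Return the text of a configPaths input file from a dict of paths."""
    if not paths.get("scriptpath"):
        raise ValueError("scriptpath is the only path that is always required")
    unknown = set(paths) - set(CONFIG_KEYS)
    if unknown:
        raise ValueError("unknown variables: " + ", ".join(sorted(unknown)))
    lines = []
    for key in CONFIG_KEYS:
        value = paths.get(key, "")  # assumption: empty when found via PATH
        lines.append("{} = '{}'".format(key, value))
    return "\n".join(lines) + "\n"


# Hypothetical location of the scripts; other programs are assumed to be on PATH.
print(render_config({"scriptpath": "/opt/rc454/scripts"}))
```

The resulting text can be saved to a file and passed to configPaths.pl as <configfile.txt>.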
Output File:
There is no output file for configPaths. It will modify the paths in the various
scripts in this package.
4) convertPos:
convertPos.pl is a script that takes gene (or any feature) positions from a
particular genome assembly and returns the corresponding positions in another
genome that is aligned to the first one. It is meant to quickly generate files
containing the gene or haplotype positions for V-Profiler or RC454 when you
have already determined them for one genome.
Command Line :
perl convertPos.pl <genePositionsInput.txt> <reference.fasta>
<targetgenome.fasta> <genePositionsOutput.txt>
Input File:
The <genePositionsInput.txt> input file contains the name of the gene (or feature)
and its start and end positions in the genome, such as:
Gene1 –tab- StartPos1 –tab- EndPos1
Gene2 –tab- StartPos2 –tab- EndPos2
…
The reference file is a fasta file of the genome to which the positions in
genePositionsInput refer; the target genome will be aligned against it.
The targetgenome file is a fasta file of the genome for which you want the gene
positions.
The genePositionsOutput.txt is the output file.
Output Files:
The script will generate a file identical to the input file except for the modified
positions.
Note that it is possible in the output to have positions ending with a +, or to have
them replaced by ‘BeforeStart’ or ‘AfterEnd’.
A position ending with a + (for example 650+) means that the position lands in an
indel when aligning the new genome to the reference assembly. In that case the
script cannot determine the exact position with certainty, and it is better to check
the alignment manually and see which position makes the most sense.
BeforeStart or AfterEnd simply means that the position falls outside the limits
of your current genome when aligning to the reference.
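When the output file is read by another program, the three special cases above need to be handled explicitly. The Python sketch below shows one way to do so. The function names and the status labels "exact" and "in_indel" are hypothetical, and the sketch assumes that the output keeps the tab-separated name, start and end layout of the input file.

```python
def parse_converted_position(token):
    """Classify one position field from the convertPos output."""
    if token in ("BeforeStart", "AfterEnd"):
        # The position is outside the limits of the target genome.
        return {"position": None, "status": token}
    if token.endswith("+"):
        # The position lands in an indel; check the alignment manually.
        return {"position": int(token[:-1]), "status": "in_indel"}
    return {"position": int(token), "status": "exact"}


def read_positions(lines):
    """Yield (name, start, end) for each tab-separated line of the output."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        name, start, end = line.split("\t")
        yield name, parse_converted_position(start), parse_converted_position(end)


# Made-up output lines for illustration.
for name, start, end in read_positions(["Gene1\t650+\t1200", "Gene2\tBeforeStart\t300"]):
    print(name, start["status"], end["status"])
```

Features flagged as in_indel, BeforeStart or AfterEnd are the ones worth checking against the .convafa alignment before the positions are used.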
The alignment of the new genome to the reference will also be kept and will be
named : <targetgenome.fasta>.convafa
 Example Data 
The folder TestData included in the RC454_SoftwarePackage contains all files
required to test the scripts in this package as well as the expected result files.
You can follow the instructions in the RC454Test_Commandlines.txt file to go
through the process.