Download Running External Programs using Perl

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Relational model wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Object-relational impedance mismatch wikipedia , lookup

Transcript
Running External Programs using
Perl
Running External Programs with Perl
• Many clever people have written software for use in
bioinformatics. Often they have used sophisticated
algorithms, and tested them with a wide variety of
data.
• Since we want to avoid endlessly re-inventing and
debugging programs that do standard procedures, we
want to use their programs.
• But, we want to use Perl to get the input data into the
proper form, parse the output, and (often) run the
program many different times with slight variations.
• Most common program: BLAST, which we will use as an
example.
BLAST+
• Recently, NCBI has “improved” the BLAST interface by heavily
revising the syntax needed to invoke the program. The new
version is called BLAST+.
– Manual: http://www.ncbi.nlm.nih.gov/books/NBK1763/
– It is still possible use the old methods, by using the legacy_blast.pl
program. An example:
legacy_blast.pl blastall –I query -d nr -o blast.out
– However, it probably best to just adjust to the new syntax.
makeblastdb
• The subject sequences for a BLAST search must be put into a
special format, a BLAST database, before they can be used.
The makeblastdb program is used for this purpose
– This was formerly done with the program formatdb.
• Only three input parameters of note:
– -in specifies the FASTA-formatted input file, usually containing multiple
sequences.
– -out specifies the name of the blast database you wish to create
– -dbtype specifies the molecule type: “nucl” for DNA or RNA and “prot”
for protein.
• Example:
makeblastdb -in scaSeq.fasta -out scaSeq -dbtype nucl
BLAST Programs
• In the past, “blastall” was the basic command, with the particular
type of blast specified with the –p option.
• With BLAST+, the basic commands are blastn, blastp, blastx, tblastx,
tblastn, psiblast, rpsblast, and rpstblastn.
• Within the blastn and blastp basic commands are several options,
specified with the –task flag.
– For blastp,
• -task blastp is the standard blastp program,
• –task blastp-short is optimized for queries shorted than 30 amino acids
– For blastn,
• -task blastn is the standard blastn program,
• –task blastn-short is optimized for queries shorted than 50 nucleotides
• -task megablast uses a BLAST program with a linear gap penalty instead of an
affine gap penalty; it is used for matching very similar sequences and it is
much faster than blastn
• -task dc-megablast is discontinuous megablast, used for more distant matches.
Necessary Parameters
• All blast programs have several necessary parameters
– -db specifies the BLAST database to be used
– -query specifies the FAST-formatted file to be used as the query sequence
– -out specifies a file name for the output. In the absence of a –out file, the
output goes to STDOUT.
BLAST Options
• It is also possible to control parameters of BLAST itself. There are lots of
options, but here are some I have found useful:
– -evalue specifies that worst e-value that is reported. The default is 10, which is way too
much, so I often use –evalue 1e-3, which is more realistic.
– -matrix specifies the scoring matrix. For blastp with is BLOSUM62
– -max_target_seqs specifies the maximum number of subject sequences that will appear in
the output. Note that you can have more than one hit on the same subject sequence, and
all of them count as just a single target. But, this is often quite useful if you just want the
top hit.
– -seg no turns off the SEG filter, for simple sequences
– -dust no turns off the DUST simple sequence filter for blastn.
– You can also fool around with the gap penalties, word sizes, and other parameters if you
like. Not worth the trouble, in my opinion.
Output Formatting
• Output format can be specified with the –outfmt option.
– The default value (which is 0) gives the standard pairwise alignment format used in the
old BLAST program
– For tabular output equivalent to the old –m8 output, use –outfmt 6
• A list of all possible options can be found by using the command “blastn –
help”. For tabular output, it is possible to specify exactly which pieces of
data you will see as well as their order.
– The default for –outfmt 6 is : -outfmt 6 'qseqid sseqid pident length mismatch gapopen
qstart qend sstart send evalue bitscore‘
– There are quite a few other data items you can see if you like.
“system” command
•
The most commonly used Perl command for running external programs is
“system”. This command executes the program specified by its arguments, then
returns control to the next line in your Perl program.
•
“system” returns the signal number resulting from the process (program) it
executed. If all went well, this is 0. Numbers other than 0 indicate some kind of
error.
– You should always test the results of interacting with the external world.
– Thus, reasonable syntax for using the system command:
if ( system(“whatever see below”) != 0) {
die “error executing system command\n”;
}
– Surprisingly often you will find that you get an error when you first start running an
external program, due to lack of proper permissions, or files not being where you
thought they were, or improper syntax, etc.
More on the system command
• The simplest way to use “system” is to simply enclose the
command line you need in quotes:
system( “blastn -task megablast -db all_genes -query temp_qu_file
-out temp_out_file -outfmt 6 -evalue 1e-3 -perc_identity 97” )
– The above line invokes the bash shell to interpret the command line,
converting each group of symbols separated by spaces into a separate
argument.
• You can avoid invoking the shell at all (a somewhat more
secure method), by separating out the individual arguments
yourself:
system( “blastall”, “–p blastn”, “–I my_input_file”, “–d my_database”,
“–o my_blast_output.txt” )
– I rarely do this.
Backticks
• Another way to run external programs is to enclose the command line in
backticks.
– the backtick looks like: `. It is on the keyboard just to the left of 1, same key as
~.
– The program’s output to STDOUT is the return value of the command enclosed
in backticks:
$output_string = ` blastn –task blastn –i my_input_file –d my_database`;
– In this case, the output string needs to be split at newline characters, since the
entire output is contained with in a single string.
my @output_lines = split /\n/, $output_string;
– Backticks are very useful when the program you want to run doesn’t have the
option of writing to a file.
BLAST types
•
•
•
A BLAST search involves taking a single sequence (the query sequence) and comparing it to a
database of other sequences (subject sequences), then reporting the best matches.
In general, protein searches are more capable of detecting more distant and imperfect
matches than DNA searches, because synonymous and similar amino acids are more
conserved evolutionarily than are nucleotide sequences.
A number of different BLAST programs are available. The main ones differ primarily by the
type of query and subject sequences used:
–
–
–
–
–
•
blastn: query is DNA, subject is DNA
blastp: query is protein, subject is protein
blastx: query is nucleic acid that is translated by the program into protein sequences (all 6 reading
frames); subject database is protein. Useful for imperfect query sequences such as ESTs
tblastn: query is protein; database is DNA translated into protein sequences in all 6 reading frames.
Useful for low quality databases.
tblastx: query is DNA translated into protein, subject is nucleotide translated into protein. Both are
translated into all 6 frames. It is very slow relative to the other BLAST types.
Several other types of BLAST search exist: see the NCBI site FAQs to learn more about them.
Query Sequence
•
•
•
•
•
•
The query sequence must be written in “FASTA” format. This consists of a single comment line that starts
with a >, followed by one or more lines of sequence.
Comments usually start with a unique identifier for the sequence, then some other descriptive
information. To save trouble later, the unique identifier should contain letters, numbers, undescores,
hyphens and periods only, and contain no spaces. The unique identifier ends at the first space.
The sequence should contain only sequence characters: no spaces or numbers allowed.
Either upper or lower case is acceptable,
Some suggest that lines of 80 or fewer characters should be used. I prefer to put the entire sequence on a
single line, which can be quite long.
Nucleic acid sequences can be either DNA or RNA. In general, A, C, G, T and U are used, along with N for
“any base” and a single – (hyphen) for gaps of any length.
–
–
Some sequences also contain a variety of other letters, such as R for pyrimidine or M for amino-containing. These are
not widely used and may well confuse others.
However, the existence of these characters makes it necessary that you check unknown sequences for the presence of
letters other than ACGTUN-.
•
Protein sequence use the standard 20 amino acid single letter codes. In addition, X means any amino acid,
* is the translational stop codon at the end of the sequence, and there are a few other oddities.
•
Note that FASTA format is not rigidly defined, and it is always worth checking sequence files for what
exactly is being used as a format, and the information for any new program for what is an acceptable
format.
FASTA format examples
•
>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans
LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNKDYECIQPHTALTG
WEPSNVPALWKYVGEGTGGGTPTPDTTPPTVPAGLTSSLVTDTSVNLTWASTDNVGVTGYEV
YRNGTLVANTSTTTAVVTGLTAGTTYVFTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKW
LIGYWHNFDNGSTNIKLRNVSTAYDVINVSFAEPISPGSGTLAFTPYNATVEEFKSDIAYLQSQG
KKVLISMGGANGRIELTDATKKRQQFEDSLKSIISTYGFNGLDIDLEGSSLSLNAGDTDFRSPTTP
KIVNLINGVKALKSHFGANFVLTAAPETAYVQGGYLNYGGPWGAYLPVIHALRNDLTLLHVQH
YNTGSMVGLDGRSYAQGTADFHVAMAQMLLQGFNVGGSSGPFFSPLRPDQIAIGVPASQQ
AAGGGYTAPAELQKALNYLIKGVSYGGSYTLRQLRAMSVSRAL
•
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCT
ATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGC
CCCCCGGGCCCGTGCCCGCCGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCT
GAGTTGATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGGC
Subject Databases
•
•
•
•
•
BLAST uses a special database format to speed up the search operation.
Several pre-packaged databases exist, most notably “nr” which is the nonredundant database consisting of all sequences in GenBank. See the BLAST
program selection guide at http://www.ncbi.nlm.nih.gov/blast/producttable.shtml
.
We mostly make our own databases, which is done with the “formatdb” program.
Useful information can be found in the file
/home/share/blast.dir/README.formatdb.
Databases are located either in the same directory you are running BLAST from, or
in /home/share/blast.dir/data.
To use the /home/share/blast.dir/data directory without having to specify the
entire path every time, you need to create a file called “.ncbirc” in your home
directory that contains these two lines:
[NCBI]
Data=/home/share/blast.dir/data
Tabular BLAST Output
•
•
•
•
The –m 8 option gives a simple table as output. Often this information is all you
need.
The table is tab-separated, with each hit on a separate line.
You can easily parse these data by splitting each line at tabs: my @data = split /\t/;
Starting at 0, the columns are:
0: query name
1: subject name
2: percent identities
3: aligned length
4: number of mismatched positions
5: number of gap positions
6: query sequence start
7: query sequence end
8: subject sequence start
9: subject sequence end
10: e-value
11: bit score
More Tabular BLAST Output
•
•
•
•
•
•
•
The query and subject names come from the FASTA comment lines: everything up to the first space
character.
Aligned length can include gaps in both the query and the subject, so it is not necessarily the same length
as either of these sequences.
Percent identities looks at every position in the aligned sequences and counts the number that have the
same base or amino acid. This count is then divided by the aligned length and then multiplied by 100.
Mismatches are positions in the aligned sequences where both query and subject have a base or amino
acid. Gap positions are not included.
The number of gap positions is the number of positions in the aligned sequence with a base or amino acid
in one sequence and a hyphen (gap) in the other. (NOT the number of gaps regardless of length).
The query sequence start and end positions are always written with the start less than the end. However,
the subject sequences can have the start position greater than the end, if the alignment is on the opposite
DNA strand from the way the subject sequence was originally put into the database.
The e-value is the expected number of random sequences that would have at least this good a match in a
random database of the same size as the subject. Thus, an e-value of 1 means that 1 match would be
expected to occur by chance alone using this database.
–
–
•
•
It is worth noting that a probability of 0.05 is often considered significant in biological experiments. This value would
almost never be considered a significant hit in BLAST searches. Perhaps a rule-of-thumb worst significant score would
be 10-6 (1e-6).
Scores less than 1e-180 are given a value of 0.0.
The bit score is a normalized (corrected for the mismatch and gap penalties) version of the score derived
from the Smith-Waterman search. It is related to the e-value through an exponential function that also
takes into account the length of both the query and the subject database. That is, the e-value already
takes into account query and subject lengths, while the bit score doesn’t.
I prefer to deal with e-values.
BLAST Output
• If you don’t specify –m 8, you get a longer, detailed output file.
• The output starts with basic info, identifier for query (FASTA comment line
up to the first space) and the database you listed on the command line
with the –d option.
• Then a summary of each database sequence that had a significant
alignment(s). Two numbers appear: the bit score (an integer) and the evalue (a floating point number with a minimum of 0 and a maximum of
whatever you set –e to be, defaulting to 10.0).
• The summary lines add scores for all hits within each subject sequence.
This is useful and meaningful if the subject database contains discrete
sequences such as individual genes or peptides. However, if the subject
database is genomic chromosomal sequences, the summary lines can be
quite misleading.
• After the summary lines are detailed views of each hit.
More BLAST Output: Detailed View
•
•
At the start of the details for each subject sequence, there is a > in the first column, followed by the
subject’s FASTA comment line. The next line gives the subject sequence’s length.
Note that there can be multiple hits on each subject, and each will be listed separately following these
subject identifier and length lines.
•
Then, some summary information about this hit. For example:
Score = 1594 bits (804), Expect = 0.0
Identities = 804/804 (100%)
Strand = Plus / Plus
•
•
•
The Score is the raw bit score followed by the normalized bit score in parentheses. If you use bit scores to
compare sequences, you want to use the normalized bit score.
Expect is the e-value.
Identities is the number of matching positions between the aligned sequences, followed by the
percentage of identical bases.
–
–
–
•
Sometimes the Identities line will also have a “Gaps = “ section, which gives the number of gap positions.
For protein alignments, there is also a “Positives = “ section, which gives the number of amino acid matches that are
not identical but which count as similar in the scoring matrix (e.g. isolecine and valine).
The order on this line is Identities, Positives, Gaps. However, Positives and/or Gaps may be absent.
Strand is the orientation of the query sequence, then the subject sequence. Query is always “Plus”, but
the subject will be “Minus” if the alignment is on the opposite strand from the original sequence put into
the subject database. This line appears in nucleic acid searches but not in protein searches (which only
match
More BLAST Output: Detailed View
• After the summary lines, the aligned sequence appears, in
groups of 3 lines. The query sequence is on top, followed by a
line indicating matches, followed by the subject line.
• For blastn, the match line just consists of vertical lines where
the bases are identical and a blank space where they are not
identical.
• For blastp, the match line gives the amino acid code for
indentical amino acids, a “+” for similar (positives) amino
acids, and a blank space otherwise.
• After all hits have been displayed, the end of the output file
has some statistical information about the subject database
and parameters of the blast search.
Advanced Parsing of BLAST
output
Parsing BLAST Output
•
•
•
•
•
•
•
Once you have run BLAST, its output is in a file. You need to open this file, then read it lineby-line with a while statement.
Because you are reading line-by-line, most of the information you gather needs to be stored
in global variables: variables declared with “my” outside the while loop.
Each new line needs to be examined for the type of information it contains using regular
expressions. It is important to be able to identify the start and end points of various sections,
and respond appropriately.
Most information will be gathered using regular expression memory parentheses.
Variables need to be re-initialized at the end of each section (or the beginning of the next
one if the end is not easily discerned).
We will store the information in a hash whose keys are the subject identifiers. All hits for that
subject will be stored in an array.
Assuming that you want the information in the detailed analysis of each hit, note that there
are 3 important sections:
–
–
–
1. each new subject sequence starts with a “>” in the first column, followed by the rest of its FASTA
comment line. This can spill over multiple lines in the output. Immediately following this line(s) is
the “Length = “ line.
2. Within each subject sequence there are one or more hits. Each hit starts with a “Score =“ line,
followed by an “Identities = “ line, followed (for blastn but not blastp) with a “Strand = “ line.
3. Within each hit there are groups of 3 lines representing the aligned sequence, including the
coordinates of the start and end position of each line. There can be several such groups, for long
hits.
1. Parsing New Subject Sequence
Information
•Each new subject sequence
starts with a line having the
entire FASTA comment line on
it (i.e. a “>” in the first column),
followed by zero or more
additional comment lines, and
ends with a “Length = “ line.
•The useful pieces of
information here are the
comment line itself, the unique
identifier for this subject
sequence (part of the comment
line), and the length.
•Because these sections have
easily recognized start and end
points, you can deal with them
as a unit, inputting new lines
separately from the main while
loop.
my ($subject_ident, $comment_line, $subject_length);
my %hits; # to store all data from this file;
while (<INFILE>) {
# code for processing other sections
if (/^>/) { # finds the “>” in the first column of the line
$subject_length = 0; # reset to 0
$comment_line = $_;
while (1) { # endless loop to get all of comment line
my $next_line = <INFILE>;
if ($next_line =~ /Length = (\d+)/ ) {
$subject_length = $1;
last; # break out of loop
}
else { # continuation of comment line
chomp $next_line;
$next_line =~ s/^\s+/ /; # remove leading
$next_line =~ s/\s+$/ /; # and trailing multiple spaces
$comment_line .= $next_line;
} # end else
} # end while(1) loop
$comment_line =~ />(\S+)\s/;
$subject_ident = $1;
$hits{$subject_ident}{subject_length} = $subject_length;
$hits{$subject_ident}{comment_line} = $comment_line;
} # end subject sequence info section
# additional processing of BLAST output file
} # end while <INFILE> loop
2. Parsing Hit Information
•
Each hit starts with the same 3 lines (2 for blastp), Score, Identities, and Strand. If you detect the Score
line, you can get the next two lines without using the overall while loop:
my ($e_value, $identities, $gaps, $aligned_length, $orientation);
while (<INFILE>) {
if (/Score = /) {
my $score_line = $_;
my $identities_line = <INFILE>;
my $strand_line = <INFILE>;
# now process the information in these 3 lines
$score_line =~ /Score.*Expect = ([-.e\d]+)/; # note the regex here!
$e_value = $1;
$identities_line =~ /Identities = (\d+)\/(\d+)/; # note the escaped slash
($identities, $aligned_length) = ($1, $2);
$gaps = 0; # set a default
if ($identities_line =~ /Gaps = (\d+)/ ) { # Gaps are not always present
$gaps = $1;
}
$orientation = 'Plus';
if ($strand_line =~ /Plus \/ (Plus|Minus)/) { # only for blastn
$orientation = $1;
}
push @{ $hits{$subject_ident}{hit_data} }, [$e_value, $identities, $gaps, $aligned_length, $orientation];
} # end if /Score/
} # end while <INFILE>
3. Parsing the Aligned Sequences
•
•
•
•
This section is the hardest to deal with, because the end point is not clearly delimited.
We solve this problem by noting that there are 3 ways the aligned sequence section can end: it can be
followed by another hit (a Score = line), by a new subject (a line with > in the first column) or by the
statistical information at the end of the file (identified by “Lambda” in the first column).
So, near the beginning of the overall while loop we will insert a section that recognizes any of these
patterns, and writes the aligned sequence information to the array of information from the previous hit.
However, note that this should only be done if there is a previous hit, that is, not for the first hit in the first
subject sequence. We can detect this because the $subject_ident variable will be undef before the first
subject line is processed, and there will be no $hits{$subject_ident}{hit_data}.
my ($query_start, $query_end, $subject_start, $subject_end);
while (<INFILE>) {
if ($subject_ident && exists $hits{$subject_ident}{hit_data} && (/^>/ || /Score =/ || /^Lambda/ ) ) { # but
not for the first hit of the first subject
push @{ $hits{$subject_ident}[-1] }, $query_start, $query_end, $subject_start, $subject_end;
last if (/^Lambda/); # end of hit processing
$query_start = $query_end = $subject_start = $subject_end = 0; # reset
}
# process the file
} # end while <INFILE>
3. Parsing Aligned Sequences: 2
•
•
•
The aligned sequences come in groups of 3 lines, Query, the match line, and Subject. It is easiest to
process each group of 3 as a unit.
We will gather the start and end positions for both query and subject. It is of course possible to collect
more detailed information about the sequences if desired.
There will be one or more aligned sequence units. Since the start and end points are set to 0 at the
beginning of whatever section follows, the start positions are only modified if their variables are at 0.
while (<INFILE>) {
# previous code
if (/^Query/) {
my $query_line = $_;
my $match_line = <INFILE>;
my $subject_line = <INFILE>;
$query_line =~ /Query:\s*(\d+)\D+(\d+)\s*$/;
$query_start = $1 unless $query_start;
$query_end = $2;
$subject_line =~ /Sbjct:\s*(\d+)\D+(\d+)\s*$/; # note wierd spelling
$subject_start = $1 unless $subject_start;
$subject_end = $2;
} # end /^Query/
# code for other sections
} # end while <INFILE>
Printing Out the Information
•
•
•
After the end of the while <INFILE> loop.
Very simple format.
This whole program could be a subroutine, with a reference to the %hits hash
returned.
foreach my $subject_ident (sort keys %hits) {
print "$subject_ident:\n";
print " $hits{$subject_ident}{comment_line}\n";
print " Length = $hits{$subject_ident}{subject_length}\n";
foreach my $hit ( @{$hits{$subject_ident}{hit_data}} ) {
print "@$hit\n";
}
print "\n";
}