part 4
Arrays:
•Stacks
•foreach command
Regular expressions:
•String structure analysis, substring extraction and substitution
Command line arguments:
•@ARGV array
Functions/Subroutines:
•Repetitive use of functional blocks
Modules in Perl:
•How to use/share libraries of functions
Error messages:
•How to stop a program when an error occurs
•die statement
Arrays as a “FIRST-COME … LAST-SERVED” storage
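@a = (7,-1,2,4,5);    # array of 5 numbers
[Figure: a jar holding the 5 numbers, stacked bottom to top as 7, -1, 2, 4, 5. push drops a number on top; pop takes the top number (5) off.]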
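# empty array
@a = ();
# store numbers
push @a, 7;
push @a, -1;
push @a, 2;
push @a, 4;
push @a, 5;
# take the most recently stored number back out
$lastNumber = pop @a;
# \@a is escaped so that the array name is printed, not its contents
print "last number stored in \@a was $lastNumber\n";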
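The last-stored, first-returned behaviour can be checked by popping until the array is empty. A minimal sketch; the names @stack and @order are illustrative and not from the slides:

```perl
use strict;
use warnings;

# Fill the stack in the same order as in the example above.
my @stack = ();
push @stack, $_ for (7, -1, 2, 4, 5);

# Pop until empty: values come back in reverse order of insertion.
my @order = ();
while (@stack) {
    push @order, pop @stack;
}
print "popped: @order\n";   # popped: 5 4 2 -1 7
```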
When are push/pop commands useful?
[Figure: sample input files, one with a number on each line (1, 18, 23, 2) and one with text:]
Finding potential regulatory elements in
noncoding regions of the human genome is
a challenging problem. Analyzing novel
sequences for the presence of known
transcription factor binding sites or their
weight matrices produces a huge number of
#!/usr/local/bin/perl
# storing file data
@fileLines = ();
open (INP, "< data.txt") or die "Cannot open data.txt: $!\n";
while ($line = <INP>) {
    chomp($line);
    push @fileLines, $line;
}
close(INP);
# calculating number of lines in the file
# ($#fileLines is the index of the last element)
$nLines = $#fileLines + 1;
print "There are $nLines lines in data.txt file\n";
# printing out data.txt file content
foreach $line (@fileLines) {
    print "$line\n";
}
@a = (1..6);
foreach $d (@a) {
    print "$d ";
}
print "\n";
Output: 1 2 3 4 5 6
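$#fileLines is the index of the last element, so the line count is one more than that; evaluating the array in scalar context gives the same number. A small sketch that reads from an in-memory string instead of data.txt, so it runs without any file (the sample text is made up):

```perl
use strict;
use warnings;

# Hypothetical stand-in for data.txt: a string opened as a file.
my $text = "first line\nsecond line\nthird line\n";
open(my $inp, "<", \$text) or die "Cannot open in-memory file: $!\n";

my @fileLines = ();
while (my $line = <$inp>) {
    chomp($line);
    push @fileLines, $line;
}
close($inp);

my $byLastIndex = $#fileLines + 1;     # last index + 1
my $byScalar    = scalar(@fileLines);  # array in scalar context
print "lines: $byLastIndex (last index + 1), $byScalar (scalar)\n";
```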
Command line arguments
printFile.pl -- a program that prints out the contents of a file
[Figure: two sample files. numbers.txt contains the lines 1, 18, 23, 2, -123; words.txt contains:]
Finding potential regulatory elements in
noncoding regions of the human genome is
a challenging problem. Analyzing novel
sequences for the presence of known
transcription factor binding sites or their
weight matrices produces a huge number of
Commands:
printFile.pl numbers.txt
printFile.pl words.txt
@ARGV -- array of arguments following the program name
For printFile.pl numbers.txt:
@ARGV = ("numbers.txt");
#!/usr/local/bin/perl
# determine file name
$fName = $ARGV[0];
# open, read and print out file
open (INP, "< $fName") or die "Cannot open $fName: $!\n";
while ($line = <INP>) {
    print $line;
}
close(INP);
Example. Print out the N-th line of a file
[Figure: words.txt contains:]
Finding potential regulatory elements in
noncoding regions of the human genome is
a challenging problem. Analyzing novel
sequences for the presence of known
transcription factor binding sites or their
weight matrices produces a huge number of
Command: printFile.pl words.txt 3
Output: a challenging problem. Analyzing novel
#!/usr/local/bin/perl
# determine file name, and line index
$fName = $ARGV[0];
$lineNo = $ARGV[1];
# open and read file
open (INP, "< $fName") or die "Cannot open $fName: $!\n";
while ($line = <INP>) {
    push @fileLines, $line;
}
close(INP);
# print out N-th line (array indices start at 0, line numbers at 1)
print $fileLines[ $lineNo-1 ];
Error messages
How do we stop a program cleanly and report what went wrong?
Example problem:
The program should be run with 2 arguments:
printFile.pl words.txt 3
but the user supplies only 1:
printFile.pl 3
The program should stop and report the error.
#!/usr/local/bin/perl
# check whether we've got 2 arguments or not
# ($#ARGV is the last index, so it equals 1 when there are 2 arguments)
if ($#ARGV != 1) {
    # die prints out a message and stops the program
    die "Error. Incorrect number of arguments\n";
}
...
Stop when the line number is invalid:
...
if ($ARGV[1] <= 0) {
    die "Error. Incorrect line number: $ARGV[1]\n";
}
...
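The two checks can be gathered into one subroutine. The sketch below is only one way to do it: the name checkArgs is hypothetical, the digits-only test on the line number is an addition not shown above, and the argument list is passed in as parameters rather than read from @ARGV so that the sketch runs without a command line.

```perl
use strict;
use warnings;

# Hypothetical helper: validate an argument list shaped like @ARGV
# and return (file name, line number), or die with a message.
sub checkArgs {
    my @args = @_;
    if (@args != 2) {
        die "Error. Incorrect number of arguments\n";
    }
    my ($fName, $lineNo) = @args;
    # Assumption: only positive whole numbers count as line numbers.
    if ($lineNo !~ /^\d+$/ || $lineNo <= 0) {
        die "Error. Incorrect line number: $lineNo\n";
    }
    return ($fName, $lineNo);
}

# eval catches each die so that all three cases can be shown in one run;
# the error message ends up in $@.
my @results = ();
foreach my $try (["words.txt", 3], ["3"], ["words.txt", "abc"]) {
    my @ok = eval { checkArgs(@$try) };
    push @results, $@ ? "failed: $@" : "ok: @ok\n";
}
print @results;
```

In a real script the call would simply be checkArgs(@ARGV) without eval, so that the first failed check stops the program.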
Defining new functions and commands
A function is a “mini computer” inside a program:
it gets input data and produces output results
[Figure: INPUT "2 Hello 3 4 7 Everybody 33 57" --> FUNCTION (filtering out numbers) --> OUTPUT "Hello Everybody"]
Defining a min function, which returns
the minimum of 2 numbers:
$x = min(5,3);
print "Smallest of 5 and 3 is: $x\n";
# Function min
sub min {
    # INPUT parameters arrive in @_
    # ($a and $b are reserved for sort, so other names are used)
    my ($first, $second) = @_;
    my $small;
    if ($first < $second) {
        $small = $first;
    } else {
        $small = $second;
    }
    return $small;
}
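The "filtering out numbers" function from the picture can be written as a subroutine in the same way. This is only one possible sketch: the name filterOutNumbers is made up, it assumes that a "number" means an optionally signed run of digits, and it relies on a pattern match of the kind covered under Regular expressions.

```perl
use strict;
use warnings;

# Return the input items that are not whole numbers.
sub filterOutNumbers {
    my @items = @_;
    my @kept = ();
    foreach my $item (@items) {
        # Keep the item unless it looks like an integer such as 7 or -123.
        if ($item !~ /^-?\d+$/) {
            push @kept, $item;
        }
    }
    return @kept;
}

my @output = filterOutNumbers(2, "Hello", 3, 4, 7, "Everybody", 33, 57);
print "@output\n";   # Hello Everybody
```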
Regular expressions
$string1 = "Total: 576 genes, 2763 exons, some introns";
How to extract the 2 numbers?
$string2 = "human -G-ACT---TTGC------AA----A---A----";
How to extract just the DNA sequence?
Special symbols that stand for groups of characters of a common type
(called patterns):
\s    Match a whitespace character
\S    Match a non-whitespace character
\d    Match a digit character
\D    Match a non-digit character
^     Match the beginning of the line
.     Match any character (except newline)
$     Match the end of the line
\t    Tabulation symbol (HT, TAB)
\n    Newline (LF, NL)
Grouping options:
*     Match 0 or more times
+     Match 1 or more times
[]    Character class
Substitutions with patterns (each one below is applied to the original $string):
$string = "Total: 576 genes, 2763 exons, some introns";
$string =~ s/\d+/some/g;
--> "Total: some genes, some exons, some introns"
$string =~ s/\s+/#/g;
--> "Total:#576#genes,#2763#exons,#some#introns"
$string =~ s/\D+/\*/g;
--> "*576*2763*"
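The two questions asked at the start of this section can be answered with these building blocks. A possible sketch (the names @numbers and $dna are illustrative): a global match in list context collects every run of digits, and two substitutions strip the species label and then the gap characters. Dropping the "-" gaps is an assumption about what "just the DNA sequence" means.

```perl
use strict;
use warnings;

my $string1 = "Total: 576 genes, 2763 exons, some introns";
my $string2 = "human -G-ACT---TTGC------AA----A---A----";

# /g in list context returns all matches of \d+.
my @numbers = ($string1 =~ /\d+/g);
print "numbers: @numbers\n";     # numbers: 576 2763

# Remove the leading label and the whitespace after it,
# then delete every gap character.
my $dna = $string2;
$dna =~ s/^\S+\s+//;
$dna =~ s/-//g;
print "sequence: $dna\n";        # sequence: GACTTTGCAAAA
```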
Localizing substrings:
[Figure: contents of alignment.blast. Position rulers (10, 20, ... 120) and the rows of match bars between the sequences are omitted here.]
human -G-ACT---TTGC------AA----A---A-----CG-----G-AT-------TGGG--
mouse TGAACTCAAGTGCTATTTTAATTCCATTCATTCTCCGTGGCTGCATCAGGGCCTGGGGCT
human ---------------C----GG------GA-------TG-AG--AGG------------
mouse CTACCTCCTGACAAACATTTGGTCTCTAGAAGGCTTCTGAAGTTAGGCAAGTCTGAAAAT
How to extract only the lines starting with ‘mouse’?
while ($line = <INP>) {
    if ($line =~ /^mouse/) {
        print $line;
    }
}
Output:
mouse TGAACTCAAGTGCTATTTTAATTCCATTCATTCTCCGTGGCTGCATCAGGGCCTGGGGCT
mouse CTACCTCCTGACAAACATTTGGTCTCTAGAAGGCTTCTGAAGTTAGGCAAGTCTGAAAAT
Obtaining substrings after localization:
[Figure: the same alignment.blast file as in the previous example]
How to extract the human and mouse sequences?
/...(xxx)...(xxx)../ -- substrings enclosed in parentheses
are available after a successful match as the variables $1, $2, ...
$humanSeq = "";
$mouseSeq = "";
while ($line = <INP>) {
    # \s+ accepts any amount of spacing between the label and the sequence
    if ($line =~ /^mouse\s+(\S+)$/) {
        $mouseSeq .= $1;
    } elsif ($line =~ /^human\s+(\S+)$/) {
        $humanSeq .= $1;
    }
}
print "Human sequence: $humanSeq\n";
print "Mouse sequence: $mouseSeq\n";
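To try the extraction loop without an alignment file, the input can be read from a string. A minimal sketch under stated assumptions: the alignment text below is a made-up, shortened stand-in for alignment.blast, and \s+ is used between the label and the sequence so that any amount of spacing is accepted.

```perl
use strict;
use warnings;

# Made-up, shortened stand-in for alignment.blast.
my $alignment = "human -G-ACT\n"
              . "       | |||\n"
              . "mouse TGAACT\n"
              . "human --TTGC\n"
              . "        ||||\n"
              . "mouse CATTGC\n";
open(my $inp, "<", \$alignment) or die "Cannot open in-memory file: $!\n";

my $humanSeq = "";
my $mouseSeq = "";
while (my $line = <$inp>) {
    if ($line =~ /^mouse\s+(\S+)$/) {
        $mouseSeq .= $1;
    } elsif ($line =~ /^human\s+(\S+)$/) {
        $humanSeq .= $1;
    }
    # Lines with match bars fit neither pattern and are skipped.
}
close($inp);

print "Human sequence: $humanSeq\n";   # Human sequence: -G-ACT--TTGC
print "Mouse sequence: $mouseSeq\n";   # Mouse sequence: TGAACTCATTGC
```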
Modules:
Perl does not have a built-in function for every task, but most
of the missing functions have already been written by other
people, who share them as libraries of functions
called modules
The use X; command indicates that functions from module X
should be used
Perl does not know how to create pictures,
use GD; -- now it knows
How to communicate with databases?
use DBI;
How to do DNA sequence analysis?
use Bio::Perl;   (part of the BioPerl distribution)
How to extract command line options?
use Getopt::Long;
http://cpan.org/ -- storage of Perl modules
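As an illustration of using a module, the sketch below parses named options with Getopt::Long, which ships with Perl. The option names --file and --line are hypothetical, and the arguments are taken from a local array instead of @ARGV so that the sketch runs without a command line.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Stand-in for @ARGV.
my @args = ("--file", "words.txt", "--line", "3");

my $fName  = "";
my $lineNo = 1;    # default when --line is not given
GetOptionsFromArray(
    \@args,
    "file=s" => \$fName,     # --file takes a string
    "line=i" => \$lineNo,    # --line takes an integer
) or die "Error. Incorrect command line options\n";

print "file: $fName, line: $lineNo\n";   # file: words.txt, line: 3
```

In a script run from the command line, GetOptions with the same option specification reads @ARGV directly.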