Introduction to Perl
Giorgos Georgakilas
• Graduated from C.E.I.D.
• M.Sc. degree in ITMB
• Ph.D. student in DIANA-Lab
[email protected]
Regular Expressions: Matching literals
• Simple match
my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
if ($string =~ m/atcg/) {    # use !~ to check that it does NOT match
    print "match\n";
}
• Loop for multiple matching
my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
my @motifs = qw(att cgg tgg aaa);
for my $trimer (@motifs) {
    if ($string =~ m/$trimer/) {
        print "$trimer is found\n";
    }
}
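The loop above can also be written as a small helper that returns the motifs it found. This is only a sketch: the sub name motifs_found and the extra motif 'tag' are assumptions added for illustration (none of the four original motifs occurs in the sequence).

```perl
use strict;
use warnings;

# Return the motifs from @$motifs that occur literally in $string.
# \Q...\E keeps any regex metacharacters in a motif from being interpreted.
sub motifs_found {
    my ($string, $motifs) = @_;
    return grep { $string =~ m/\Q$_\E/ } @$motifs;
}

my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
my @motifs = qw(att cgg tgg aaa tag);    # 'tag' is added here so that one motif matches

my @found = motifs_found($string, \@motifs);
print "found: @found\n";

# !~ is true when the pattern does NOT match
print "no 'aaa' in the sequence\n" if $string !~ m/aaa/;
```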
Regular Expressions: Wildcards
• In most cases we don’t know exactly what to look for
Symbol     Meaning
.          Any character except newline, including spaces
\d         Any digit
\D         Any non-digit
\w         Any alphanumeric character or underscore
\W         Any character that is neither alphanumeric nor underscore
\s         Any whitespace
[atgc]     Any character inside the square brackets
[^atgc]    Any character NOT inside the square brackets
• A simple example of wildcard use
my $seq = "atacgatmcagct";
if ($seq =~ /[^atcg]/) {
print "\$seq contains non atcg characters\n";
}
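Beyond a yes/no answer, it is often useful to know which unexpected characters occur and where. A possible sketch follows; the sub name invalid_bases is hypothetical, and positions are counted from 0.

```perl
use strict;
use warnings;

# Return every character of $seq that is not a, t, c or g (case-insensitive),
# each paired with its 0-based position.
sub invalid_bases {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /([^atcg])/gi) {
        # pos() is the offset just after the match, so subtract 1
        push @hits, [ $1, pos($seq) - 1 ];
    }
    return @hits;
}

my $seq = "atacgatmcagct";
for my $hit (invalid_bases($seq)) {
    my ($char, $position) = @$hit;
    print "unexpected '$char' at position $position\n";
}
```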
Regular Expressions: Quantifiers & Anchors
• Special characters that reflect quantity in string matching
Symbol    Matches                  Example
?         0 or 1 times             tc?t matches tct and tt
+         1 or more times          tc+t matches tct, tccccct, etc
*         0, 1 or more times       tc*t matches tt, tct, tcct, etc
{3}       exactly 3 times          tc{3}t matches tccct only
{3,}      3 or more times          tc{3,}t matches tccct, tcccct, etc
{0,3}     up to 3 times            tc{0,3}t matches tt, tct, tcct, tccct only
{2,3}     between 2 and 3 times    tc{2,3}t matches tcct, tccct only
Note: the short form {,3} is only treated as a quantifier by Perl 5.34 and newer; {0,3} works everywhere
• Anchors: ^ matches at the start of the string, $ matches at the end
• A simple example of anchors and quantifiers
my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
if ($string =~ /^\w+:\s[atcg]+$/) {
    print "sequence OK\n";
}
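One way to get a feeling for the quantifiers is to run them against a few test strings. In the sketch below each pattern is wrapped in ^ and $, so the whole string has to match; the hash name %quantified and the list of test strings are made up for illustration.

```perl
use strict;
use warnings;

# Each pattern is anchored with ^ and $ so that the whole string must match,
# not just a part of it.
my %quantified = (
    'tc?t'     => qr/^tc?t$/,
    'tc+t'     => qr/^tc+t$/,
    'tc*t'     => qr/^tc*t$/,
    'tc{3}t'   => qr/^tc{3}t$/,
    'tc{2,3}t' => qr/^tc{2,3}t$/,
);

for my $name (sort keys %quantified) {
    my @matched = grep { $_ =~ $quantified{$name} } qw(tt tct tcct tccct tcccct);
    print "$name matches: @matched\n";
}
```

Without the anchors, tc?t would also "match" tcct, because the regex engine is free to match only a part of the string.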
Regular Expressions: Pattern Modifiers
• Special characters that modify the regex interpretation
Modifier    Meaning
i           Makes the match case-insensitive
x           Ignores literal whitespace in the pattern & permits comments
s           Allows . to match a newline
m           Lets ^ and $ match next to a \n within a multiline string
g           Allows multiple (global) matches
e           Evaluates the replacement part of a substitution (s///) as Perl code
• An example of i modifier
"aTGCGAGct" =~ /atgc/i; #will match
Regular Expressions: Pattern Modifiers (continued)
• An example of x modifier
$string =~ /sequence(\d+)    # match the id
            :\s*             # colon and variable whitespace
            (\w+)            # the sequence
           /x;
• An example of s modifier
my $string = "actg\ntcag";
$string =~ /ac.+ag/s;
• An example of m modifier
my $string = "actg\ntcat";
$string =~ /g$/; #doesn't match
$string =~ /g$/m; #matches
• An example of g modifier
my $dna = 'atgctagtctagcgatgcatgtgttgtgcgtatgtga';    # avoid naming a variable $a: sort uses $a and $b
my @matches = $dna =~ /t\wg/g;    # in list context, g returns all matches
• An example of e modifier
my $pattern = "atgc";
$string =~ s/($pattern)/reverse($1)/e;    # e only works in substitutions: the replacement is run as code
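Modifiers can be combined. As a small sketch, the following counts how often a motif occurs regardless of case by combining g and i; the sub name count_motif and the test sequence are hypothetical.

```perl
use strict;
use warnings;

# Count non-overlapping occurrences of a motif, ignoring case.
# In list context /g returns all matches; assigning them to the empty list "()"
# and evaluating that assignment in scalar context yields their number.
sub count_motif {
    my ($seq, $motif) = @_;
    my $count = () = $seq =~ /\Q$motif\E/gi;
    return $count;
}

my $seq = "ATGcatgATGaatg";
print "atg occurs ", count_motif($seq, "atg"), " times\n";
```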
Regular Expressions: Capturing text
• Backreferences
$string =~ /(\d+)\s(\w+)\s(\d{2,3})\s(\w*)/;
$1, $2, $3, $4 hold the text captured by each pair of parentheses; $+ holds the highest-numbered group that matched; $& holds the whole match
• Match indices
@- match start indices; $-[0], $-[1] … (index of the first matching character)
@+ match end indices; $+[0], $+[1] … (index of the last matching character + 1)
• Leftovers (:-O)
$' holds the part of the string after the regex match
$` holds the part of the string before the regex match
• Attention!
Do not overuse these variables ($&, $` and $' in particular), or performance will suffer
They are read-only and persist until the next successful regex match
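Because the capture variables are overwritten by the next successful match, a common habit is to copy them into named variables right away. The sketch below does that and also returns the offsets from @- and @+; the sub name parse_record and the input string are invented for the example.

```perl
use strict;
use warnings;

# Return (id, bases, start offset, end offset) for a "sequenceN: bases" string,
# or an empty list if the string does not have that shape.
sub parse_record {
    my ($string) = @_;
    return unless $string =~ /sequence(\d+):\s*(\w+)/;
    # $-[2] and $+[2] are the start and end offsets of the second group
    return ($1, $2, $-[2], $+[2]);
}

my $string = "sequence12: atcgtagc";
my ($id, $bases, $start, $end) = parse_record($string);
print "id=$id bases=$bases\n";

# substr with the offsets gives the same text as $2 did
print "via offsets: ", substr($string, $start, $end - $start), "\n";
```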
Regular Expressions: Substitutions / s
• s/PATTERN/REPLACEMENT/
my $dna = "atgctagtctagcgatgcatgtgttgtgcgtatgtga";
• replace 1st 'a' with 't'
$dna =~ s/a/t/;
print "$dna\n";
• replace all 'a's with 't's - a global substitution
$dna =~ s/a/t/g;
• replace with e modifier
my $reversed;
($reversed = $dna) =~ s/(\w+)/reverse($1)/eg;
print "\$dna is $dna, \$reversed is $reversed\n";
    $dna remains intact and the substitution happens in $reversed
$reversed = $dna =~ s/(\w+)/reverse($1)/eg;
print "\$dna is $dna, \$reversed is $reversed\n";
    The substitution happens in $dna and $reversed gets the status (the number of substitutions made)
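The difference between modifying a copy and receiving the status can be tried out on a short string. This is a sketch under the assumption that Perl 5.14 or newer is available, since the last statement uses the r modifier (return the modified copy), which the slides do not cover.

```perl
use strict;
use warnings;

my $dna = "atgc tagt";

# Copy, then modify the copy: $dna is left untouched
(my $reversed = $dna) =~ s/(\w+)/reverse($1)/eg;

# s/// itself returns the number of substitutions made (a false value if none)
my $copy  = $dna;
my $count = ($copy =~ s/a/t/g);

# With Perl 5.14 or newer, r returns the modified copy instead of the count
my $upper = $dna =~ s/(\w+)/uc($1)/egr;

print "$dna | $reversed | $count | $upper\n";
```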
Regular Expressions: Substitutions / tr
• tr/CHARACTERS/REPLACEMENTS/
my $dna = "atgctagtctagcgatgcatgtgttgtgcgtatgtga";
• replace all 'a's with 't's
$dna =~ tr/a/t/;
• replace all 'a's and 'g's with 't's
$dna =~ tr/ag/t/;
• mind the number of replacement characters
$dna =~ tr/a/tg/;    # 'a's will be replaced by 't'; the extra 'g' is ignored
• no need to use the g modifier!
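Two typical uses of tr on sequences are counting characters and complementing bases. A minimal sketch; the sub names gc_count and complement are hypothetical.

```perl
use strict;
use warnings;

# tr returns the number of characters it matched. With an empty replacement
# list the string is left unchanged, so this only counts g and c.
sub gc_count {
    my ($dna) = @_;
    return ($dna =~ tr/gcGC//);
}

# Complement every base in one pass. Each character is translated exactly once,
# so a->t and t->a do not interfere with each other.
sub complement {
    my ($dna) = @_;
    (my $comp = $dna) =~ tr/atgcATGC/tacgTACG/;
    return $comp;
}

my $dna = "atgctagt";
print "GC count: ", gc_count($dna), "\n";
print "complement: ", complement($dna), "\n";
```

Doing the same with four s///g substitutions would go wrong: after replacing every a with t, the original t's can no longer be told apart from the new ones.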
Handling Files: I/O
• open(FILEHANDLE, MODE, FILENAME) or <error handling>;
open(FILE, ">", "/home/username/file_to_write.txt") or die "$!\n";
">"     create/overwrite content
"<"     read from (the default mode if none is given)
">>"    create/append content
• reading from file (the file must have been opened with "<")
my $first_line = <FILE>;     # scalar context: one line per read
my $second_line = <FILE>;
my @all_lines = <FILE>;      # list context: all remaining lines
while (my $line = <FILE>) {
    chomp $line;    # removes the trailing newline (chomp, not chop!!)
    …
}
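The reading loop can be tried without any file on disk, because open also accepts a reference to a string as an in-memory file. The sketch below uses a lexical filehandle ($fh) instead of the bareword FILE; the sub name read_lines and the sample text are assumptions made for the demo.

```perl
use strict;
use warnings;

# Read all lines from an already opened filehandle, without trailing newlines.
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

# For a self-contained demo the "file" is an in-memory string;
# with a real file, pass its name instead of \$text.
my $text = ">seq1\natcg\ntagc\n";
open(my $fh, "<", \$text) or die "Cannot open: $!\n";
my @lines = read_lines($fh);
close($fh);

print scalar(@lines), " lines read\n";
```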
Handling Files: I/O (continued)
• parsing the lines
while (my $line = <FILE>) {
    chomp $line;    # removes the trailing newline (chomp, not chop!!)
    my @tempLine = split(/\t/, $line);                # with split (!!)
    my ($word, $number) = $line =~ /(\w+)\t(\d+)/;    # with plain regex
}
• writing to a file
open(OUTFILE, ">", "out_file_name.txt") or die "$!\n";
print OUTFILE "This will be printed in the file\n";
• appending to a file
open(OUTFILE, ">>", "new_or_existing_file_name.txt") or die "$!\n";
print OUTFILE "This will be printed in the end of the existing file\n";
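Writing, appending and parsing can be combined into one round trip. The following sketch uses the core module File::Temp to obtain a scratch file, so it does not touch real data; the gene names, the numbers and the hash %length_of are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# tempfile() creates a scratch file that is removed when the program exits
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# ">" creates or overwrites the file
open(my $out, ">", $path) or die "Cannot write $path: $!\n";
print $out "geneA\t120\n";
close($out);

# ">>" adds to the end of the existing content
open(my $append, ">>", $path) or die "Cannot append to $path: $!\n";
print $append "geneB\t340\n";
close($append);

# Read the file back and split each line on the tab character
my %length_of;
open(my $in, "<", $path) or die "Cannot read $path: $!\n";
while (my $line = <$in>) {
    chomp $line;
    my ($name, $length) = split(/\t/, $line);
    $length_of{$name} = $length;
}
close($in);

print "$_ => $length_of{$_}\n" for sort keys %length_of;
```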
Biology basics
DNA: The looks (Double Helix)
DNA: The Chemistry
DNA: The Dogma
Uncovering the code
DNA => Proteins
• Scientists conjectured that proteins came from DNA, but how did DNA code for proteins?
• If one nucleotide coded for one amino acid, there could be only 4^1 = 4 amino acids
• However, there are 20 amino acids, so at least 3 bases must code for one amino acid, since 4^2 = 16 is too few and 4^3 = 64 is enough
• This triplet of bases is called a “codon”
• 64 different codons and only 20 amino acids mean that the coding is degenerate: more than one codon can code for the same amino acid
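The counting argument is easy to check in Perl. A short sketch that prints the number of combinations for 1 to 3 bases and then enumerates all triplets; the variable names are arbitrary.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);

# Number of distinct "words" of a given length over a 4-letter alphabet: 4**length
for my $length (1 .. 3) {
    printf "%d base(s): %d combinations\n", $length, scalar(@bases) ** $length;
}

# Enumerate all triplets explicitly
my @codons;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @codons, "$first$second$third";
        }
    }
}
print scalar(@codons), " codons, from $codons[0] to $codons[-1]\n";
```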
Central Dogma Revisited
• In going from DNA to proteins, there is an intermediate step: mRNA is made from DNA, and protein is then made from the mRNA
• Why the intermediate step? DNA is kept in the nucleus, while protein synthesis happens in the cytoplasm, with the help of ribosomes
Genetic Code
Revisiting the Code
(Open?) Reading Frames
Reading Frames
• Since nucleotide sequences are “read” three bases at a time, there are three possible “frames” in which a given nucleotide sequence can be “read” (in the forward direction)
• Taking the complement of the sequence and reading in the reverse direction gives three more reading frames
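The six reading frames can be sketched with tr, reverse and a global match. The sub names reverse_complement and codons_in_frame are hypothetical, and the sketch assumes the sequence contains only the letters A, C, G and T without whitespace.

```perl
use strict;
use warnings;

# Reverse the sequence, then complement each base.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;    # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Split a sequence into codons, starting at offset 0, 1 or 2.
# Trailing bases that do not fill a whole codon are dropped.
sub codons_in_frame {
    my ($dna, $frame) = @_;
    my $tail   = substr($dna, $frame);
    my @codons = $tail =~ /(.{3})/g;
    return @codons;
}

my $dna = "ATGAAATAG";
for my $strand ($dna, reverse_complement($dna)) {
    for my $frame (0 .. 2) {
        print "$strand frame $frame: @{[ codons_in_frame($strand, $frame) ]}\n";
    }
}
```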
Open Reading Frames
• Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to absence of stop codons)
• Prerequisite: A specific genetic code
• Definition: (start codon) (amino acid coding codon)^n (stop codon), i.e. n coding codons between the start and the stop codon
• Note: Not all ORFs are actually used
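The definition translates almost literally into a regular expression. The sketch below assumes the standard genetic code (start ATG; stops TAA, TAG, TGA), upper-case DNA and the forward strand only. The names $orf_re and find_orfs are made up, and because the search resumes after each match, overlapping ORFs are not reported.

```perl
use strict;
use warnings;

# start codon, any number of whole codons (as few as possible), stop codon.
# The non-greedy *? makes the match end at the first in-frame stop codon.
my $orf_re = qr/ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)/;

# Return one hash per ORF with its 0-based start, its end (exclusive) and its sequence.
sub find_orfs {
    my ($dna) = @_;
    my @orfs;
    while ($dna =~ /($orf_re)/g) {
        push @orfs, { start => $-[1], end => $+[1], seq => $1 };
    }
    return @orfs;
}

for my $orf (find_orfs("CCATGAAATGGTAACC")) {
    print "ORF $orf->{seq} at $orf->{start}..$orf->{end}\n";
}
```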
Exercise: Regexps!!
Objective 1
• Open the file named exercise_1_random_sequence.dat with Perl.
• Find its reverse complement!
Objective 2
• Open the file (yersinia_genome.fasta) with the complete Yersinia genome and find the possible start and end positions of its genes!
Tips:
• The beginning of each gene is mapped by the following pattern: an 8-letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG), followed by 4-10 bases downstream, before the initiation codon (ATG). However, there are variants of the Shine-Dalgarno sequence, the most common of which is [TA][AC]AGGA[GA][GA].
• The end of the gene is specified by one of the stop codons TAA, TAG and TGA. Take care that the stop codon is found in the correct Open Reading Frame (ORF).
• Don’t forget to check the reverse complement!
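As a starting point for Objective 2, the first tip could be turned into a pattern along these lines. This is only a sketch, not the intended solution: the names $start_re and gene_starts and the test sequence are invented, and the sketch assumes upper-case DNA held in a single string (FASTA header and line breaks already removed).

```perl
use strict;
use warnings;

# Shine-Dalgarno variant, then 4-10 bases (as few as possible), then ATG.
# The three groups are captured so that the position of the ATG is available.
my $start_re = qr/([TA][AC]AGGA[GA][GA])([ACGT]{4,10}?)(ATG)/;

# Return the 0-based position of every ATG that follows a Shine-Dalgarno variant.
sub gene_starts {
    my ($dna) = @_;
    my @starts;
    while ($dna =~ /$start_re/g) {
        push @starts, $-[3];    # start offset of the third group, the ATG
    }
    return @starts;
}

# Made-up test sequence: CC + TAAGGAGG + 6 spacer bases + ATG + AAA
my $dna = "CCTAAGGAGGACGTACATGAAA";
print "possible gene start at position $_\n" for gene_starts($dna);
```

From each reported position the search for an in-frame stop codon can continue, and the whole procedure can be repeated on the reverse complement.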
Web Sources for Perl

www.perl.com

www.perldoc.com

www.perl.org

www.perlmonks.org