Introduction to Perl

Giorgos Georgakilas
• Graduated from C.E.I.D.
• M.Sc. degree in ITMB
• Ph.D. student in DIANA-Lab

Regular Expressions: Matching literals

• Simple match
    my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
    if ($string =~ m/atcg/) {    # use !~ to check that it does NOT match
        print "match\n";
    }

• Loop for multiple matching
    my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
    my @motifs = qw(att cgg tgg aaa);
    for my $trimer (@motifs) {
        if ($string =~ m/$trimer/) {
            print "$trimer is found\n";
        }
    }

Regular Expressions: Wildcards

• In most cases we don't know exactly what to look for

    Symbol    Meaning
    .         Any character except newline, including spaces
    \d        Any digit
    \D        Any non-digit
    \w        Any alphanumeric character or underscore
    \W        Any character that is neither alphanumeric nor underscore
    \s        Any whitespace
    [atgc]    Any character inside the square brackets
    [^atgc]   Any character NOT inside the square brackets

• A simple example of wildcard use
    my $seq = "atacgatmcagct";
    if ($seq =~ /[^atcg]/) {
        print "\$seq contains non atcg characters\n";
    }

Regular Expressions: Quantifiers & Anchors

• Special characters that reflect quantity in string matching

    Symbol   Matches                 Example
    ?        0 or 1 times            tc?t matches tct and tt
    +        1 or more times         tc+t matches tct, tccccct, etc.
    *        0 or more times         tc*t matches tt, tct, tcct, etc.
    {3}      exactly 3 times         tc{3}t matches tccct only
    {3,}     3 or more times         tc{3,}t matches tccct, tcccct, etc.
    {,3}     up to 3 times           tc{,3}t matches tt, tct, tcct, tccct only
                                     (needs Perl 5.34 or later; write {0,3} on older versions)
    {2,3}    between 2 and 3 times   tc{2,3}t matches tcct, tccct only

• A simple example of anchors and quantifiers
    # ^ anchors the match at the start of the string, $ at the end
    my $string = "sequence1: atcgtagcgtacaggcatgctagctagtcgatc";
    if ($string =~ /^\w+:\s[atcg]+$/) {
        print "sequence OK\n";
    }

Regular Expressions: Pattern Modifiers

• Special characters that modify the regex interpretation

    Modifier   Meaning
    i          Makes the match case-insensitive
    x          Ignores literal whitespace in the pattern & permits comments
    s          Allows . to match a newline
    m          Lets ^ and $ match next to a \n within a multiline string
    g          Finds all matches, not just the first one
    e          Evaluates the replacement part of s/// as Perl code

• An example of i modifier
    "aTGCGAGct" =~ /atgc/i;    # will match

• An example of x modifier
    $string =~ /sequence(\d+)   # match the id
                :\s*            # colon and variable whitespace
                (\w+)           # the sequence
               /x;

• An example of s modifier
    my $string = "actg\ntcag";
    $string =~ /ac.+ag/s;      # matches, because . may cross the \n

• An example of m modifier
    my $string = "actg\ntcat";
    $string =~ /g$/;           # doesn't match
    $string =~ /g$/m;          # matches

• An example of g modifier
    my $dna = 'atgctagtctagcgatgcatgtgttgtgcgtatgtga';
    my @matches = $dna =~ /t\wg/g;    # list context: all matches are returned

• An example of e modifier (only valid with s///)
    my $pattern = "atgc";
    $string =~ s/($pattern)/reverse($1)/e;    # replaces the first atgc with cgta

Regular Expressions: Capturing text

• Backreferences (capture variables)
    $string =~ /(\d+)\s(\w+)\s(\d{2,3})\s(\w*)/;
    $1, $2, $3, $4   text captured by each pair of parentheses
    $+               text of the highest-numbered group that took part in the match
    $&               the whole matched text

• Match indices
    @-   match start indices: $-[0] (whole match), $-[1] (group 1) … (index of the first matching character)
    @+   match end indices: $+[0], $+[1] … (index of the last matching character + 1)

• Leftovers (:-O)
    $'   holds the part of the string after the regex match
    $`   holds the part of the string before the regex match

• Attention!
  Do not overuse them, or performance will suffer
  These variables are read-only and persist until the next successful regex match

Regular Expressions: Substitutions / s

• s/PATTERN/REPLACEMENT/
    my $dna = "atgctagtctagcgatgcatgtgttgtgcgtatgtga";

• replace the 1st 'a' with 't'
    $dna =~ s/a/t/;
    print "$dna\n";

• replace all 'a's with 't's (a global substitution)
    $dna =~ s/a/t/g;

• replace with the e modifier
    # $dna remains intact and the substitution happens in $reversed
    my $reversed;
    ($reversed = $dna) =~ s/(\w+)/reverse($1)/eg;
    print "\$dna is $dna, \$reversed is $reversed\n";

    # Without the parentheses the substitution happens in $dna and
    # $reversed gets the status (the number of substitutions made)
    my $reversed = $dna =~ s/(\w+)/reverse($1)/eg;
    print "\$dna is $dna, \$reversed is $reversed\n";

Regular Expressions: Substitutions / tr

• tr/CHARACTERS/REPLACEMENTS/
    my $dna = "atgctagtctagcgatgcatgtgttgtgcgtatgtga";

• replace all 'a's with 't's
    $dna =~ tr/a/t/;

• replace all 'a's and 'g's with 't's
    $dna =~ tr/ag/t/;

• mind the number of replacement characters
    $dna =~ tr/a/tg/;    # 'a's will be replaced by 't'; the extra 'g' is ignored

• no need to use the g modifier!

Handling Files I/O

• open(FILEHANDLE, MODE, FILENAME) with an error check
    open(FILE, ">", "/home/username/file_to_write.txt") or die "$!\n";

    Mode   Meaning
    ">"    create the file or overwrite its content
    "<"    read from the file (the default if no mode is given, as in the two-argument open)
    ">>"   create the file or append to its content

• reading from a file
    my $first_line  = <FILE>;      # scalar context: reads one line
    my $second_line = <FILE>;      # reads the next line
    my @all_lines   = <FILE>;      # list context: reads all remaining lines
    while (my $line = <FILE>) {    # reads line by line until the end of the file
        chomp $line;               # removes the trailing newline (chomp, not chop!)
        # …
    }

• parsing the lines
    while (my $line = <FILE>) {
        chomp $line;                                 # chomp, not chop!
        my @tempLine = split(/\t/, $line);           # with split (!!)
        my ($temp) = $line =~ /(\w+\t\d+)/;          # with plain regex
    }

• writing to a file
    open(OUTFILE, ">", "out_file_name.txt") or die "$!\n";
    print OUTFILE "This will be printed in the file\n";
    close(OUTFILE);

• appending to a file
    open(OUTFILE, ">>", "new_or_existing_file_name.txt") or die "$!\n";
    print OUTFILE "This will be printed at the end of the existing file\n";
    close(OUTFILE);

Biology basics

The looks: Double Helix DNA
The Chemistry
The Dogma

Uncovering the code: DNA => Proteins

• Scientists conjectured that proteins came from DNA; but how did DNA code for proteins?
• If one nucleotide coded for one amino acid, there would be only 4^1 = 4 amino acids
• However, there are 20 amino acids, so at least 3 bases code for one amino acid, since 4^2 = 16 and 4^3 = 64
• This triplet of bases is called a "codon"
• 64 different codons and only 20 amino acids means that the coding is degenerate: more than one codon can code for the same amino acid

Central Dogma Revisited

• In going from DNA to proteins, there is an intermediate step: mRNA is made from DNA and is then translated into protein
• Why the intermediate step? DNA is kept in the nucleus, while protein synthesis happens in the cytoplasm, with the help of ribosomes

Genetic Code
Revisiting the Code
(Open?) Reading Frames

Reading Frames

• Since nucleotide sequences are "read" three bases at a time, there are three possible "frames" in which a given nucleotide sequence can be "read" (in the forward direction)
• Taking the complement of the sequence and reading in the reverse direction gives three more reading frames

Open Reading Frames

• Concept: Region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to the absence of stop codons)
• Prerequisite: A specific genetic code
• Definition: (start codon) (amino acid coding codon)^n (stop codon), i.e. n coding codons between the start and the stop
• Note: Not all ORFs are actually used

Exercise: Regexps!!

Objective 1
• Open the file named exercise_1_random_sequence.dat with Perl.
• Find its reverse complement!
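One way to approach Objective 1 is to combine reverse with tr, both of which appear earlier in the slides. The following is a minimal sketch, not a reference solution. The helper names reverse_complement and read_sequence are hypothetical, and the sketch assumes the file contains only sequence lines with no header, since the slides do not describe its format.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): reverse complement of a DNA string.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;            # scalar context: reverses the characters
    $rc =~ tr/acgtACGT/tgcaTGCA/;     # complement every base, keeping its case
    return $rc;
}

# Hypothetical helper: joins all lines of a file into one sequence string.
# Assumption: the file holds plain sequence lines only (no FASTA header).
sub read_sequence {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Cannot open $path: $!\n";
    my $seq = "";
    while (my $line = <$fh>) {
        chomp $line;                  # drop the trailing newline
        $seq .= $line;
    }
    close($fh);
    return $seq;
}

print reverse_complement("atgc"), "\n";    # prints gcat
```

For the exercise file itself, the call would look like print reverse_complement(read_sequence("exercise_1_random_sequence.dat")), provided the format assumption holds.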
Objective 2
• Open the file (yersinia_genome.fasta) with the complete Yersinia genome and find the possible start and end positions of its genes!

Tips:
• The beginning of each gene is marked by the following pattern: an 8-letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG), followed by 4-10 bases before the initiation codon (ATG). There are variants of the Shine-Dalgarno sequence, the most common of which is [TA][AC]AGGA[GA][GA].
• The end of the gene is specified by one of the stop codons TAA, TAG or TGA. Take care that the stop codon is found in the correct Open Reading Frame (ORF), i.e. in frame with the initiation codon.
• Don't forget to check the reverse complement!

Web Sources for Perl
• www.perl.com
• www.perldoc.com
• www.perl.org
• www.perlmonks.org
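The tips for Objective 2 can be turned into a small scanner. What follows is a rough sketch under stated assumptions, not a definitive gene finder. The helper name find_candidate_genes is hypothetical, the input is assumed to be one strand already loaded into a single string (FASTA header removed), and only the nearest ATG that lies 4-10 bases after each Shine-Dalgarno variant is considered.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): returns [start, end] pairs,
# 1-based and inclusive, for every Shine-Dalgarno variant that is followed
# by a 4-10 base spacer, an ATG, and later an in-frame stop codon.
sub find_candidate_genes {
    my ($genome) = @_;
    $genome = uc $genome;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @genes;

    # The lookahead keeps the match zero-length, so overlapping
    # candidates are reported as well.
    while ($genome =~ /(?=[TA][AC]AGGA[GA][GA][ACGT]{4,10}?(ATG))/g) {
        my $start = $-[1];    # 0-based index of the A in ATG

        # Walk codon by codon in the frame set by the ATG.
        for (my $pos = $start + 3; $pos + 3 <= length($genome); $pos += 3) {
            my $codon = substr($genome, $pos, 3);
            if ($is_stop{$codon}) {
                push @genes, [ $start + 1, $pos + 3 ];
                last;
            }
        }
    }
    return @genes;
}

# Toy sequence: CC + TAAGGAGG + 6-base spacer + ATG AAA CCC TAA + GG
my @found = find_candidate_genes("CCTAAGGAGGACGTACATGAAACCCTAAGG");
print "start $_->[0], end $_->[1]\n" for @found;    # prints: start 17, end 28
```

To cover the other strand, the same helper can be run on the reverse complement of the genome; a 1-based position p found there corresponds to position length - p + 1 on the forward strand.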