Bioinformatics: Theory and Practice
Tang Jijun
13928761660

Exercise 1
• Ask for a protein file in fasta format
• Ask for an amino acid
• Count the frequency of that amino acid
    TKFHSNAHFYDCWRMLQYQLDMRCMRAISTF
    SPHCGMEHMPDQTHNQGEMCKPRMWQVS
    MNQSCNHTPPFRKTYVEWDYMAKALIAPYTL
    GWLASTCFIW

Exercise 2
• Ask for a DNA file in fasta format
• Convert it to RNA
• Ask for a codon
• Count the frequency of that codon
    TCGTACTTAGAAATGAGGGTCCGCTTTTGCCC
    ACGCACCTGATCGCTCCTCGTTTGCTTTTAAG
    AACCGGACGAACCACAGAGCATAAGGAGAA
    CCTCTAGCTGCTTTACAAAGTACTGGTTCCCT
    TTCCAGCGGGATGCTTTATCTAAACGCAATGA

Subroutine
• Some code needs to be reused
• A good way to organize code
• Called "function" in some languages
• Name
• Return value
• Parameters (passed in @_)

    sub codon2aa {
        my($codon) = @_;
        if    ( $codon =~ /GC./i )         { return 'A' }    # Alanine
        elsif ( $codon =~ /TG[TC]/i )      { return 'C' }    # Cysteine
        elsif ( $codon =~ /GA[TC]/i )      { return 'D' }    # Aspartic Acid
        elsif ( $codon =~ /GA[AG]/i )      { return 'E' }    # Glutamic Acid
        elsif ( $codon =~ /TT[TC]/i )      { return 'F' }    # Phenylalanine
        elsif ( $codon =~ /GG./i )         { return 'G' }    # Glycine
        elsif ( $codon =~ /CA[TC]/i )      { return 'H' }    # Histidine
        elsif ( $codon =~ /AT[TCA]/i )     { return 'I' }    # Isoleucine
        elsif ( $codon =~ /AA[AG]/i )      { return 'K' }    # Lysine
        elsif ( $codon =~ /TT[AG]|CT./i )  { return 'L' }    # Leucine
        elsif ( $codon =~ /ATG/i )         { return 'M' }    # Methionine
        elsif ( $codon =~ /AA[TC]/i )      { return 'N' }    # Asparagine
        elsif ( $codon =~ /CC./i )         { return 'P' }    # Proline
        elsif ( $codon =~ /CA[AG]/i )      { return 'Q' }    # Glutamine
        elsif ( $codon =~ /CG.|AG[AG]/i )  { return 'R' }    # Arginine
        elsif ( $codon =~ /TC.|AG[TC]/i )  { return 'S' }    # Serine
        elsif ( $codon =~ /AC./i )         { return 'T' }    # Threonine
        elsif ( $codon =~ /GT./i )         { return 'V' }    # Valine
        elsif ( $codon =~ /TGG/i )         { return 'W' }    # Tryptophan
        elsif ( $codon =~ /TA[TC]/i )      { return 'Y' }    # Tyrosine
        elsif ( $codon =~ /TA[AG]|TGA/i )  { return '_' }    # Stop
        else {
            print STDERR "Bad codon \"$codon\"!!\n";
            exit;
        }
    }

    #!/usr/bin/perl -w
    print "Please type the filename: ";
    $dna_filename = <STDIN>;
    chomp $dna_filename;
    open(DNAFILE, $dna_filename);
    $name = <DNAFILE>;          # first line is the fasta header
    @DNA = <DNAFILE>;           # remaining lines are sequence
    close DNAFILE;
    $DNA = join( '', @DNA);
    $DNA =~ s/\s//g;

    # Three reading frames on the forward strand
    print "First ", dna2peptide($DNA), "\n";
    print "Second ", dna2peptide(substr($DNA, 1)), "\n";
    print "Third ", dna2peptide(substr($DNA, 2)), "\n";

    # Three reading frames on the reverse complement
    $DNA = reverse $DNA;
    $DNA =~ tr/ACGTacgt/TGCAtgca/;
    print "Fourth ", dna2peptide($DNA), "\n";
    print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
    print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

    sub dna2peptide {
        my ($dna) = @_;
        my $protein = "";
        # Translate one codon (three bases) at a time
        for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
            my $codon = substr($dna, $i, 3);
            $protein .= codon2aa($codon);
        }
        return $protein;
    }
    sub codon2aa { ... }

Modules
• A Perl module is a self-contained piece of Perl code that can be used by a Perl program later
• Like a library
• The file name ends with the extension .pm
• The file must end with a true value, conventionally 1;

Bio.pm
    sub codon2aa { .... .... }
    sub dna2peptide { .... .... }
    1;

    #!/usr/bin/perl -w
    use Bio;
    print "Please type the filename: ";
    $dna_filename = <STDIN>;
    chomp $dna_filename;
    open(DNAFILE, $dna_filename);
    $name = <DNAFILE>;
    @DNA = <DNAFILE>;
    close DNAFILE;
    $DNA = join( '', @DNA);
    $DNA =~ s/\s//g;

    print "First ", dna2peptide($DNA), "\n";
    print "Second ", dna2peptide(substr($DNA, 1)), "\n";
    print "Third ", dna2peptide(substr($DNA, 2)), "\n";

    $DNA = reverse $DNA;
    $DNA =~ tr/ACGTacgt/TGCAtgca/;
    print "Fourth ", dna2peptide($DNA), "\n";
    print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
    print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

Bio.pm
    sub codon2aa { .... .... }
    sub dna2peptide {
        .... ....
    }
    sub fasta_read {
        print "Please type the filename: ";
        my $dna_filename = <STDIN>;
        chomp $dna_filename;
        unless (open(DNAFILE, $dna_filename)) {
            print "Cannot open file ", $dna_filename, "\n";
            exit;    # stop here instead of reading from an unopened filehandle
        }
        my $name = <DNAFILE>;    # first line is the fasta header
        my @DNA = <DNAFILE>;
        close DNAFILE;
        my $DNA = join( '', @DNA);
        $DNA =~ s/\s//g;
        return $DNA;
    }
    1;

    #!/usr/bin/perl -w
    use Bio;
    $DNA = fasta_read();

    print "First ", dna2peptide($DNA), "\n";
    print "Second ", dna2peptide(substr($DNA, 1)), "\n";
    print "Third ", dna2peptide(substr($DNA, 2)), "\n";

    $DNA = reverse $DNA;
    $DNA =~ tr/ACGTacgt/TGCAtgca/;
    print "Fourth ", dna2peptide($DNA), "\n";
    print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
    print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

Scope
• my provides lexical scoping; a variable declared with my is visible only within the block in which it is declared.
• A block is the code enclosed in a pair of curly braces {}; a file also counts as a block.
• To create package globals, use "use vars qw([list of var names])" or "our ([var_names])".

With use strict turned on, the undeclared $DNA no longer compiles:

    #!/usr/bin/perl -w
    use Bio;
    use strict;
    use warnings;

    $DNA = fasta_read();

    print "First ", dna2peptide($DNA), "\n";
    print "Second ", dna2peptide(substr($DNA, 1)), "\n";
    print "Third ", dna2peptide(substr($DNA, 2)), "\n";

    $DNA = reverse $DNA;
    $DNA =~ tr/ACGTacgt/TGCAtgca/;
    print "Fourth ", dna2peptide($DNA), "\n";
    print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
    print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

    Variable "$DNA" is not imported at frame2.pl line 6.
    Variable "$DNA" is not imported at frame2.pl line 8.
    Variable "$DNA" is not imported at frame2.pl line 9.
    Variable "$DNA" is not imported at frame2.pl line 10.
    Variable "$DNA" is not imported at frame2.pl line 12.
    Variable "$DNA" is not imported at frame2.pl line 12.
    Variable "$DNA" is not imported at frame2.pl line 13.
    Variable "$DNA" is not imported at frame2.pl line 14.
    Variable "$DNA" is not imported at frame2.pl line 15.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 6.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 8.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 9.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 10.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 12.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 12.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 13.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 14.
    Global symbol "$DNA" requires explicit package name at frame2.pl line 15.
    Execution of frame2.pl aborted due to compilation errors.

Declaring the variable with my fixes the errors:

    #!/usr/bin/perl -w
    use Bio;
    use strict;
    use warnings;

    my $DNA = fasta_read();

    print "First ", dna2peptide($DNA), "\n";
    print "Second ", dna2peptide(substr($DNA, 1)), "\n";
    print "Third ", dna2peptide(substr($DNA, 2)), "\n";

    $DNA = reverse $DNA;
    $DNA =~ tr/ACGTacgt/TGCAtgca/;
    print "Fourth ", dna2peptide($DNA), "\n";
    print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
    print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

    my $x = 10;
    for (my $x = 0; $x < 5; $x++) {    # this $x exists only inside the loop
        Scope();
        print $x, "\n";                # prints 0 through 4
    }
    print $x, "\n";                    # prints 10: the outer $x is untouched
    sub Scope {
        my $x = 0;                     # a third $x, private to this subroutine
    }

    sub get_file_data {
        my($filename) = @_;
        use strict;
        use warnings;

        # Initialize variables
        my @filedata = ( );

        unless( open(GET_FILE_DATA, $filename) ) {
            print STDERR "Cannot open file \"$filename\"\n\n";
            exit;
        }
        @filedata = <GET_FILE_DATA>;
        close GET_FILE_DATA;
        return @filedata;
    }

    sub extract_sequence_from_fasta_data {
        my(@fasta_file_data) = @_;
        my $sequence = '';
        foreach my $line (@fasta_file_data) {
            if ($line =~ /^\s*$/) {        # discard blank lines
                next;
            } elsif ($line =~ /^\s*#/) {   # discard comment lines
                next;
            } elsif ($line =~ /^>/) {      # discard the fasta header line
                next;
            } else {                       # keep the line: it is sequence data
                $sequence .= $line;
            }
        }
        # remove non-sequence data (in this case, whitespace) from $sequence string
        $sequence =~ s/\s//g;
        return $sequence;
    }

Molecular Scissors
(Figure: Molecular Cell Biology, 4th edition)

Discovering Restriction Enzymes
• HindII
  (the first restriction enzyme) was discovered accidentally in 1970 during a study of how the bacterium Haemophilus influenzae takes up DNA from a virus
• Recognizes and cuts DNA at sequences matching GTYRAC (Y = C or T, R = G or A), for example:
    • GTCGAC
    • GTTAAC

Recognition Sites of Restriction Enzymes
(Figure: Molecular Cell Biology, 4th edition)

Uses of Restriction Enzymes
• Recombinant DNA technology
• Cloning
• cDNA/genomic library construction
• DNA mapping

Restriction Enzyme Database
• http://rebase.neb.com/rebase/rebase.html
• http://rebase.neb.com/rebase/rebase.files.html

    R = G or A
    Y = C or T
    M = A or C
    K = G or T
    S = G or C
    W = A or T
    B = not A (C or G or T)
    D = not C (A or G or T)
    H = not G (A or C or T)
    V = not T (A or C or G)
    N = A or C or G or T

    sub IUB_to_regexp {
        my($iub) = @_;
        my $regular_expression = '';
        my %iub2character_class = (
            A => 'A',
            C => 'C',
            G => 'G',
            T => 'T',
            R => '[GA]',
            Y => '[CT]',
            M => '[AC]',
            K => '[GT]',
            S => '[GC]',
            W => '[AT]',
            B => '[CGT]',
            D => '[AGT]',
            H => '[ACT]',
            V => '[ACG]',
            N => '[ACGT]',
        );
        # Remove the ^ signs that mark the cut position in the recognition site
        $iub =~ s/\^//g;
        # Translate each IUB character into its character class
        for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
            $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
        }
        return $regular_expression;
    }

Hash
• Initialize: my %hash = ();
• Add key/value pair: $hash{$key} = $value;
• Add more keys:
    • %hash = ( 'key1', 'value1', 'key2', 'value2' );
    • %hash = ( key1 => 'value1', key2 => 'value2', );
• Delete: delete $hash{$key};

    while ( my ($key, $value) = each(%hash) ) {
        print "$key => $value\n";
    }

    for my $key ( keys %hash ) {
        my $value = $hash{$key};
        print "$key => $value\n";
    }

    sub codon2aa {
        my($codon) = @_;
        $codon = uc $codon;
        my %genetic_code = (
            'TCA' => 'S',    # Serine
            'TCC' => 'S',    # Serine
            'TCG' => 'S',    # Serine
            'TCT' => 'S',    # Serine
            'TTC' => 'F',    # Phenylalanine
            'TTT' => 'F',    # Phenylalanine
            'TTA' => 'L',    # Leucine
            'TTG' => 'L',    # Leucine
            # Many more
        );
        if (exists $genetic_code{$codon}) {
            return $genetic_code{$codon};
        } else {
            print STDERR "Bad codon \"$codon\"!!\n";
            exit;
        }
    }

    sub parseREBASE {
        my($rebasefile) = @_;
        my @rebasefile = ( );
        my %rebase_hash = ( );
        my $name;
        my $site;
        my $regexp;

        open(my $rebase_filehandle, '<', $rebasefile) or die "Cannot open file\n";
        while (<$rebase_filehandle>) {
            # Discard header lines
            ( 1 .. /Rich Roberts/ ) and next;
            # Discard blank lines
            /^\s*$/ and next;
            # Split the two (or three if includes parenthesized name) fields
            my @fields = split( " ", $_);
            $name = shift @fields;
            $site = pop @fields;
            # Translate the recognition sites to regular expressions
            $regexp = IUB_to_regexp($site);
            # Store the data into the hash
            $rebase_hash{$name} = "$site $regexp";
        }
        close $rebase_filehandle;
        # Return the hash containing the reformatted REBASE data
        return %rebase_hash;
    }

Range
• ( 1 .. /Rich Roberts/ ) and next
    • The range is true from the first line through the first line containing "Rich Roberts"
    • If the range is true, Perl goes on to evaluate the statement after "and"
    • If the range is false, the whole expression is already false, so the statement after "and" is skipped
• open(…) or die
    • If the file can be opened, the expression is already true, so the statement after "or" is not evaluated
    • If the file cannot be opened, the left side is false, so Perl evaluates the statement after "or" to see whether it can be true

    @fred = (1,2,3);
    @barney = @fred;               # copy of @fred
    @huh = 1;                      # one-element list (1)
    @fred = qw(one two);           # ("one","two")
    @barney = (4,5,@fred,6,7);     # (4,5,"one","two",6,7)
    @barney = (8,@barney);         # (8,4,5,"one","two",6,7)
    ($a,$b,$c) = (1,2,3);
    @fred = (@barney = (2,3,4));   # both arrays become (2,3,4)
    @fred = @barney = (2,3,4);     # same as the previous line

    @fred = (1,2,3);
    $fred[3] = "hi";
    $fred[6] = "ho";
    # @fred is now (1,2,3,"hi",undef,undef,"ho")

Array operators
• push and pop (right-most element)
    • @mylist = (1,2,3); push(@mylist,4,5,6);
    • $oldvalue = pop(@mylist);
• shift and unshift (left-most element)
    • @fred = (5,6,7); unshift(@fred,2,3,4);
    • $x = shift(@fred);
• reverse: @a = (7,8,9); @b = reverse(@a);
• sort: @a = (7,9,9); @b = sort(@a);

    sub match_positions {
        my($regexp, $sequence) = @_;
        use BeginPerlBioinfo;
        my @positions = ( );
        while ( $sequence =~ /$regexp/ig ) {
            # pos() is the offset just past the match; convert to a 1-based start
            push ( @positions, pos($sequence) - length($&) + 1);
        }
        return @positions;
    }

    use BeginPerlBioinfo;

    my %rebase_hash = ( );
    my @file_data = ( );
    my $recognition_site = '';
    my $regexp = '';
    my @locations = ( );
    my $query = '';
    my $dna = '';

    @file_data = get_file_data("sample.dna");
    $dna = extract_sequence_from_fasta_data(@file_data);
    %rebase_hash = parseREBASE('bionet');

    do {
        print "Search for which restriction enzyme (or quit)?: ";
        $query = <STDIN>;
        chomp $query;
        # Exit on an empty query
        if ($query =~ /^\s*$/ ) {
            exit;
        }
        if ( exists $rebase_hash{$query} ) {
            ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
            @locations = match_positions($regexp, $dna);
            if (@locations) {
                print "Searching for $query $recognition_site $regexp\n";
                print "Restriction sites for $query at: ", join(" ", @locations), "\n";
            } else {
                print "A restriction site for $query is not in the DNA\n";
            }
        }
    } until ( $query =~ /quit/ );
    exit;
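
The two central steps of the search, translating an IUB-coded site into a regular expression and collecting the match positions, can be tried without the REBASE file or BeginPerlBioinfo. The following is a minimal sketch under stated assumptions, not the course's own implementation: the subroutine names iub_to_regexp and find_sites, the site GTY^RAC, and the short test sequence are all made up for illustration, and the match start is read from Perl's built-in @- array rather than computed from pos() and $&.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for this sketch: translate an IUB-coded recognition
# site into a regular expression, one character class per IUB letter.
sub iub_to_regexp {
    my ($iub) = @_;
    my %class = (
        A => 'A',     C => 'C',     G => 'G',     T => 'T',
        R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
        S => '[GC]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
        H => '[ACT]', V => '[ACG]', N => '[ACGT]',
    );
    $iub =~ s/\^//g;    # drop the ^ that marks the cut position
    return join '', map { $class{$_} } split //, uc $iub;
}

# Hypothetical helper for this sketch: return the 1-based start position
# of every non-overlapping match of $regexp in $sequence.
sub find_sites {
    my ($regexp, $sequence) = @_;
    my @positions;
    while ( $sequence =~ /$regexp/ig ) {
        push @positions, $-[0] + 1;    # $-[0] is the 0-based match start
    }
    return @positions;
}

# Assumed example data, not taken from the slides
my $regexp = iub_to_regexp('GTY^RAC');
my $dna    = 'AAGTTAACCCGTCGACTT';
my @sites  = find_sites($regexp, $dna);

print "$regexp\n";    # prints GT[CT][GA]AC
print "@sites\n";     # prints 3 11
```

Reading the start offset from $-[0] gives the same positions as pos($sequence) - length($&) + 1 and avoids the $& variable, which slows down regular expression matching in older versions of Perl.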