Bioinformatics
Bioinformatics: Theory and Practice
Jijun Tang (唐继军)
13928761660
Exercise 1
• Ask for a protein file in fasta format
• Ask for an amino acid
• Count the frequency of that amino acid
• Sample sequence:
TKFHSNAHFYDCWRMLQYQLDMRCMRAISTF
SPHCGMEHMPDQTHNQGEMCKPRMWQVS
MNQSCNHTPPFRKTYVEWDYMAKALIAPYTL
GWLASTCFIW
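One possible way to approach the counting step of Exercise 1 is sketched below. It is not the lecture's solution: the helper name count_residue is hypothetical, and the sample sequence is written directly into the script instead of being read from a fasta file, so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): count how often one
# amino acid letter occurs in a protein sequence, ignoring case.
sub count_residue {
    my ($protein, $residue) = @_;
    my $count = 0;
    # Walk the sequence one character at a time and compare.
    foreach my $letter (split //, uc($protein)) {
        $count++ if $letter eq uc($residue);
    }
    return $count;
}

# The sample protein from the slide, joined into one string.
my $protein = 'TKFHSNAHFYDCWRMLQYQLDMRCMRAISTF'
            . 'SPHCGMEHMPDQTHNQGEMCKPRMWQVS'
            . 'MNQSCNHTPPFRKTYVEWDYMAKALIAPYTL'
            . 'GWLASTCFIW';

my $w_count = count_residue($protein, 'W');
print "W occurs $w_count times\n";
```

In a full solution the sequence and the amino acid would come from the fasta file and from <STDIN>, as the exercise asks.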
Exercise 2
• Ask for a DNA file in fasta format
• Convert it to RNA
• Ask for a codon
• Count the frequency of that codon
• Sample sequence:
TCGTACTTAGAAATGAGGGTCCGCTTTTGCCC
ACGCACCTGATCGCTCCTCGTTTGCTTTTAAG
AACCGGACGAACCACAGAGCATAAGGAGAA
CCTCTAGCTGCTTTACAAAGTACTGGTTCCCT
TTCCAGCGGGATGCTTTATCTAAACGCAATGA
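A minimal sketch of the two core steps of Exercise 2 follows. The helper names dna2rna and count_codon are hypothetical, and the sketch makes one assumption the slide does not state: codons are counted in the first reading frame only. Only the first line of the sample sequence is used, inlined so the code runs without a file.

```perl
use strict;
use warnings;

# Hypothetical helper: transcribe DNA to RNA by replacing T with U.
sub dna2rna {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;
    return $rna;
}

# Hypothetical helper: count a codon, stepping through the sequence
# three bases at a time from the first base (reading frame 1 only).
sub count_codon {
    my ($rna, $codon) = @_;
    my $count = 0;
    for (my $i = 0; $i < length($rna) - 2; $i += 3) {
        $count++ if uc(substr($rna, $i, 3)) eq uc($codon);
    }
    return $count;
}

my $dna = 'TCGTACTTAGAAATGAGGGTCCGCTTTTGCCC';  # first line of the sample
my $rna = dna2rna($dna);
my $uuu = count_codon($rna, 'UUU');
print "$rna\n";
print "UUU occurs $uuu time(s) in frame 1\n";
```

Counting in all three frames, or counting overlapping occurrences, would be a different design choice and would give different numbers.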
Subroutine
• Some code needs to be reused
• A good way to organize code
• Called "function" in some languages
• Name
• Return
• Parameters (@_)
sub codon2aa {
    my($codon) = @_;
    if    ( $codon =~ /GC./i)          { return 'A' }  # Alanine
    elsif ( $codon =~ /TG[TC]/i)       { return 'C' }  # Cysteine
    elsif ( $codon =~ /GA[TC]/i)       { return 'D' }  # Aspartic Acid
    elsif ( $codon =~ /GA[AG]/i)       { return 'E' }  # Glutamic Acid
    elsif ( $codon =~ /TT[TC]/i)       { return 'F' }  # Phenylalanine
    elsif ( $codon =~ /GG./i)          { return 'G' }  # Glycine
    elsif ( $codon =~ /CA[TC]/i)       { return 'H' }  # Histidine
    elsif ( $codon =~ /AT[TCA]/i)      { return 'I' }  # Isoleucine
    elsif ( $codon =~ /AA[AG]/i)       { return 'K' }  # Lysine
    elsif ( $codon =~ /TT[AG]|CT./i)   { return 'L' }  # Leucine
    elsif ( $codon =~ /ATG/i)          { return 'M' }  # Methionine
    elsif ( $codon =~ /AA[TC]/i)       { return 'N' }  # Asparagine
    elsif ( $codon =~ /CC./i)          { return 'P' }  # Proline
    elsif ( $codon =~ /CA[AG]/i)       { return 'Q' }  # Glutamine
    elsif ( $codon =~ /CG.|AG[AG]/i)   { return 'R' }  # Arginine
    elsif ( $codon =~ /TC.|AG[TC]/i)   { return 'S' }  # Serine
    elsif ( $codon =~ /AC./i)          { return 'T' }  # Threonine
    elsif ( $codon =~ /GT./i)          { return 'V' }  # Valine
    elsif ( $codon =~ /TGG/i)          { return 'W' }  # Tryptophan
    elsif ( $codon =~ /TA[TC]/i)       { return 'Y' }  # Tyrosine
    elsif ( $codon =~ /TA[AG]|TGA/i)   { return '_' }  # Stop
    else { print STDERR "Bad codon \"$codon\"!!\n"; exit; }
}
#!/usr/bin/perl -w
print "Please type the filename: ";
$dna_filename = <STDIN>; chomp $dna_filename;
open(DNAFILE, $dna_filename);
$name = <DNAFILE>; @DNA = <DNAFILE>; close DNAFILE;
$DNA = join( '', @DNA); $DNA =~ s/\s//g;
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;    # complement, so frames 4-6 read the reverse strand
print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";
sub dna2peptide {
    my ($dna) = @_;
    my $protein = "";
    # Translate each complete three-base codon into one amino acid
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        my $codon = substr($dna,$i,3);
        $protein .= codon2aa($codon);
    }
    return $protein;
}
sub codon2aa {
...
}
Modules
• A Perl module is a self-contained piece of [Perl] code that can be used by a Perl program later
• Like a library
• File name ends with the extension .pm
• Must end with a true value, conventionally the line 1;
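The points above can be tried out in a single script, as in the following sketch. Everything named here is hypothetical (the module MiniBio.pm, the subroutine revcom, the temporary directory); the script writes a tiny module file, adds its directory to @INC, and loads it with require, which is what use does at compile time for a module that already exists on disk.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a tiny module file into a temporary directory.
# MiniBio.pm and revcom are hypothetical names for this sketch.
my $dir = tempdir(CLEANUP => 1);
open(my $out, '>', "$dir/MiniBio.pm") or die "Cannot write module: $!\n";
print $out <<'END_MODULE';
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}
1;
END_MODULE
close $out;

# Tell Perl where to look, then load the module at run time.
unshift @INC, $dir;
require MiniBio;

my $rc = revcom('AAGC');
print "$rc\n";
```

If the final 1; is removed from the module text, require fails with an error saying the file did not return a true value, which is the reason for the last bullet.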
Bio.pm
sub codon2aa {
....
....
}
sub dna2peptide {
....
....
}
1
#!/usr/bin/perl -w
use Bio;
print "Please type the filename: ";
$dna_filename = <STDIN>; chomp $dna_filename;
open(DNAFILE, $dna_filename);
$name = <DNAFILE>; @DNA = <DNAFILE>; close DNAFILE;
$DNA = join( '', @DNA); $DNA =~ s/\s//g;
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;
print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";
Bio.pm
sub codon2aa {
....
....
}
sub dna2peptide {
....
....
}
sub fasta_read {
    print "Please type the filename: ";
    my $dna_filename = <STDIN>; chomp $dna_filename;
    unless (open(DNAFILE, $dna_filename)) {
        print "Cannot open file ", $dna_filename, "\n";
        exit;    # stop here instead of reading from an unopened filehandle
    }
    my $name = <DNAFILE>;    # first line is the fasta header
    my @DNA = <DNAFILE>; close DNAFILE;
    my $DNA = join( '', @DNA); $DNA =~ s/\s//g;
    return $DNA;
}
1
#!/usr/bin/perl -w
use Bio;
$DNA = fasta_read();
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;
print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";
Scope
• my provides lexical scoping; a variable declared with my is visible only within the block in which it is declared.
• Blocks of code are hunks within curly braces {}; files are blocks.
• Use use vars qw([list of var names]) or our ([var_names]) to create package globals.
#!/usr/bin/perl -w
use Bio;
use strict;
use warnings;
$DNA = fasta_read();
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;
print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";
Variable "$DNA" is not imported at frame2.pl line 6.
Variable "$DNA" is not imported at frame2.pl line 8.
Variable "$DNA" is not imported at frame2.pl line 9.
Variable "$DNA" is not imported at frame2.pl line 10.
Variable "$DNA" is not imported at frame2.pl line 12.
Variable "$DNA" is not imported at frame2.pl line 12.
Variable "$DNA" is not imported at frame2.pl line 13.
Variable "$DNA" is not imported at frame2.pl line 14.
Variable "$DNA" is not imported at frame2.pl line 15.
Global symbol "$DNA" requires explicit package name at frame2.pl line 6.
Global symbol "$DNA" requires explicit package name at frame2.pl line 8.
Global symbol "$DNA" requires explicit package name at frame2.pl line 9.
Global symbol "$DNA" requires explicit package name at frame2.pl line 10.
Global symbol "$DNA" requires explicit package name at frame2.pl line 12.
Global symbol "$DNA" requires explicit package name at frame2.pl line 12.
Global symbol "$DNA" requires explicit package name at frame2.pl line 13.
Global symbol "$DNA" requires explicit package name at frame2.pl line 14.
Global symbol "$DNA" requires explicit package name at frame2.pl line 15.
Execution of frame2.pl aborted due to compilation errors.
#!/usr/bin/perl -w
use Bio;
use strict;
use warnings;
my $DNA = fasta_read();
print "First ", dna2peptide($DNA), "\n";
print "Second ", dna2peptide(substr($DNA, 1)), "\n";
print "Third ", dna2peptide(substr($DNA, 2)), "\n";
$DNA = reverse $DNA;
$DNA =~ tr/ACGTacgt/TGCAtgca/;
print "Fourth ", dna2peptide($DNA), "\n";
print "Fifth ", dna2peptide(substr($DNA, 1)), "\n";
print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";
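my $x = 10;
for (my $x = 0; $x < 5; $x++) {
    Scope();
    print $x, "\n";    # the loop's own $x: prints 0 through 4
}
print $x, "\n";        # the outer $x: still 10
sub Scope {
    my $x = 0;         # private to Scope; changes neither $x above
}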
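To complement the example above, here is a small sketch contrasting a package global declared with our and lexicals declared with my. All variable and subroutine names ($counter, $label, bump) are invented for illustration.

```perl
use strict;
use warnings;

our $counter = 0;        # package global: visible in every sub of this package
my  $label   = 'outer';  # lexical, visible from here to the end of the file

sub bump {
    $counter++;                  # changes the package global
    my $label = 'inside bump';   # a new lexical; the outer $label is untouched
    return $label;
}

my $seen_in_sub = bump();
{
    my $label = 'inner block';   # shadows the outer $label until the closing brace
}

print "counter = $counter\n";
print "label = $label\n";
print "sub saw: $seen_in_sub\n";
```

After the run, $counter has been changed by the subroutine, while the outer $label keeps its original value because each my creates a separate variable.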
sub get_file_data {
    my($filename) = @_;
    use strict;
    use warnings;
    # Initialize variables
    my @filedata = ( );
    unless( open(GET_FILE_DATA, $filename) ) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}
sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        if ($line =~ /^\s*$/) {           # discard blank lines
            next;
        } elsif($line =~ /^\s*#/) {       # discard comment lines
            next;
        } elsif($line =~ /^>/) {          # discard the fasta header line
            next;
        } else {
            $sequence .= $line;
        }
    }
    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}
Molecular Scissors
Molecular Cell Biology, 4th edition
Discovering Restriction Enzymes
• HindII, the first restriction enzyme, was discovered accidentally in 1970 while studying how the bacterium Haemophilus influenzae takes up DNA from the virus
• Recognizes and cuts DNA at sequences:
  • GTGCAC
  • GTTAAC
Recognition Sites of Restriction Enzymes
Molecular Cell Biology, 4th edition
Uses of Restriction Enzymes
• Recombinant DNA technology
• Cloning
• cDNA/genomic library construction
• DNA mapping
Restriction Enzyme Database
• http://rebase.neb.com/rebase/rebase.html
http://rebase.neb.com/rebase/rebase.files.html
R = G or A
Y = C or T
M = A or C
K = G or T
S = G or C
W = A or T
B = not A (C or G or T)
D = not C (A or G or T)
H = not G (A or C or T)
V = not T (A or C or G)
N = A or C or G or T
sub IUB_to_regexp {
    my($iub) = @_;
    my $regular_expression = '';
    my %iub2character_class = (
        A => 'A',
        C => 'C',
        G => 'G',
        T => 'T',
        R => '[GA]',
        Y => '[CT]',
        M => '[AC]',
        K => '[GT]',
        S => '[GC]',
        W => '[AT]',
        B => '[CGT]',
        D => '[AGT]',
        H => '[ACT]',
        V => '[ACG]',
        N => '[ACGT]',
    );
    # Remove the ^ signs from the recognition sites
    $iub =~ s/\^//g;
    # Translate each character in the iub sequence
    for ( my $i = 0 ; $i < length($iub) ; ++$i ) {
        $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
    }
    return $regular_expression;
}
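As a quick check of how the translation behaves, the sketch below restates the subroutine in compact form, so that it runs on its own, and applies it to a site written in REBASE style. The input GTY^RAC and the test sequence are sample values chosen for illustration.

```perl
use strict;
use warnings;

# Compact restatement of the slide's translation table.
my %iub2character_class = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

sub IUB_to_regexp {
    my ($iub) = @_;
    my $regular_expression = '';
    $iub =~ s/\^//g;    # drop the cut-site marker
    for (my $i = 0; $i < length($iub); ++$i) {
        $regular_expression .= $iub2character_class{substr($iub, $i, 1)};
    }
    return $regular_expression;
}

my $regexp = IUB_to_regexp('GTY^RAC');
print "$regexp\n";
print "site found\n" if 'AAGTTAACGG' =~ /$regexp/;
```

The ambiguity codes become character classes, so one recognition site expands to a pattern that matches every concrete sequence it stands for.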
Hash
• Initialize: my %hash = ();
• Add key/value pair: $hash{$key} = $value;
• Add more keys:
  • %hash = ( 'key1', 'value1', 'key2', 'value2' );
  • %hash = ( key1 => 'value1', key2 => 'value2', );
• Delete: delete $hash{$key};
while ( my ($key, $value) = each(%hash) ) {
    print "$key => $value\n";
}
for my $key ( keys %hash ) {
    my $value = $hash{$key};
    print "$key => $value\n";
}
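The hash operations listed above can be combined in a short runnable sketch. The sequence and the variable names (%count, $protein) are made up for illustration; the hash is used to tally how often each residue occurs.

```perl
use strict;
use warnings;

my %count   = ();                 # initialize an empty hash
my $protein = 'MKALIAPYTLGW';     # made-up sample sequence

# Add or update a key/value pair for every residue.
$count{$_}++ for split //, $protein;

delete $count{'W'};               # remove one key

# keys() returns the keys in no particular order, so sort them
# to get reproducible output.
for my $residue (sort keys %count) {
    print "$residue => $count{$residue}\n";
}
```

Sorting the keys is a choice made for this sketch; the each and keys loops on the slide visit the same pairs, only in an unspecified order.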
sub codon2aa {
    my($codon) = @_;
    $codon = uc $codon;
    my %genetic_code = (
        'TCA' => 'S',    # Serine
        'TCC' => 'S',    # Serine
        'TCG' => 'S',    # Serine
        'TCT' => 'S',    # Serine
        'TTC' => 'F',    # Phenylalanine
        'TTT' => 'F',    # Phenylalanine
        'TTA' => 'L',    # Leucine
        'TTG' => 'L',    # Leucine
        #Many more
    );
    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }else{
        print STDERR "Bad codon \"$codon\"!!\n";
        exit;
    }
}
sub parseREBASE {
    my($rebasefile) = @_;
    my @rebasefile = ( );
    my %rebase_hash = ( );
    my $name;
    my $site;
    my $regexp;
    open(my $rebase_filehandle, $rebasefile) or die "Cannot open file\n";
    while(<$rebase_filehandle>) {
        # Discard header lines
        ( 1 .. /Rich Roberts/ ) and next;
        # Discard blank lines
        /^\s*$/ and next;
        # Split the two (or three if includes parenthesized name) fields
        my @fields = split( " ", $_);
        # The enzyme name is the first field, the recognition site the last
        $name = shift @fields;
        $site = pop @fields;
        # Translate the recognition sites to regular expressions
        $regexp = IUB_to_regexp($site);
        # Store the data into the hash
        $rebase_hash{$name} = "$site $regexp";
    }
    close $rebase_filehandle;
    # Return the hash containing the reformatted REBASE data
    return %rebase_hash;
}
Range
• ( 1 .. /Rich Roberts/ ) and next
  • The range is true from the first line through the first line containing Rich Roberts
  • If the range is true, Perl goes on to evaluate the statement after "and" (here, next)
  • If the range is false, the whole expression is already false, so the statement after "and" is skipped
• open(…) or die
  • If the file can be opened, the expression is already true, so the statement after "or" is not evaluated
  • If the file cannot be opened, the left side is false, so Perl evaluates the statement after "or" (here, die)
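The header-skipping idiom can be tried on a few lines of text, as in the sketch below. The sample text and the enzyme names Enz1 and Enz2 are invented; an in-memory filehandle stands in for the REBASE file so the example needs no external data. The constant 1 in the range is compared with the current input line number ($.), which is why the lines are read through a filehandle.

```perl
use strict;
use warnings;

# Invented sample text laid out like a file with a header section.
my $text = <<'END_SAMPLE';
REBASE-like header text
Rich Roberts

Enz1 GACGT^C
Enz2 GT^MKAC
END_SAMPLE

# Read the string through an in-memory filehandle.
open(my $fh, '<', \$text) or die "Cannot open in-memory file\n";

my @kept;
while (<$fh>) {
    ( 1 .. /Rich Roberts/ ) and next;   # skip line 1 through the matching line
    /^\s*$/ and next;                   # skip blank lines
    chomp;
    push @kept, $_;
}
close $fh;

print "$_\n" for @kept;
```

Only the two data lines survive: the range swallows the header through the Rich Roberts line, and the second test drops the blank line.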
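@fred = (1,2,3);               # @fred is (1,2,3)
@barney = @fred;               # @barney is a copy: (1,2,3)
@huh = 1;                      # 1 becomes the one-element list (1)
@fred = qw(one two);           # @fred is ("one","two")
@barney = (4,5,@fred,6,7);     # @barney is (4,5,"one","two",6,7)
@barney = (8,@barney);         # puts 8 in front of @barney
($a,$b,$c) = (1,2,3);          # $a is 1, $b is 2, $c is 3
@fred = (@barney = (2,3,4));   # @fred and @barney are both (2,3,4)
@fred = @barney = (2,3,4);     # same thing
@fred = (1,2,3);
$fred[3] = "hi";               # @fred is now (1,2,3,"hi")
$fred[6] = "ho";               # @fred is now (1,2,3,"hi",undef,undef,"ho")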
Array operators
• push and pop (right-most element)
• @mylist = (1,2,3); push(@mylist,4,5,6);
• $oldvalue = pop(@mylist);
• shift and unshift (left-most element)
• @fred = (5,6,7); unshift(@fred,2,3,4);
• $x = shift(@fred);
• reverse: @a = (7,8,9); @b = reverse(@a);
• sort: @a = (7,9,9); @b = sort(@a);
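The operators above can be exercised in one short script. The sketch reuses the slide's values where possible; the two sort lines use different numbers, chosen here to show that sort compares as strings unless a numeric comparison block is given.

```perl
use strict;
use warnings;

my @mylist = (1, 2, 3);
push(@mylist, 4, 5, 6);           # add to the right end
my $oldvalue = pop(@mylist);      # remove from the right end

my @fred = (5, 6, 7);
unshift(@fred, 2, 3, 4);          # add to the left end
my $x = shift(@fred);             # remove from the left end

my @b = reverse(7, 8, 9);

# sort compares as strings by default; use <=> for numeric order.
my @as_strings = sort (10, 9, 100);
my @as_numbers = sort { $a <=> $b } (10, 9, 100);

print "mylist: @mylist (popped $oldvalue)\n";
print "fred: @fred (shifted $x)\n";
print "reversed: @b\n";
print "string sort: @as_strings\n";
print "numeric sort: @as_numbers\n";
```

The difference between the last two lines is easy to miss with single-digit values such as those on the slide, where both orders coincide.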
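sub match_positions {
    my($regexp, $sequence) = @_;
    use BeginPerlBioinfo;
    my @positions = ( );
    # pos() is the offset just past the match; subtract the match length
    # and add 1 to get the 1-based start position
    while ( $sequence =~ /$regexp/ig ) {
        push ( @positions, pos($sequence) - length($&) + 1);
    }
    return @positions;
}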
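To see what match_positions returns, the sketch below repeats the subroutine without the BeginPerlBioinfo dependency and runs it on a short made-up sequence. The pattern and the sequence are sample inputs, not data from the lecture.

```perl
use strict;
use warnings;

# Same logic as the slide's subroutine, restated so the sketch
# runs without the BeginPerlBioinfo module.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @positions = ();
    while ($sequence =~ /$regexp/ig) {
        # pos() is the offset just past the match; convert it to the
        # 1-based position where the match starts.
        push @positions, pos($sequence) - length($&) + 1;
    }
    return @positions;
}

my @locations = match_positions('GT[CT][GA]AC', 'AAGTTAACGGGTCGACTT');
print "Sites at: @locations\n";
```

Positions are counted from 1, the convention biologists use for sequence coordinates, while pos() itself counts offsets from 0.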
use BeginPerlBioinfo;
my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );
# Read the DNA and the REBASE data
@file_data = get_file_data("sample.dna");
$dna = extract_sequence_from_fasta_data(@file_data);
%rebase_hash = parseREBASE('bionet');
# Prompt for enzyme names until the user quits
do {
    print "Search for what restriction enzyme (or quit)?: ";
    $query = <STDIN>;
    chomp $query;
    if ($query =~ /^\s*$/ ) {
        exit;
    }
    if ( exists $rebase_hash{$query} ) {
        ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
        @locations = match_positions($regexp, $dna);
        if (@locations) {
            print "Searching for $query $recognition_site $regexp\n";
            print "Restriction site for $query at :", join(" ", @locations), "\n";
        } else {
            print "A restriction site for $query is not in the DNA\n";
        }
    }
} until ( $query =~ /quit/ );
exit;