Bioinformatics
Bioinformatics: Theory and Practice
Jijun Tang
[email protected]
Center for Computational Biology, Beijing Forestry University
www.bjfuccb.edu
Hash
• Initialize: my %hash = ();
• Add key/value pair: $hash{$key} = $value;
• Add more keys:
• %hash = ( 'key1', 'value1', 'key2', 'value2' );
• %hash = ( key1 => 'value1', key2 => 'value2', );
• Delete: delete $hash{$key};
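A minimal sketch of these hash operations in use, counting how often each base occurs in a DNA string. The sub name count_bases and the sample sequence are illustrative and not part of the course material.

```perl
use strict;
use warnings;

# Count how often each base occurs in a DNA string (hypothetical helper).
sub count_bases {
    my ($dna) = @_;
    my %count = ();                 # initialize an empty hash
    foreach my $base (split //, uc $dna) {
        $count{$base}++;            # creates the key on first sight, then increments
    }
    return %count;
}

my %count = count_bases("ATGGCTGACT");
delete $count{N};                   # deleting a key that is absent is harmless
foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
```

Sorting the keys gives a stable print order, since Perl hashes have no defined ordering.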
Print to file
• Open a file to print
• open FILE, ">filename.txt";
• open (FILE, ">filename.txt");
• Print to the file
• print FILE $str;
#Append
open(FILE, ">>out") or die "Cannot open file to write";
print FILE "Test\n";
close FILE;
exit;
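The same write and append steps can also be sketched with a lexical filehandle and the three-argument form of open, which keeps the mode separate from the file name. The sub name write_lines and the file name out_demo.txt are made up for this example.

```perl
use strict;
use warnings;

# Write lines to a file; $mode is '>' (overwrite) or '>>' (append).
sub write_lines {
    my ($path, $mode, @lines) = @_;
    open(my $fh, $mode, $path) or die "Cannot open $path: $!";
    print {$fh} "$_\n" for @lines;
    close($fh) or die "Cannot close $path: $!";
    return scalar @lines;           # number of lines written
}

write_lines("out_demo.txt", '>',  "first line");    # create or truncate
write_lines("out_demo.txt", '>>', "second line");   # append to the same file
```

Including $! in the error message reports why the open failed, for example a missing directory or a permission problem.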
#!/usr/bin/perl
print "My name is $0 \n";
print "First arg is: $ARGV[0] \n";
print "Second arg is: $ARGV[1] \n";
print "Third arg is: $ARGV[2] \n";
$num = $#ARGV + 1; print "How many args? $num \n";
print "The full argument string was: @ARGV \n";
use BeginPerlBioinfo;
my %rebase_hash = ( );
my @file_data = ( );
my $query = '';
my $dna = '';
my $recognition_site = '';
my $regexp = '';
my @locations = ( );
@file_data = get_file_data($ARGV[0]);
$dna = extract_sequence_from_fasta_data(@file_data);
%rebase_hash = parseREBASE($ARGV[1]);
do {
print "Search for which restriction enzyme (or quit)?: ";
$query = <STDIN>;
chomp $query;
if ($query =~ /^\s*$/ ) { exit;
}
if ( exists $rebase_hash{$query} ) {
($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
@locations = match_positions($regexp, $dna);
if (@locations) {
print "Searching for $query $recognition_site $regexp\n";
print "Restriction site for $query at :", join(" ", @locations), "\n";
} else {
print "A restriction site for $query is not in the DNA:\n";
}
}
} until ( $query =~ /quit/ );
exit;
Regular Expression
• ^ beginning of string
• $ end of string
• . any character except newline
• * match 0 or more times
• + match 1 or more times
• ? match 0 or 1 times
• | alternative
• ( ) grouping; “storing”
• [ ] set of characters
• { } repetition modifier
• \ quote or special
$mystring = "[2004/04/13] The date of this article.";
if($mystring =~ m/(\d)/) {
print "The first digit is $1.";
}
if($mystring =~ m/(\d+)/) {
print "The first number is $1.";
}
if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) {
print "The date is $1-$2-$3";
}
while($mystring =~ m/(\d+)/g) {
print "Found number $1.";
}
@myarray = ($mystring =~ m/(\d+)/g);
print join(",", @myarray);
Download and install programs
• Unzip or untar
• unzip
• If file.tar.gz, tar xvfz file.tar.gz
• Go to the directory and “./configure”
• Then “make”
Exercises
• Download clustalw
• Try to install it
System subroutine
system ("ls -ltr");
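A short sketch of checking whether the external command succeeded, since system returns the wait status rather than the program's output. The helper name run_command is hypothetical; the example runs the Perl interpreter itself ($^X) so that it does not depend on any other installed program.

```perl
use strict;
use warnings;

# Run an external command and return its exit code (hypothetical helper).
# The list form of system() bypasses the shell, so arguments need no quoting.
sub run_command {
    my @cmd = @_;
    my $status = system(@cmd);
    die "Failed to start '$cmd[0]': $!\n" if $status == -1;
    return $status >> 8;            # the upper byte holds the program's exit code
}

my $code = run_command($^X, "-e", "exit 3");
print "exit code: $code\n";
```

An exit code of 0 conventionally means success, so a script can stop or skip a step when an alignment program reports anything else.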
Exercises 2
• Use pro.fasta
• Find the alignment for each triple of proteins
• Let’s design the program together
• Use “system” in Perl
• system ("command parameters");
sub ReadFasta {
my ($fname) = @_;
open(FILE, $fname) or die "Cannot open $fname\n";
my $data = "";
my @dnas = ();
while(my $line = <FILE>) {
if ($line =~ /^>/) {
if ($data ne "") {
push(@dnas, $data);
}
$data = "";
}
$data .= $line;
}
if ($data ne "") {
push(@dnas, $data);
}
close FILE;
return @dnas;
}
print "Please input file name:\n";
my $fname = <STDIN>;
chomp $fname;   # remove the trailing newline, or open() will fail
my @dnas = ReadFasta($fname);
my $len = $#dnas + 1;
for (my $i = 0; $i < $len; $i++) {
for (my $j = $i+1; $j < $len; $j++) {
for (my $k = $j+1; $k < $len; $k++) {
$fname = "$i\_$j\_$k";
print "$fname\n";
open(OUT, ">$fname") or die "Cannot open $fname\n";
print OUT $dnas[$i];
print OUT $dnas[$j];
print OUT $dnas[$k];
close OUT;
system ("./clustalw2 $i\_$j\_$k");
}
}
}
Working with Single DNA Sequences
Learning Objectives
• Discover how to manipulate your DNA sequence
on a computer, analyze its composition, predict
its restriction map, and amplify it with PCR
• Find out about gene-prediction methods, their
potential, and their limitations
• Understand how genomes are sequenced and
assembled
Outline
1. Cleaning your DNA of contaminants
2. Digesting your DNA in the computer
3. Finding protein-coding genes in your DNA
sequence
4. Assembling a genome
Cleaning DNA Sequences
• In order to sequence genomes, DNA sequences are often
cloned in a vector (plasmid, YAC, or cosmid)
• Sequences of the vector can be mixed with your DNA sequence
• Before working with your DNA sequence, you should always
clean it with VecScreen
VecScreen
• http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
• Runs a special version of Blast
• A system for quickly identifying
segments of a nucleic acid sequence that
may be of vector origin
What to do if hits found
• If the hits are at the ends of the sequence, you can
just remove them
• If they are in the middle, or match vectors you are
not using, the safest thing is to
throw the sequence away
Computing a Restriction Map
• It is possible to cut DNA sequences using restriction enzymes
• Each type of restriction enzyme recognizes and cuts a different
sequence:
• EcoRI: GAATTC
• BamHI: GGATCC
• There are more than 900 different restriction enzymes, each with a
different specificity
• The restriction map is the list of all potential cleavage sites in a DNA
molecule
• You can compile a restriction map with www.firstmarket.com/cutter
• If you cannot get it to work, use http://biotools.umassmed.edu/tacg4 instead
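A restriction map for the two enzymes named above can be sketched in a few lines of Perl. This is only an illustration: the sub name find_sites and the test sequence are made up, and a real map would load the full enzyme list (for example from REBASE) rather than a two-entry hash. Both recognition sites are palindromic, so scanning one strand is enough here.

```perl
use strict;
use warnings;

# Recognition sites named on the slide; a real map would load many more.
my %sites = ( EcoRI => 'GAATTC', BamHI => 'GGATCC' );

# Return the 1-based start positions of every occurrence of $site in $dna.
sub find_sites {
    my ($site, $dna) = @_;
    my @positions = ();
    while ($dna =~ /(?=\Q$site\E)/g) {   # lookahead also finds overlapping hits
        push @positions, pos($dna) + 1;
    }
    return @positions;
}

my $dna = "AAGAATTCTTGGATCCAAGAATTC";     # made-up test sequence
foreach my $enzyme (sort keys %sites) {
    my @hits = find_sites($sites{$enzyme}, $dna);
    print "$enzyme ($sites{$enzyme}): @hits\n";
}
```

The positions reported are where each recognition site starts, not the exact cut point inside the site.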
Making PCR with a Computer
• Polymerase Chain Reaction (PCR) is a method for amplifying DNA
• PCR is used for many applications, including
• Gene cloning
• Forensic analysis
• Paternity tests
• PCR amplifies the DNA between two anchors
• These anchors are called the PCR primers
Designing PCR Primers
• PCR primers are typically 20 nucleotides long
• The primers must hybridize well with the DNA
• On biotools.umassmed.edu, find the best location for
the primers:
• Most stable
• Longest extension
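One rough way to compare candidate primers for stability is the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, a rule of thumb for short oligonucleotides. It is used here as a stand-in for illustration and is not necessarily what the biotools site computes; the sub name wallace_tm and the 20-mer are made up.

```perl
use strict;
use warnings;

# Rough melting temperature (degrees C) by the Wallace rule: 2*(A+T) + 4*(G+C).
sub wallace_tm {
    my ($primer) = @_;
    my $at = ($primer =~ tr/ATat//);   # tr with an empty replacement only counts
    my $gc = ($primer =~ tr/GCgc//);
    return 2 * $at + 4 * $gc;
}

my $primer = "ATGGCTGACTGGATCCAAGA";   # made-up 20-mer
printf "%s Tm ~ %d C\n", $primer, wallace_tm($primer);
```

Primer pairs are usually chosen so that both primers have similar melting temperatures.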
Analyzing DNA Composition
• DNA composition varies a lot
• Stability of a DNA sequence depends on its G+C
content (total guanine and cytosine)
• High G+C makes very stable DNA molecules
• Online resources are available to measure the
GC content of your DNA sequence
• Also for counting words and internal repeats
http://helixweb.nih.gov/emboss/html/
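G+C content is also easy to compute locally. A minimal sketch, assuming the sequence contains only the letters A, C, G and T (other characters are ignored); the sub name gc_content is illustrative.

```perl
use strict;
use warnings;

# Fraction of G and C among the A/C/G/T characters of a sequence.
sub gc_content {
    my ($dna) = @_;
    my $gc    = ($dna =~ tr/GCgc//);      # count G and C
    my $total = ($dna =~ tr/ACGTacgt//);  # count all four bases
    return $total ? $gc / $total : 0;     # avoid dividing by zero
}

printf "GC content: %.1f%%\n", 100 * gc_content("ATGGCTGACT");
```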
Counting words
• ATGGCTGACT
• A, T, G, G, C, T, G, A, C, T
• AT, TG, GG, GC, CT, TG, GA, AC, CT
• ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
www.genomatix.de/cgi-bin/tools/tools.pl
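The word lists above come from sliding a window of length k along the sequence one base at a time. A small sketch of that idea with a hash as the counter; the sub name count_words is made up for this example.

```perl
use strict;
use warnings;

# Count every overlapping word of length $k in $dna.
sub count_words {
    my ($dna, $k) = @_;
    my %count = ();
    for my $i (0 .. length($dna) - $k) {     # empty range if $dna is shorter than $k
        $count{ substr($dna, $i, $k) }++;
    }
    return %count;
}

my %words = count_words("ATGGCTGACT", 2);
print "$_: $words{$_}\n" for sort keys %words;
```

For the example sequence, the 2-letter words TG and CT each occur twice, which matches the list on the slide.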
EMBOSS servers
• European Molecular Biology Open Software
Suite
• http://pro.genomics.purdue.edu/emboss/
ORF
• EMBOSS
• NCBI
ncbi.nlm.nih.gov/gorf/gorf.html
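For illustration, a much simplified open reading frame search can be written with one regular expression: an ATG followed, in frame, by the first stop codon. This sketch assumes an uppercase sequence, scans the forward strand only, and is not how the EMBOSS or NCBI tools are implemented; the sub name find_orfs and the test sequence are made up.

```perl
use strict;
use warnings;

# Return [start, sequence] pairs for forward-strand ORFs: an ATG followed
# in frame by the first stop codon. Start positions are 1-based.
sub find_orfs {
    my ($dna, $min_len) = @_;
    my @orfs = ();
    # The lookahead lets every ATG be tried, including ones inside another ORF.
    while ($dna =~ /(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))/g) {
        push @orfs, [ pos($dna) + 1, $1 ] if length($1) >= $min_len;
    }
    return @orfs;
}

my $dna = "CCATGGCTGACTAAGGATGTTTTGA";   # made-up test sequence
for my $orf (find_orfs($dna, 9)) {
    print "ORF at $orf->[0]: $orf->[1]\n";
}
```

A fuller search would also scan the reverse complement so that ORFs on the other strand are found.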
Internal repeats
• A word repeated in the sequence, long
enough to not occur by chance
• Can be imperfect (regular expression)
• Dot plot is the best way to spot it
arbl.cvmbs.colostate.edu/molkit
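Exact internal repeats can be found by recording where each word of a chosen length occurs and keeping the words seen more than once. This sketch handles perfect repeats only; imperfect repeats need a dot plot or an alignment. The sub name find_repeats and the test sequence are illustrative.

```perl
use strict;
use warnings;

# Map each word of length $k that occurs more than once to its 1-based positions.
sub find_repeats {
    my ($dna, $k) = @_;
    my %where = ();
    for my $i (0 .. length($dna) - $k) {
        push @{ $where{ substr($dna, $i, $k) } }, $i + 1;
    }
    my %repeats = ();
    for my $word (keys %where) {
        $repeats{$word} = $where{$word} if @{ $where{$word} } > 1;
    }
    return %repeats;
}

my %repeats = find_repeats("ACGTTGCAACGTTGGA", 6);
print "$_ at @{ $repeats{$_} }\n" for sort keys %repeats;
```

The word length should be long enough that a repeat is unlikely to occur by chance in a sequence of that size.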
Predicting Genes
• The most important analysis carried out on DNA
sequences is gene prediction
• Gene prediction requires different methods for
eukaryotes and prokaryotes
• Most gene-prediction methods use hidden
Markov Models
Predicting Genes in Prokaryotic Genome
• In prokaryotes, protein-coding genes are
uninterrupted
• No introns
• Predicting protein-coding genes in prokaryotes
is considered a solved problem
• You can expect 99% accuracy
Finding Prokaryotic Genes
with GeneMark
• GeneMark is the state of the
art for microbial genomes
• GeneMark can
• Find short proteins
• Resolve overlapping genes
• Identify the best start codon
• Use
exon.gatech.edu/GeneMark
• Click the “heuristic models”
Predicting Eukaryotic Genes
• Eukaryotic genes (human, for example) are very hard to predict
• Precise and accurate eukaryotic gene prediction is still an open
problem
• ENSEMBL contains 21,662 genes for the human genome
• There may well be more genes than that in the genome, as yet
unpredicted
• You can expect 70% accuracy on the human genome with
automatic methods
Finding Eukaryotic Genes
with GenomeScan
• GenomeScan is the state of
the art for eukaryotic genes
• GenomeScan works best
with
• Long exons
• Genes with a low GC content
• It can incorporate
experimental information
• Use
genes.mit.edu/genomescan
Producing Genomic Data
• Until recently, sequencing an entire genome was very
expensive and difficult
• Only major institutes could do it
• Today, scientists estimate that in 10 years, it will cost
about $1000 to sequence a human genome
• With sequencing so cheap, assembling your own
genomes is becoming an option
• How could you do it?
Sequencing and Assembling
a Genome (I)
• To sequence a genome, the first task is to cut
it into many small, overlapping pieces
• Then clone each piece
Sequencing and Assembling
a Genome (II)
• Each piece must be sequenced
• Sequencing machines cannot do an entire sequence at
once
• They can only produce short sequences shorter than 1 kb
• These pieces are called reads
• It is necessary to assemble the reads into contigs
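The core idea of assembly, joining reads whose ends overlap, can be sketched for two error-free reads. Real assemblers also handle sequencing errors, base quality, both strands and very large read sets; the sub names, the minimum overlap of 4 and the reads below are all made up for illustration.

```perl
use strict;
use warnings;

# Length of the longest suffix of $left that is also a prefix of $right,
# considering only overlaps of at least $min bases (0 if there is none).
sub overlap_length {
    my ($left, $right, $min) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    for (my $len = $max; $len >= $min; $len--) {
        return $len if substr($left, -$len) eq substr($right, 0, $len);
    }
    return 0;
}

# Merge two reads into one contig if they overlap; otherwise return undef.
sub merge_reads {
    my ($left, $right, $min) = @_;
    my $len = overlap_length($left, $right, $min);
    return $len ? $left . substr($right, $len) : undef;
}

my $contig = merge_reads("ATGGCTGACT", "TGACTGGATCC", 4);
print "contig: $contig\n";
```

Requiring a minimum overlap guards against joining reads that share only a few bases by chance.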
Sequencing and Assembling
a Genome (III)
• The most popular program for assembling reads is
PHRAP
• Available at www.phrap.org
• Other programs exist for joining smaller datasets
• For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php