Download 009

Bioinformatics: Theory and Practice
唐继军 (Tang Jijun)
[email protected]
Center for Computational Biology, Beijing Forestry University
www.bjfuccb.edu

Hash
• Initialize: my %hash = ();
• Add key/value pair: $hash{$key} = $value;
• Add more keys:
    • %hash = ( 'key1', 'value1', 'key2', 'value2' );
    • %hash = ( key1 => 'value1', key2 => 'value2', );
• Delete: delete $hash{$key};

Print to file
• Open a file to print
    • open FILE, ">filename.txt";
    • open (FILE, ">filename.txt");
• Print to the file
    • print FILE $str;

    # Append: ">>" adds to the end of the file instead of overwriting it
    open(FILE, ">>out") or die "Cannot open file to write";
    print FILE "Test\n";
    close FILE;
    exit;

    #!/usr/bin/perl
    # $0 is the script name; @ARGV holds the command-line arguments
    print "My name is $0 \n";
    print "First arg is: $ARGV[0] \n";
    print "Second arg is: $ARGV[1] \n";
    print "Third arg is: $ARGV[2] \n";
    $num = $#ARGV + 1;    # last index + 1 = number of arguments
    print "How many args? $num \n";
    print "The full argument string was: @ARGV \n";

    # Usage: perl script.pl dna.fasta rebase_file
    use BeginPerlBioinfo;    # course module providing the subroutines used below

    my %rebase_hash = ( );
    my @file_data = ( );
    my $query = '';
    my $dna = '';
    my $recognition_site = '';
    my $regexp = '';
    my @locations = ( );

    # Read the DNA from the FASTA file and the enzyme table from the REBASE file
    @file_data = get_file_data($ARGV[0]);
    $dna = extract_sequence_from_fasta_data(@file_data);
    %rebase_hash = parseREBASE($ARGV[1]);

    do {
        print "Search for what restriction site for (or quit)?: ";
        $query = <STDIN>;
        chomp $query;
        # An empty line ends the program
        if ($query =~ /^\s*$/ ) {
            exit;
        }
        if ( exists $rebase_hash{$query} ) {
            ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query});
            @locations = match_positions($regexp, $dna);
            if (@locations) {
                print "Searching for $query $recognition_site $regexp\n";
                print "Restriction site for $query at :", join(" ", @locations), "\n";
            } else {
                print "A restriction site for $query is not in the DNA.\n";
            }
        }
    } until ( $query =~ /quit/ );
    exit;

Regular Expression
• ^ beginning of string
• $ end of string
• . any character except newline
• * match 0 or more times
• + match 1 or more times
• ? match 0 or 1 times
• | alternative
• ( ) grouping; "storing"
• [ ] set of characters
• { } repetition modifier
• \ quote or special

    $mystring = "[2004/04/13] The date of this article.";
    if($mystring =~ m/(\d)/) {
        print "The first digit is $1.";
    }

    if($mystring =~ m/(\d+)/) {
        print "The first number is $1.";
    }

    if($mystring =~ m/(\d+)\/(\d+)\/(\d+)/) {
        print "The date is $1-$2-$3";
    }

    # With /g in a while loop, each pass finds the next match
    while($mystring =~ m/(\d+)/g) {
        print "Found number $1.";
    }

    # With /g in list context, all captured matches are returned at once
    @myarray = ($mystring =~ m/(\d+)/g);
    print join(",", @myarray);

Download and install programs
• Unzip or untar
    • unzip
    • If file.tar.gz: tar xvfz file.tar.gz
• Go to the directory and run "./configure"
• Then run "make"

Exercise
• Download clustalw
• Try to install it

System subroutine
    system ("ls -ltr");

Exercise 2
• Use pro.fasta
• Find the alignment for each triple of proteins
• Let's design the program together
• Use "system" in Perl
    • system ("command parameters");

    # Read a FASTA file and return one string per record (header plus sequence lines)
    sub ReadFasta {
        my ($fname) = @_;
        open(FILE, $fname) or die "Cannot open $fname\n";
        my $data = "";
        my @dnas = ();
        while(my $line = <FILE>) {
            # A line starting with ">" begins a new record
            if ($line =~ /^>/) {
                if ($data ne "") {
                    push(@dnas, $data);
                }
                $data = "";
            }
            $data .= $line;
        }
        if ($data ne "") {
            push(@dnas, $data);
        }
        close FILE;
        return @dnas;
    }

    print "Please input file name:\n";
    my $fname = <STDIN>;
    chomp $fname;    # remove the trailing newline, otherwise open fails
    my @dnas = ReadFasta($fname);
    my $len = $#dnas + 1;
    # Write every triple (i < j < k) to its own file and align it
    for (my $i = 0; $i < $len; $i++) {
        for (my $j = $i+1; $j < $len; $j++) {
            for (my $k = $j+1; $k < $len; $k++) {
                $fname = "$i\_$j\_$k";
                print "$fname\n";
                open(OUT, ">$fname") or die "Cannot open $fname\n";
                print OUT $dnas[$i];
                print OUT $dnas[$j];
                print OUT $dnas[$k];
                close OUT;
                system ("./clustalw2 $fname");
            }
        }
    }

Working with Single DNA Sequences

Learning Objectives
• Discover how to manipulate your DNA sequence on a computer, analyze its composition, predict its restriction map, and amplify it with PCR
• Find out about gene-prediction methods, their potential, and their limitations
• Understand how genomes are sequenced and assembled

Outline
1. Cleaning your DNA of contaminants
2. Digesting your DNA in the computer
3. Finding protein-coding genes in your DNA sequence
4. Assembling a genome

Cleaning DNA Sequences
• In order to sequence genomes, DNA sequences are often cloned in a vector (plasmid, YAC, or cosmid)
• Sequences of the vector can be mixed with your DNA sequence
• Before working with your DNA sequence, you should always clean it with VecScreen

VecScreen
• http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
• Runs a special version of BLAST
• A system for quickly identifying segments of a nucleic acid sequence that may be of vector origin

What to do if hits are found
• If the hits are at the ends of the sequence, you can simply remove them
• If they are in the middle, or the vectors are not the ones you are using, the safest thing is to throw the sequence away

Computing a Restriction Map
• It is possible to cut DNA sequences using restriction enzymes
• Each type of restriction enzyme recognizes and cuts a different sequence:
    • EcoRI: GAATTC
    • BamHI: GGATCC
• There are more than 900 different restriction enzymes, each with a different specificity
• The restriction map is the list of all potential cleavage sites in a DNA molecule
• You can compile a restriction map with www.firstmarket.com/cutter
• Note: could not get this site to work. Alternative: http://biotools.umassmed.edu/tacg4

Making PCR with a Computer
• Polymerase Chain Reaction (PCR) is a method for amplifying DNA
• PCR is used for many applications, including
    • Gene cloning
    • Forensic analysis
    • Paternity tests
• PCR amplifies the DNA between two anchors
• These anchors are called the PCR primers

Designing PCR Primers
• PCR primers are typically 20 nucleotides long
• The primers must hybridize well with the DNA
• On biotools.umassmed.edu, find the best location for the primers:
    • Most stable
    • Longest extension

Analyzing DNA Composition
• DNA composition varies a lot
• The stability of a DNA sequence depends on its G+C content (total guanine and cytosine)
• High G+C content makes very stable DNA molecules
• Online resources are available to measure the GC content of your DNA sequence
• They can also count words and internal repeats
• http://helixweb.nih.gov/emboss/html/

Counting words
• ATGGCTGACT
• A, T, G, G, C, T, G, A, C, T
• AT, TG, GG, GC, CT, TG, GA, AC, CT
• ATG, TGG, GGC, GCT, CTG, TGA, GAC, ACT
• www.genomatix.de/cgi-bin/tools/tools.pl

EMBOSS servers
• European Molecular Biology Open Software Suite
• http://pro.genomics.purdue.edu/emboss/

ORF
• EMBOSS
• NCBI: ncbi.nlm.nih.gov/gorf/gorf.html

Internal repeats
• A word repeated in the sequence, long enough not to occur by chance
• Can be imperfect (regular expression)
• A dot plot is the best way to spot them
• arbl.cvmbs.colostate.edu/molkit

Predicting Genes
• The most important analysis carried out on DNA sequences is gene prediction
• Gene prediction requires different methods for eukaryotes and prokaryotes
• Most gene-prediction methods use hidden Markov models

Predicting Genes in Prokaryotic Genomes
• In prokaryotes, protein-coding genes are uninterrupted
    • No introns
• Predicting protein-coding genes in prokaryotes is considered a solved problem
• You can expect 99% accuracy

Finding Prokaryotic Genes with GeneMark
• GeneMark is the state of the art for microbial genomes
• GeneMark can
    • Find short proteins
    • Resolve overlapping genes
    • Identify the best start codon
• Use exon.gatech.edu/GeneMark
• Click the "heuristic models"

Predicting Eukaryotic Genes
• Eukaryotic genes (human, for example) are very hard to predict
• Precise and accurate eukaryotic gene prediction is still an open problem
• ENSEMBL contains 21,662 genes for the human genome
• There may well be more genes than that in the genome, as yet unpredicted
• You can expect 70% accuracy on the human genome with automatic methods

Finding Eukaryotic Genes with GenomeScan
• GenomeScan is the state of the art for eukaryotic genes
• GenomeScan works best with
    • Long exons
    • Genes with a low GC content
• It can incorporate experimental information
• Use genes.mit.edu/genomescan

Producing Genomic Data
• Until recently, sequencing an entire genome was very expensive and difficult
• Only major institutes could do it
• Today, scientists estimate that in 10 years it will cost about $1,000 to sequence a human genome
• With sequencing so cheap, assembling your own genomes is becoming an option
• How could you do it?

Sequencing and Assembling a Genome (I)
• To sequence a genome, the first task is to cut it into many small, overlapping pieces
• Then clone each piece

Sequencing and Assembling a Genome (II)
• Each piece must be sequenced
• Sequencing machines cannot do an entire sequence at once
• They can only produce short sequences smaller than 1 Kb
• These pieces are called reads
• It is necessary to assemble the reads into contigs

Sequencing and Assembling a Genome (III)
• The most popular program for assembling reads is PHRAP
• Available at www.phrap.org
• Other programs exist for joining smaller datasets
• For example, try CAP3 at pbil.univ-lyon1.fr/cap3.php
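
The assembly slides say that overlapping reads must be joined into contigs. The Perl sketch below illustrates only that core idea, for two reads with an exact overlap. It is a toy under stated assumptions and not how PHRAP or CAP3 work: the subroutine names overlap_length and merge_reads, the two reads, and the minimum overlap of 3 are all made up for illustration, and real assemblers also handle sequencing errors, reverse complements, and many reads at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: length of the longest suffix of $left that is also
# a prefix of $right, requiring at least $min_overlap bases (0 if none).
sub overlap_length {
    my ($left, $right, $min_overlap) = @_;
    my $max = length($left) < length($right) ? length($left) : length($right);
    # Try the longest possible overlap first, then shorter ones
    for (my $len = $max; $len >= $min_overlap && $len > 0; $len--) {
        return $len if substr($left, -$len) eq substr($right, 0, $len);
    }
    return 0;
}

# Hypothetical helper: join two reads into one contig if they overlap;
# returns undef when no overlap of the required length exists.
sub merge_reads {
    my ($left, $right, $min_overlap) = @_;
    my $len = overlap_length($left, $right, $min_overlap);
    return if $len == 0;
    # Keep $left whole and append only the non-overlapping part of $right
    return $left . substr($right, $len);
}

# Two made-up reads that share the bases TGACT
my @reads = ('ATGGCTGACT', 'TGACTTCGGA');
my $contig = merge_reads($reads[0], $reads[1], 3);
print defined $contig ? "Contig: $contig\n" : "No overlap found\n";
```

With these two reads the shared bases are TGACT, so the script prints "Contig: ATGGCTGACTTCGGA". Raising the minimum overlap makes chance joins less likely, at the cost of missing reads that overlap by only a few bases.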
