* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download JOIN2004 Universidade do Minho
Expanded genetic code wikipedia , lookup
Synthetic biology wikipedia , lookup
DNA sequencing wikipedia , lookup
Agarose gel electrophoresis wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Comparative genomic hybridization wikipedia , lookup
DNA barcoding wikipedia , lookup
Maurice Wilkins wikipedia , lookup
Bisulfite sequencing wikipedia , lookup
Gel electrophoresis of nucleic acids wikipedia , lookup
DNA vaccination wikipedia , lookup
Genetic code wikipedia , lookup
Molecular cloning wikipedia , lookup
Community fingerprinting wikipedia , lookup
Biosynthesis wikipedia , lookup
DNA supercoil wikipedia , lookup
Point mutation wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Non-coding DNA wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
5/5/2017 Why Life is Beautiful James Tisdall JOIN2004 Universidade do Minho 5/5/2017 James Tisdall (Jim) Biocomputing Associates PO Box 9965 Philadelphia PA 19118 USA (610)933-1200 http:/www.biocomputingassociates.com And DuPont Experimental Station [email protected] JOIN2004 Universidade do Minho Who am I? 5/5/2017 Musician Bell Laboratories, Information Principles Research Human Genome Project early Perl program DNA WorkBench Mercator Genetics, Computational Biologist Fox Chase Cancer Center, Bioinformatics Manager Biocomputing Associates, consulting DuPont Genetics Research Author of "Beginning Perl for Bioinformatics" and "Mastering Perl for Bioinformatics" from O'Reilly JOIN2004 Universidade do Minho 5/5/2017 " " Bioinformatics means using computers for biology research, and Perl is a popular computer language for bioinformatics programming JOIN2004 Universidade do Minho 5/5/2017 OUTLINE " Perl resources " Overview of basic Perl " The art of programming " Sequences and strings " Motifs and loops " Subroutines and bugs JOIN2004 Universidade do Minho OUTLINE (Continued) 5/5/2017 " Mutations and Randomization " The Genetic Code " Restriction Maps and Regular Expressions " GenBank " Protein Data Bank " BLAST " Further Topics JOIN2004 Universidade do Minho Advantages of Perl 5/5/2017 1. Ease of Programming • Simplify common programming tasks. A lot of free software for bioinformatics, web programming, … 2. Rapid Prototyping • Often less programming time than other languages 3. Portability • Most operating systems: Windows, Mac, Unix, etc… • Free • Run speed - suitable for most tasks JOIN2004 Universidade do Minho 5/5/2017 Biological Sequence Data Bases and amino acids are represented as letters in Perl, just as in biology literature: " DNA and RNA A C G T (or U) " Amino acids ACDEFGHIKLMNPQRSTVWXY JOIN2004 Universidade do Minho 5/5/2017 A Small Bioinformatics Program Sequences are represented as Strings # First we store the DNA in a variable called $DNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; # Next, we print the DNA onto the screen print $DNA; # Finally, we'll specifically tell the program to exit. exit; JOIN2004 Universidade do Minho 5/5/2017 Virtual Transcription: DNA to RNA $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; print "$DNA\n\n"; # Transcribe the DNA to RNA by substituting all T's with U's. $RNA = $DNA; $RNA =~ s/T/U/g; print "$RNA\n"; JOIN2004 Universidade do Minho 5/5/2017 Reverse Complement $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'; $revcom = reverse $DNA; # Next substitute all bases by their complements, # A->T, T->A, G->C, C->G, uppercase or lowercase $revcom =~ tr/ACGTacgt/TGCAtgca/; JOIN2004 Universidade do Minho 5/5/2017 Mutations and Randomization JOIN2004 Universidade do Minho 5/5/2017 Modeling mutations randomly " How to randomly select a position in a string " How to randomly select an index in an array " Randomly select a base in DNA " Mutate a base to some other random base " Generate random DNA data sets " Repeatedly mutate DNA JOIN2004 Universidade do Minho 5/5/2017 A program to simulate DNA mutation " Pseudocode: 1. Select a random position in the DNA 2. Choose a random nucleotide 3. Substitute the random nucleotide in the random position JOIN2004 Universidade do Minho 5/5/2017 A program to simulate DNA mutation 1. Select a random position in the DNA $position = int(rand(length($DNA))) 2. Choose a random nucleotide @nucleotides = ('A', 'C', 'G', 'T'); $new = $nucleotides[rand @nucleotides] 3. Substitute the random nucleotide in the random position substr($DNA, $position, 1, $new) JOIN2004 Universidade do Minho 5/5/2017 Generating random DNA sub make_random_DNA { # Collect arguments, declare variables my($length) = @_; my $dna; for (my $i=0 ; $i < $length ; ++$i) { $dna .= $nucleotides[rand @nucleotides]; } return $dna; } JOIN2004 Universidade do Minho 5/5/2017 The Genetic Code JOIN2004 Universidade do Minho 5/5/2017 The genetic code " DNA encodes amino acids – 3 bases of DNA = a codon – Each codon represents an amino acid or a "stop" – There are 64 codons, but 20 amino acids – Most amino acids are coded for by multiple codons, which makes the genetic code a redundant code JOIN2004 Universidade do Minho 5/5/2017 Translating codons to amino acids sub codon2aa { my($codon) = @_; $codon = uc $codon; my(%genetic_code) = ( 'TCA' => 'S', # Serine 'TCC' => 'S', # Serine 'TCG' => 'S', # Serine 'TCT' => 'S', # Serine 'TTC' => 'F', # Phenylalanine … et cetera … JOIN2004 Universidade do Minho 5/5/2017 Translating DNA into proteins sub dna2peptide { my($dna) = @_; # Initialize variables my $protein = ''; # Translate each three-base codon to an amino acid, and append to a protein for(my $i=0; $i < (length($dna) - 2) ; $i += 3) { $protein .= codon2aa( substr($dna,$i,3) ); } return $protein; } JOIN2004 Universidade do Minho 5/5/2017 Reading DNA from FASTA files " " FASTA format is the most common way that DNA is stored in files (but there are many others) Example: > sample dna (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc tgggacgccgccgagtggtctgtgcag JOIN2004 Universidade do Minho 5/5/2017 Reading DNA from FASTA files > sample dna (This is a typical fasta header.) agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc Tgggacgccgccgagtggtctgtgcag " Notice that first line starts with a > symbol, and contains annotation. " Each other line just contains sequence data. " Simple! JOIN2004 Universidade do Minho 5/5/2017 Reading FASTA files: core code From textbook, p. 170 foreach my $line (@fasta_file_data) { if($line =~ /^>/) {# discard fasta header line next; } else {# keep line, add to sequence string $sequence .= $line; } } # remove non-sequence data (in this case, whitespace) from $sequence string $sequence =~ s/\s//g; JOIN2004 Universidade do Minho 5/5/2017 Regular expressions Regular expressions are a language within the Perl language that describe many kinds of patterns in strings, e.g. motifs in DNA. Regular expressions permit you to find and alter many patterns with relative ease. The excellent regular expressions in Perl are a major reason for Perl's success as a bioinformatics programming language. JOIN2004 Universidade do Minho 5/5/2017 Restriction enzymes, sites, and maps " " " A restriction enzyme is a protein that cuts DNA at short, specific sequences EcoRI cuts at the restriction site GAATTC, between the G and A, on both complementary strands (the reverse complement of GAATTC is GAATTC), leaving "sticky ends" for insertion into a vector for cloning and sequencing. A restriction map is the list of locations where a given restriction site occurs in the sequence JOIN2004 Universidade do Minho 5/5/2017 Finding restriction sites while ( $sequence =~ /$regexp/ig ) { push ( @positions, pos($sequence) - length($&) + 1); } " while loop conditional is a pattern match " ig: i stands for "case insensitive" " g stands for "global", will keep searching entire string " pos is a built-in function that gives the ending position of the last successful match; subtract the length of the match ($&) to get to the beginning of the match; add 1 because biologists call the first base in DNA as position 1, while Perl calls the first letter in a string as position 0 JOIN2004 Universidade do Minho 5/5/2017 GenBank JOIN2004 Universidade do Minho 5/5/2017 GenBank " http://ncbi.nlm.nih.gov/Genbank JOIN2004 Universidade do Minho 5/5/2017 GenBank Libraries " Examples: PRI: primate sequences ROD: rodent sequences PHG: bacteriophage sequences GSS: genome survey sequences JOIN2004 Universidade do Minho 5/5/2017 Protein Data Bank JOIN2004 Universidade do Minho 5/5/2017 PDB - Protein Data Bank " The primary depository for 3D protein structures " http://www.rcsb.org/pdb/ JOIN2004 Universidade do Minho 5/5/2017 BLAST JOIN2004 Universidade do Minho 5/5/2017 BLAST " Basic Local Alignment Search Tool " Most popular bioinformatics program " Discovers sequence similarity between your sequence and large databases " Voluminous output " Need to process output by program JOIN2004 Universidade do Minho 5/5/2017 Similarity vs Homology String matching is computer science term for finding sequence similarity " Exact " Approximate : percent identity Homology " Sequences are or are not related evolutionarily " Yes or no; no percentages JOIN2004 Universidade do Minho 5/5/2017 Parsing BLAST output " " " " Many bioinformatics projects require frequent BLAST searches Need to automate collection and processing BLAST output does not label lines by their function, so use regular expressions Several BLAST parsers are available as Perl modules: see for example Bioperl modules Bio::Tools::BPlite and Bio::Tools::Blast::* JOIN2004 Universidade do Minho 5/5/2017 Bioinformatics Resources 1. Books • Bioinformatics: Sequence and Genome Analysis by Mount • Developing Bioinformatics Computer Skills – Gibas & Jambeck • Introduction to Computational Biology: Maps,Sequences and Genomes, Waterman • Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Baxecvanis & Ouellette, editors 2. Government/Public Information Resources • National Center for Biotechnology Information (NCBI) – U.S. Government Center – www.ncbi.nlm.nih.gov • European Molecular Biology Laboratory (EMBL). European Union Laboratory – www.embl.org • European Bioinformatics Institute (EBI)- From UK, part of EMBLwww.ebi.ac.uk • UCSC Genome Resource- serves constructed genome data – www.genome.ucsc.edu JOIN2004 Universidade do Minho 5/5/2017 Bioinformatics Resources (cont.) 3. Conferences • ISMB: Intellegent Systems for Molecular Biology, www.iscb.org/ismb2004/ • Bioinformatics Open Source Conference, www.bioinformatics.org • RECOMB: Conference on Computational Molecular Biology, recomb04.sdsc.edu/ JOIN2004 Universidade do Minho 5/5/2017 Molecular Biology 1. Helpful Reference Books for Programmers • Recombinant DNA, 2nd edition, Watson et al. • Molecular Biology of the Gene, Watson et al. • Molecular Cell Biology, Lodish, et al. JOIN2004 Universidade do Minho 5/5/2017 Approximate string matching As an example of advanced data structures, following is an algorithm that relies on a 2 dimensional array. It finds the best match for a (shorter) pattern in a (longer) text. It is one of several important dynamic programming algorithms in bioinformatics JOIN2004 Universidade do Minho 5/5/2017 Approximate string matching 1 of 4 my $pattern = 'EIQADEVRL'; print "PATTERN:\n$pattern\n"; my $text = 'SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE'; print "TEXT:\n$text\n"; my $TLEN = length $text; my $PLEN = length $pattern; # D is the Distance matrix, which shows the "edit distance" between # substrings of the pattern and the text. # It is implemented as a reference to an anonymous array. my $D = []; # The rows correspond to the text # Initialize row 0 of D. for (my $t=0; $t <= $TLEN ; ++$t) { $D->[$t][0] = 0; } JOIN2004 Universidade do Minho 5/5/2017 Approximate string matching 2 of 4 # Compute the edit distances. for (my $t=1; $t <= $TLEN ; ++$t) { for (my $p=1; $p <= $PLEN ; ++$p) { $D->[$t][$p] = # Choose whichever alternative has the least cost min3( # Text and pattern may or may not match at this character ... substr($text, $t-1, 1) eq substr($pattern, $p-1, 1) ? $D->[$t-1][$p-1] # If match, no increase in edit distance! : $D->[$t-1][$p-1] + 1, # If the text is missing a character $D->[$t-1][$p] + 1, # If the pattern is missing a character $D->[$t][$p-1] + 1 ) } } JOIN2004 Universidade do Minho 5/5/2017 Approximate string matching 3 of 4 # Print D, the resulting edit distance array for (my $p=0; $p <= $PLEN ; ++$p) { for (my $t=0; $t <= $TLEN ; ++$t) { print $D->[$t][$p], " "; } print "\n"; } # Find the best match(es). # The edit distances appear in the last row. my @matches = (); my $bestscore = 10000000; for (my $t=1 ; $t <= $TLEN ; ++$t) { if( $D->[$t][$PLEN] < $bestscore) { $bestscore = $D->[$t][$PLEN]; @matches = ($t); }elsif( $D->[$t][$PLEN] == $bestscore) { push(@matches, $t); } } JOIN2004 Universidade do Minho 5/5/2017 Approximate string matching 4 of 4 # Report the best match(es). print "\nThe best match for the pattern $pattern\n"; print "has an edit distance of $bestscore\n"; print "and appears in the text ending at location"; print "s" if ( @matches > 1); print " @matches\n"; JOIN2004 Universidade do Minho 5/5/2017 Approximate string matching: Output PATTERN: EIQADEVRL TEXT: SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE 000000000000000000000000000000000000 111111111111011111011101011111111110 222222222222101222112211112222212221 333323333332211233222222222223322332 444433444443322123333333333333433343 555543455554433222344444444443444434 666654456665444333234545455554455543 776765556776555444323455556665556654 887776566787666555433456665676666765 998787667788776666543456776677777776 The best match for the pattern EIQADEVRL has an edit distance of 3 and appears in the text ending at location 20 JOIN2004 Universidade do Minho 5/5/2017 Bioperl Bioperl is a collection of Perl modules for bioinformatics. It has been developed by a large group of volunteers. Bioperl has become a fairly stable platform with which to develop software; it is growing and changing rapidly, but its core is relatively mature. Bioperl is a must for anyone interested in writing bioinformatics software in Perl Don't reinvent the wheel! (Unless you like to, for the fun or interest of it.) JOIN2004 Universidade do Minho 5/5/2017 What can Bioperl do? Sequence manipulation (DNA, RNA, proteins) " " " " " " " Several kinds of sequence objects Sequence format conversion Translation Revcom Sequence statistics (mol. weights, etc.) Restriction enzyme maps Signal cleavage sites JOIN2004 Universidade do Minho 5/5/2017 What can Bioperl do? Sequence alignment " " " Interfaces with many alignment programs such as fasta, clustalw, gcg Extract subalignment, consensus string Alignment statistics JOIN2004 Universidade do Minho 5/5/2017 What can Bioperl do? Biological databases " " " " Retrieving Genbank, EMBL, acedb, etc. Retrieving from remote databases Indexing and retrieving from local databases Parsing program reports from Blast and variants, HMMer, genefinding programs JOIN2004 Universidade do Minho 5/5/2017 What can Bioperl do? Sequence annotation " " Adding annotation Parsing annotation from sequence databases JOIN2004 Universidade do Minho 5/5/2017 What can Bioperl do? Protein structure " " Read in PDB files Extract chains and their residues JOIN2004 Universidade do Minho