Download In-Lab Handout

BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Intro-bio 102 Lab #1 Perl Programming for Pattern Matching. There are 9 exercises in total: USE PERL PROGRAM regex.pl 1. Scrabble in English 2. “Scrabble” in DNA Sequences 3. Direct Repeats 4. Direct Repeats in DNA Sequences 5. Mirror Repeats also called palindromes in English (English Word Play) 6. Mirror Repeats in DNA Sequences USE PERL PROGRAM book_search.pl 7. Pattern Matching in Large Texts. USE PERL PROGRAM omic_search.pl 8. Pattern Matching for Protein Sequences 9. Pattern Matching for Protein Sequences in DNA There is more if you finish early and are still enjoying yourself! Note: If you run a Perl program and it runs on and on ... and on and on, you’ll need to HALT the program. Hold down the keys: Option-Apple-Escape, select the TextWrangler application, and click "Force Quit". In your lab notebook keep a record of the regex’s that you try. Keep track of whether or not they worked the way you expected them to. Write down your errors as diligently as your successes, these are what you will learn from. Whenever you see gray highlighted text, you should be writing in your lab notebook. At the end of lab, you will trade notebooks to learn from each other’s efforts. __ Logon to your computer using your own name and password __ Connect to the courses server and open the week1_Perl_lab folder in Behavior-Renn. __ Drag the folder regexPlay onto your desktop. __ Open this folder and you should see the following files: anagram.pl book_search.pl Perl program to use regex to find anagrams Perl program to use regex to return full paragraphs ecoli_K12_genes.fasta DNA coding sequence for predicted E.coli proteins in FASTA format ecoli_K12_proteins.fasta amino acid sequence for predicted E.coli proteins. ecoli.txt 7-mers found in E.coli genome english.txt English dictionary (list of words one per line) omic_search.pl Perl program to use regex to return genes with specified pattern origin.txt Full text of Darwin’s “Origin of species” regex.pl Perl program to use regex to find patterns Perl is already installed on the intro-bio lab computers for you. pg. 1 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ __ Open the application TextWranglerTM from the Dock (Gold “W”). __ From the FILE menu in TextWrangler, select Open (or use o) and open the Perl program: regex.pl from the folder on your desktop. The Perl program will appear in your programming environment. It will be color coded according to Perl syntax. The text that you will work with will be the color coded pink text under the words “ENTER HERE”. Notice the “#” sign is used to frame the code and include comments. __ You should also see a “Documents Drawer” on the right. If not, go to the View menu and select Show Documents Drawer. __ Before making any changes, make your own version of this script by saving it as regex_yourname.pl From the File menu select Save As, give the file its new name, check that you are saving it in the regexPlay folder on the desktop. __ If you do not see line numbers next to the code use the Edit menu to open Text Options and check the box under Display: for Line Numbers. This is the only section you will be changing: ###################################################################### # ==========ENTER HERE============ my $regex = 'q[û]'; my $filename # my $filename = = "english.txt"; "ecoli.txt"; ###################################################################### (note: The word “my” is Perl’s way of initializing variables. It is necessary here.) To write a new regex, you will change the pink text between the single quotes. Note: you must keep the single quotes! The initial regex supplied in this Perl program means: “match all words that contain a ‘q’ that is followed by a letter that is not ‘u’.” ____You must also choose the file to search: either the English dictionary (english.txt) or the list of 7-mers in E.coli (ecoli.txt). You will make this selection by putting a comment (#) symbol in front of the line that you do not want to use. The initial program is set to search the file "english.txt", thus there is a # to “comment out” (shut off) the "ecoli.txt” file. Note: After you change the regex expression, you must SAVE your changes to the file before you re-run the new program. Your program is NOT saved if it has a dark diamond next to the title in the Documents panel. You can save your Perl Program with S or using the pulldown menu from File. ____To run your program from TextWrangler use the #! Menu and select Run (this combination of symbols is call hashbang or she-bang). Your program will run, and a new file with your results will open. You will see a new document name added to the Documents panel called “Unix Script Output.” This file is not currently saved, and you do not need to save it. Subsequent runs pg. 2 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ will be added below, and it maybe a very long file by the end of the day. You should record all of your coding attempts in your lab notebook as you work. When your attempted code fails to give the expected results, try to figure out why and correct it. Record this learning process in your lab notebook. ___ Run your program now before you change anything in the code. How many words were there with a ‘q’ that is followed by a letter that is not ‘u’. (Write your findings in your lab notebook.). ____To Return to the script, click on the document icon to the left of your saved regex_yourname.pl in the document drawer, or use the back-arrow button Backarrow button Document Drawer pink regex text QuickTime™ and a TIFF (LZW) decompressor are needed to see this picture. Yellow highlighted text at cursor Open diamond = saved file Black diamond = unsaved file Bold text = working file OUTPUT of initial program: counts of words returned Work your way through the rest of the handout. Always start at the top of each left hand page. This is a demonstration using the English dictionary and English letters. Each page then contains a “going back for more” that you must solve, and record your solution in your lab notebook. You are asked to be creative using the regular expressions that you have learned on each page. Then, move to the facing page, the right hand page. This is a demonstration using DNA or Protein code rather than English. Again, there is an example followed by one or more assigned problems, and an opportunity to be creative. The answers are available from your instructor. Working with your fellow students is encouraged. Please ask for and offer help to each other before looking up the answer. pg. 3 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Scrabble in English Question: Are there any words in an English dictionary where the letter ‘q’ is not immediately followed by the letter ‘u’? Pattern to match expressed in English: Search for two adjacent characters: the letter ‘q’ immediately followed by any character that is not the letter ‘u’. […] [^…] {n} {n,} {x,y} […]{x,y} | match any one of the characters in the set match any one character that is not in the set match any of the characters in the set exactly n times match any of the characters in the set at least n times match at least x and not more than y times these can be combined match any of the characters in the set at least x and not more than y times a vertical bar separates alternatives this is OR in Boolean logic. regex = ‘q[û]’ (1) Iraqi (2) Iraqis (3) Qatar Note: In these facing page examples, we are assuming that the Perl program is instructing each regex to ignore the case of letters. You need not worry about upper vs. lower case in your regex. In the example above, “q” is really handling both upper and lower case ‘q’. This is part of the work that the Perl program is doing for you. After you play with regex, we can look at the code if you are interested. Going back for more: Select regex_yourname.pl in the Documents Drawer to return to your Perl script. Edit your new regex in pink text between the single quotes, and then save your changes before running the script from the hash-bang menu #! . If you forget to save, it will run the previous regexs. (1a) All words that contain the three-letter string “ghi”. Note: Do not confuse this with the request to find words with any one of the letters ‘g’, ‘h’, or ‘i’. [ghi] is not the regex you want. [ghi] finds words with one of the letters in the set. (1b) All words that contain “fin” or “phin”. (1c) All words with ‘yz’ that are not immediately followed by an ‘e’ or an ‘i’ . (1d) All words with.... (make up your own). pg. 4 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ “Scrabble” in DNA Sequences Our context for DNA sequences is a file with one DNA motif (or word) per line. The file is comprised of 16,384 7-mers, each line holding one DNA sequence or motif of length seven(7), where a motif is defined as “a pattern with putative biological meaning”. The 7mers were collected by us from NCBI’s publicly available DNA sequence for the E.coli bacteria. Change the file that will be searched by inserting and deleting the hash sign # as shown. # my $filename = "english.txt"; my $filename = "ecoli.txt"; Question: Are there any 7-mers in E. coli where the nucleotide ‘G’ (guanine) is not immediately followed by the nucleotide ‘T’ (thymine)? Search for two adjacent nucleotides: guanine ‘G’ followed by any nucleotide that is not thymine ‘T’. regex = ‘G[^T]’ TAAAGAA ACGTGCC GATATTT … 11267 total The program is checking All Motifs for a G followed by a letter that is not T TCAGTGT no GTTCACG no TAAAGAA yes AGTAGTG no CTTTTTT no ACGTGCC yes ACTCATT no etc Going back for more: (2a) All 7-mers that contain GTGAC. (2b) All 7-mers that contain the dimer GC, and this dimer is immediately followed by a letter that is not a G nor a C. (2c) All 7-mers where ... (write your own)? pg. 5 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Direct Repeats Question: Are there any words in which a sequence of three letters is repeated elsewhere in the word? Note: Don’t forget to switch back to the English text. Search for words where a sequence of three letters is repeated. This pattern might appear in the middle of a longer word. . (.) .* .+ .? \1 \2 \3 Match any character except a new line Match any character and remember it Match any character zero or more times Match any character one or more times Match any character zero or one time Recall the character from the 1st match Recall the character from the 2nd match Recall the character from the 3rd match regex = ‘(.)(.)(.).*\1\2\3’ (1) acclimatization (2) agglutinating (3) aggressiveness (4) Albuquerque (5) alfalfa (6) allegorically (7) amalgamate (8) amalgamated (9) amalgamates (10) amalgamating … 305 total Going back for more: (3a) Are there any words in which a sequence of two letters is directly repeated at least three times in the word? (3b) Are there any words ... (write your own)? (Note, “directly” refers to “in the same order”. It is different than “immediately” in that it allows for additional characters to come between the repeated motif.) (How would you change the regex to require an immediately repeated 3 letter motif?) pg. 6 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Direct Repeats in DNA Sequences Question: Are there any 7-mers in the DNA sequence of E.coli where a sequence of three nucleotides is repeated elsewhere in the motif? Search for motifs where a sequence of three nucleotides is repeated. regex = ‘(.)(.)(.).*\1\2\3’ 1) AAAAAAA (2) AAAAAAC (3) AAAAAAG (4) AAAAAAT (5) AAACAAA (6) AAACAAC (7) AAAGAAA (8) AAAGAAG (9) AAATAAA (10) AAATAAT … 700 total Going back for more: (4a) Are there any motifs in which a sequence of two nucleotides is repeated at least three times in the motif? (4b) Are there any motifs ... (write your own) pg. 7 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Mirror Repeats (also called palindromes in English) Question: Are there any words in which the first three letters of the beginning of a word are the same as the reverse of those letters at the end of the word? Search for words where the initial sequence of three letters ends with the reverse of those letters. regex = ‘^(.)(.)(.).*\3\2\1$’ (1) despised (2) detected (3) deteriorated (4) detested (5) foolproof (6) Greenberg (7) Hannah (8) redder (9) reviver (10) revolver (11) rotator remember: âb Match ‘ab’ at the beginning of a word xyz$ Match ‘xyz’ at the end of a word \2 Recall the character from the 2nd match Going back for more: (5a) Are there any words in which three consecutive letters anywhere in a word are followed by the reverse of those letters anywhere in the word? (Hint: you do not need ^ (start of word) and $ (end of word) for this). (Note: the following two examples combine both direct and mirror patterns) (5b) Are there any words in which a pair of letters is followed by a direct repeat of those pair of letters followed by two occurrences of the reverse of the pair of letters? Each of the pairs can be zero or more letters apart), e.g., “AB...AB...BA...BA” (5c) Are there any words in which this pattern occurs twice: a pair of letters is followed by its reverse of those pair of letters? (Each of the pairs can be zero or more letters apart). pg. 8 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Mirror Repeats in DNA Sequences Question: Are there any motifs in which the first three nucleotides of the beginning of a motif are the same as the reverse of those letters at the end of the motif? Search for motifs where the reverse of the initial three nucleotides is repeated at the end of the motif. regex = ‘^(.)(.)(.).*\3\2\1$’ 1) AAAAAAA (2) AAACAAA (3) AAAGAAA (4) AAATAAA (5) AACACAA (6) AACCCAA (7) AACGCAA (8) AACTCAA (9) AAGAGAA (10) AAGCGAA … 256 remember: ÂTG TAC$ (.) \2 Match ‘ATG’ at the beginning of a motif Match ‘TAC’ at the end of a motif Match any character and remember it Recall the character from the 2nd match Going back for more: (6a) Are there any motifs in which three consecutive nucleotides anywhere in a motif are followed by the reverse of those nucleotides anywhere in the motif? (Hint: you do not need ^ (start of motif) and $ (end of motif) for this). (Note: the following two examples combine both direct and mirror patterns) (6b) Are there any motifs in which a pair of adjacent nucleotides is followed by a direct repeat of those pair of nucleotides followed by two occurrences of the reverse of the pair of nucleotides? (Each of the pairs can be zero or more nucleotides apart), e.g., “GT...GT...TG...TG”. (6c) Are there any motifs in which this pattern occurs twice: a pair of adjacent nucleotides is followed by its reverse of those pair of nucleotides? (Each of the pairs can be zero or more nucleotides apart). pg. 9 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Pattern Matching in Large Texts. Perl can be a powerful tool for searching and manipulating larger text documents. We have downloaded the full text of The Origin of Species by means of Natural Selection, 6th Edition by Charles Darwin. You will write regex’s to find interesting words in this text. Choose words or concepts from Bob Kaplan’s lectures first semester. In TextWrangler close (w) the regex.pl script and open (o) book_search.pl This script will return the entire paragraph that contains your interesting word so you can learn how Charles Darwin used these terms. Search the text origin.txt You could search any other downloaded text by changing line #13 in the program. (my $filename = "origin.txt"; #enter the text file name here) (9a) How many times do you think Darwin used the word “evolution” in his text? Write your guess in your lab notebook. (9b) Write a regex to find the correct answer. This would be easy to do in a simple text program like Word using the Find tool, but now try to find any term related to evolution (evolve, evolution, evolved). (9c) Write a regex that will find any term related to evolution. (9d) How many are there? (9e) Look carefully. Did you get the words you weren’t expecting (“revolved” “revolution”)? Look back at your answer to 9b now also (9f) Write a regex that will avoid these words? (Hint \s means “white space” or \b means “word boundary”) Several special characters can be helpful regex tools in this type of search in Perl. \n new line character \t tab \s whitespace \S non-whitespace \d digit \D non-digit \b word boundary \B not word boundary If you are finding it difficult to find your word in the full printed paragraph, comment out line 52 and use line 53 by removing that comment mark. 52 53# print @paragraph; print "$match\n"; # and print the paragraph # will print only the line that matched pg. 10 of 14 BIO 102 Renn_Lab#1 (in-lab handout) Name ___________________ Pattern Matching for Protein Sequences Remember Janis Shampay’s lectures last semester? Proteins are a string of amino acids of which there are 20. Each amino acid has a three letter abbreviation but also a one letter abbreviation used for writing protein sequences. * G - Glycine (Gly) * P - Proline (Pro) * A - Alanine (Ala) * V - Valine (Val) * L - Leucine (Leu) * I - Isoleucine (Ile) * M - Methionine (Met) * C - Cysteine (Cys) * F - Phenylalanine (Phe) * Y - Tyrosine (Tyr) * W - Tryptophan (Trp) * H - Histidine (His) * K - Lysine (Lys) * R - Arginine (Arg) * Q - Glutamine (Gln) * N - Asparagine (Asn) * E - Glutamic Acid (Glu) * D - Aspartic Acid (Asp) * S - Serine (Ser) * T - Threonine (Thr) What amino acid sequence would be represented by the protein sequence GENE? In TextWrangler close (w) book_search.pl; open (o) omic_search.pl This program will return either DNA or protein sequence for genes. It is expecting a FASTA formatted file, rather than a book. Use the file E_coli_K12_proteins.fasta by commenting the appropriate line. If you want to see only one matching line of protein code, comment out line 43 and use 44. (10a) Write a regex to find how many proteins in E.coli include the protein code GENE. (10b) Write a regex to find how many proteins in E.coli include the amino acids DARWIN in any order. (10c) Write a regex to find proteins in E.coli include …. (make up your own). (10d) Is your name part of E.coli proteome? (If you don’t know what the word proteome means, look it up in Wikipedia quickly http://en.wikipedia.org/wiki/Proteome ). Regular expressions can be used to search for secondary structure. Different amino acids have different chemical characteristics. The secondary structure of a protein depends on the number and spacing of these different chemical characteristics. Different amino-acid sequences have different propensities for forming -helical structure. Methionine, alanine, leucine, glutamic acid, and lysine ("MALEK" in the amino-acid 1letter codes) all have high helix-forming propensities. Proline (“P”) tends to break or kink helices. However, proline is often seen as the first residue of a helix. (10e) Write a regex for an amino acid pattern that is likely to form a long (10 amino acid) alpha helix secondary structure. There are many web-based tools that search for complex protein or DNA patterns such as Neuropeptide cleavage predictors (see appendix for an example). These tools rely on programming in languages like Perl or Python. One advantage of learning to write your own programs is that you can search multiple sequences (or whole genomes as you have done today). You do not need an internet connection, you do not have to wait for results, and you can format the output to be compatible with your own files. You can adjust the parameters. Often it is possible to obtain the program code that is used by these web-based tools, and you can then adjust specific parameters to meet your own needs. pg. 11 of 14 BIO 102 Renn_Lab#1 (in-lab handout) ___________________ Name Pattern Matching for Protein Sequences in DNA Each amino acid is encoded by 3 letters in the genomic DNA (translated via mRNA). These three bases are called a “codon”. Remember that some of the amino acids can be encoded by more than one codon, so the code is called “degenerate”. Do you remember the Universal Genetic Code? Of course not, nobody does. Use the Perl script omic_search.pl Use the file E_coli_K12_genes.fasta by commenting out the appropriate file line. A regular expression for alanine is regex = ‘gc[acgt]’ A regular expression for Arginine is a bit tougher remember the vertical line symbol | means OR regex= ‘(cg[acgt])|(ag[ag])’ (11a) Write a regular expression that will search the genomic sequence for the possible amino acid sequence GENE. (11b) Do you find this sequence the same number of times in the DNA as you do when you search the Protein file for GENE? If not, what is the reason? (11c) Write a regular expression to find a Lysine rich region (3 or more in a row). Or make it harder. Find a region where every other amino acid is a Lysine, Think about how you could specify which amino acids are allowed to be interspersed with the Lysines. (11d) Why might a researcher be interested in looking for secondary structure in a given DNA sequence rather than looking directly at an amino acid sequence? (11e) Why might a researcher want to write their own program rather than using a web-based tool? pg. 12 of 14 BIO 102 Renn_Lab#1 (in-lab handout) ___________________ Name When you have finished __The answer key is available in lab if you wish to check your own work. __There is an additional exercise on finding anagrams if you are having fun, have finished early and want to try more difficult regular expressions. __If you want to save your output file, use the Save As command, give it a name and save it to your Home Server. This file contains only the output, not the regular expressions you used to get what you were looking for. Your lab notebook is the real record of your efforts. __Before you leave, find another student who has also finished the exercises. Trade notebooks with this student. Complete the post-lab evaluation form while you learn what creative regular expressions your classmate has attempted. __Take your own lab notebook home with you. __Leave your post-lab evaluation form with Carey or Ned. Appendix This lab is based on Exercise 3 from “Programming in PERL for Biology” a text by Mark LeBlanc and Betsy Dreyer from Wheaton College. The full text will be published later in 2007 and is recommended for any student wishing to go further with this work. Until that book is published, I recommend Beginning Perl for Bioinformatics from O'Reilly & Associates as it addresses the needs of biologists who want to learn Perl programming.       If you want to install Perl on your own computer; it comes installed with Mac OS X or it can be downloaded for free from http://www.perl.com/download.csp If you want to install it on your own computer, TextWrangerTM is available at http://www.barebones.com/products/textwrangler/download.shtml TextWrangler is a freeware programming environment provided by BBEdit. TextWranglerTM is a “lighter” version of BBEdit®, the industry standard Perl programming environment for MAC OS X. If you want to learn more Perl, there are several books available as well as several online courses. Perl is open source language and many scripts and modules can be freely found and downloaded from the web. One good online tutorial that I have seen is available at http://bioportal.weizmann.ac.il/course/prog/ The english.txt dictionary and 7mer text that you used in class can be downloaded from http://cs.wheatonma.edu/mleblanc/dna/ The full text of “On the Origin of Species” (and other works) can be downloaded for free from project Gutenberg at http://www.gutenberg.org/etext/2009 There are >20,000 free books for download from project Gutenberg. Project Gutenberg is the first and largest single collection of free electronic books, or eBooks. It was founded by Michael Hart, who invented eBooks in 1971. http://www.gutenberg.org/wiki/Main_Page Whole genomes, collections of genes, protein coding regions, upstream regulatory regions etc. can be downloaded from many sources. For this lab, the DNA coding sequence file and the protein file for E. coli strain K12 were downloaded from http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi The Institute for Genome Research, Comprehensive Microbial Resource. C elegans genomes can downloaded pg. 13 of 14 BIO 102 Renn_Lab#1 (in-lab handout) ___________________  Name from biomart http://www.biomart.org/index.html if you want to play with larger genomes. NeuroPred is a web application designed to predict cleavage sites at basic amino acid locations in neuropeptide precursor sequences. The user can study one amino acid sequence or multiple sequences simultaneously, selecting from several prediction models and optional, user-defined functions. http://neuroproteomics.scs.uiuc.edu/cgibin/neuropred.py http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1538825 FASTA format In bioinformatics, FASTA format is a text-based format for representing either nucleic acid sequences or protein sequences, in which base pairs or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Perl. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greaterthan (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence. pg. 14 of 14

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download In-Lab Handout