Biology of Seed Plants Presents
Making a Phylogenetic Tree

Contents
• Principles behind phylogenetic trees
• How to find the sequences to make the tree
• Doing it (with your own hands)

Goals
• You'll know what goes into making a phylogenetic tree
• You'll appreciate that the concepts aren't very difficult, but there are an awful lot of picky things to do.

This demonstration is best viewed as a slide show, which lets you simulate a session and makes changes in cursor position more obvious. To do this, click Slide Show on the top tool bar, then View Show. Click anywhere to go on to the next slide.

In mosses, the gametophyte form is dominant. In angiosperms, the sporophyte form is dominant.

[Figure: the moss Physcomitrella patens (University of Leeds) and the angiosperm Arabidopsis thaliana (Universität Karlsruhe).]

Menand et al (2007) found that a transcription factor governing genes involved in sporophyte development in an angiosperm is found also in moss.

Their finding that factors of the AtRHD6 family are also present in the moss Physcomitrella patens can be explained in at least two ways.

1. Those in moss and angiosperms evolved from a common ancestor, but the specialization of the factors arose after angiosperms diverged from bryophytes. If this were the case, you'd expect to see a phylogenetic tree like the one in the accompanying figure, OR…

[Figure: phylogenetic tree of transcriptional factors in Arabidopsis and moss related to the AtRHD6 family.]

2. Specialization of the factors arose before divergence. If this were the case, you'd expect to see a phylogenetic tree with the factors from the two plants mixed together.

…and that's exactly what you DO see for factors in the AtRHD6 clade! This family evidently evolved before Arabidopsis and moss existed as distinct plants.
If so much rests on the structure of this phylogenetic tree, then we need to understand what trees mean and how they are made (two sides of the same issue).

To illustrate the process by which phylogenetic trees are constructed, consider the following list of words:

English   Mother   Father   Red       Foot    Salt   Young
Dutch     Moeder   Vader    Rood      Voet    Zout   Jong
German    Mutter   Vater    Rot       Fuss    Salz   Jung
Swedish   Fostra   Fader    Rött      Foten   Salt   Barn
French    Mère     Père     Rouge     Pied    Sel    Jeune
Spanish   Madre    Padre    Rojo      Pie     Sal    Joven
Italian   Madre    Padre    Rosso     Piede   Sale   Giovane
Russian   Mat      Otetz    Kracnii   Noga    Sol    Molodoye

Clearly, these words (and many more) are connected across several European languages, sometimes closely, sometimes more distantly. The relationships can be quantified and expressed as a tree.

[Figure: tree relating French, Spanish, Italian, German, Dutch, English, Swedish, and Russian.]

The tree, showing the divergence of human languages over the past few thousand years, is based on an analysis of mutations that have taken place in individual words. Analysis is not sufficient, however.
There is a prior step… (The language tree is adapted from Gray & Atkinson (2003) Nature 426:435-439; its axis is in years.)

English   Red      Mother   Salt       Foot     Father   Young
Dutch     Voet     Moeder   Zout       Jong     Rood     Vader
German    Salz     Jung     Rot        Vater    Mutter   Fuss
Swedish   Salt     Fostra   Fader      Rött     Foten    Barn
French    Pied     Sel      Rouge      Jeune    Père     Mère
Spanish   Joven    Rojo     Madre      Pie      Sal      Padre
Italian   Piede    Padre    Madre      Sale     Giovane  Rosso
Russian   Otetz    Sol      Molodoye   Noga     Mat      Kracnii

It would not be possible to analyze the mutations if the words with the same meanings had not first been aligned with each other.

Alignment is equally important in building phylogenetic trees from protein and DNA sequences. Note that when these sequences are aligned properly, similarities stand out, and so do regions that are similar only in certain groups.

The tree may be built by counting how many differences there are between pairs of sequences.

Back to the key figure from Menand et al (2007)… One excellent way of understanding a figure is to construct it yourself from the raw data. How could you construct this tree? Recall the steps:

1. Alignment of sequences
2. Analysis of alignments to produce the tree

Actually, we've glided over a step necessary before we can even begin:

0. GET THE SEQUENCES!

How can we get them? This is often the most difficult of all the steps.

Obtaining the Sequences for the Tree

In any research article (if the authors have done their jobs), we should be able to find the source of the key material that underlies the results, usually in the Materials and Methods section or in the appropriate figure legend.

OK, we turn then to Menand et al (2007). If you don't have a copy handy, go back to the main page for this investigation and click on the link to Science… let's do that together (you do it, too). We want to search for Menand et al (2007), Science 316:1477-1480.
You'd think Science would make this sort of thing easy, but the fastest way of doing it requires a click to the Advanced Search section. Type in the volume and page numbers for Menand et al (2007), Science 316:1477-1480 (volume 316, first page 1477). That makes the search absolutely unique. Then click the Search button.

They give us four choices:
• Abstract is useful for a brief picture of the article (but we're well beyond that)
• Full Text gives the article as a web page, good for clicking on references
• PDF gives the article as it looks on the printed page. I choose this.
• Supporting Online Material… we'll see about that.

Once the PDF file loads, I go straight to Fig. 3. What (if anything) does it say about where the sequences come from? I see… Where's "fig. S3"? I can't find it anywhere. I also note that Science articles don't have Materials and Methods sections!!! I can think of no reason for presenting a research article without a Materials and Methods section, at least no reason I can repeat on a family web page.

In desperation I try a general search using Acrobat's search function. After some searching, I find something potentially interesting. I click on it. That seems to be the winner, at the very end of the article! Evidently the source of the sequences is in Fig. S3, and Fig. S3 is hidden in another web site! Nothing to do but to go there.

Now I understand the import of the Supporting Material link at the page that got me to this article. I go back, click on it, and download the supporting material. I scroll down the page to Fig. S3. This is an alignment all right, as advertised. But where did these sequences come from? The legend for this figure is no help. Fortunately, during the scrolling, I ran across the missing Materials and Methods section, including a paragraph that seems to meet at least some of my needs. I go back a few pages. Finally!
I discover that at least some of the sequences (those that begin PpRSL…, the ones from moss) are in GenBank, and the authors provide GenBank accession numbers. I can get what I need from there.

What exactly does GenBank have? Let's go together to NCBI (where GenBank lives), using the link on the main page. NCBI houses GenBank and much other information. Type in the first GenBank accession number, EF156393, and click the Go button. Since the GenBank accession number is highly specific, we get back only one hit, to a nucleotide record. Click to get to the record. "Physcomitrella patens…" Yup, that's what we want. Click on the link.

This is what GenBank knows about PpRSL1. Right organism. Right authors. What about the sequence? Scroll down. It shows me both the amino acid sequence… and the DNA sequence.

Do I have to look up each sequence separately, then copy and paste, then somehow get rid of the numbers and spaces, and then somehow do an alignment? Fortunately, all of this can be automated.

Making a tree: Doing it

Our plan:
• Obtain moss sequences from GenBank
• Somehow obtain Arabidopsis sequences (discussed later)
• Align the sequences
• Use the aligned sequences to make a phylogenetic tree

The execution of this plan will be greatly facilitated by a web site, an instance of BioBIKE, that makes it possible for those without programming experience to do creative programming. BIKEs are Biological Integrated Knowledge Environments. They come stocked with knowledge specific to a group of organisms. We won't be using that organism-specific knowledge, so any of the servers should do. Try clicking CyanoBIKE.

Enter anything you like as a login name, but no spaces or symbols. No password is necessary. Click New Login.

The BioBIKE environment is divided into three areas: the function palette, the workspace, and the results window.
You'll bring functions down from the function palette to the workspace, execute them, and note the results in the results window.

Two very important buttons on the function palette:
• HELP! On-line help (general)
• PROBLEM Something went wrong? Tell us!

Our Plan

We want to define a set of sequences, each element being one of the eight moss sequences with GenBank accession numbers given in the Supplemental Material. We'll define each sequence separately, then combine them into a set.

Click the DEFINITION button on the function palette. Clicking on any palette button brings down a function or data into the workspace. Click DEFINE. A DEFINE box is now in the workspace. Before continuing with the problem, let's consider what function boxes mean.

General Syntax of BioBIKE

[Diagram: a function box with its labeled parts: Function-name, Argument (object), Keyword with its object, and Flag.]

The basic unit of BioBIKE is the function box. It consists of the name of a function, perhaps one or more required arguments, and optional keywords and flags. A function may be thought of as a black box: you feed it information, it produces a product.

A function you're already familiar with is the SIN function. You feed it an angle, it produces the sine of the angle. In BioBIKE, you provide information by clicking on a gray input box to open it up for entry.

A box that is white and outlined in red is open for entry. Type into it an appropriate value (say, 30 for the angle), then close it by pressing Enter or Tab. All input boxes must be closed before a function may be executed. If you leave a white input box open while trying to execute a function, you'll get an error!

Function boxes contain the following elements:
• Function-name (e.g. SEQUENCE-OF or LENGTH-OF)
• Argument: Required, acted on by function
• Keyword clause: Optional, more information
• Flag: Optional, more (yes/no) information

…and icons to help you work with functions:
• Option icon: Brings up a menu of keywords and flags
• Action icon: Brings up a menu enabling you to execute a function, copy and paste, get information, get help, etc.
• Clear/Delete icon: Removes information you entered or removes the box entirely

Back to our story… we were defining the set of moss sequences. The DEFINE function has two arguments: the name of the variable being defined and its value. Click on the argument marked variable to provide the name. In the white, open input box, type in the name of the first sequence on our list, pprsl1 (upper/lower case doesn't matter). Then press Tab or Enter. If you pressed Tab, the first input box will close (turn gray) and the next box will automatically open. If you pressed Enter, you'll have to open the next box yourself by clicking on it.

You'll define pprsl1 as the sequence from GenBank. We could copy and paste the actual sequence here, but it's much easier to ask a function to go to GenBank and do that. Mouse over to the GENES/PROTEIN menu to get the function. Click on SEQUENCE-OF to bring it down into the open input box.

We need to tell SEQUENCE-OF two things:
(a) We want to get the sequence from GenBank.
(b) The accession number by which to identify it.

Start with the first. Mouse over the Options Arrow. Click FROM-GENBANK to add that flag to the SEQUENCE-OF function. Open the input box of the function and type in the accession number (in quotes), "EF156393", which you get from the Supplemental Material. Press Tab or Enter.

One more thing… When we make the alignment, we want each sequence labeled by its name. Specify now that the name is to be associated with the sequence. That's an option given by the DEFINE function.
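If you're curious what "go to GenBank and get the sequence" might look like outside BioBIKE, here is a minimal Python sketch. It is not how SEQUENCE-OF works internally (that isn't described here); it simply assumes NCBI's public E-utilities EFetch service and the FASTA text format. The function names efetch_url, parse_fasta, and fetch_sequence are my own, made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's EFetch service (part of the public E-utilities).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore"):
    """Build the address that asks for one nucleotide record as FASTA text."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def parse_fasta(text):
    """Turn FASTA text into {name: sequence}, dropping line breaks."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after ">"; the rest is description.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def fetch_sequence(accession):
    """Download one record (needs a network connection) and return its sequence."""
    with urlopen(efetch_url(accession)) as response:
        text = response.read().decode()
    return next(iter(parse_fasta(text).values()))


if __name__ == "__main__":
    # Only prints the address; call fetch_sequence("EF156393") to download.
    print(efetch_url("EF156393"))
```

Keeping the parsing separate from the download means you can check parse_fasta on a pasted record without touching the network.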
The function is now complete, so mouse over the green Action Icon and click Execute. Notice that a sequence appears in the Result Window. Is it the right one? Compare it with what you saw in GenBank.

You've defined one of the eight moss sequences. It's now an easy matter to modify the function to get the rest. Reopen the variable box in the DEFINE function, and change pprsl1 to pprsl2. Press Tab or Enter. Then reopen the entity box in SEQUENCE-OF and change the accession number. Execute, and repeat this to get all eight sequences. If all went well, you should be able to mouse over the VARIABLES button and see all eight sequences you created.

We're ready to define a set consisting of all eight moss sequences. Go back to the DEFINITION menu and bring down a new copy of DEFINE. Open the variable input box and type in the name of the set. You can call it anything you like (anything with no spaces). I used all-sequences. Perhaps allmoss-sequences would have been more appropriate.

The set will consist of a list of sequences, so after opening the value box of DEFINE (causing it to turn white and get a red outline), mouse over the LISTS/TABLES button and click LIST. The list will have eight elements (count them!), so we need to add seven additional holes, an action accessible from the Options Menu. Repeat until you have all eight holes. Now populate the holes, selecting each one in turn and then going to the VARIABLES menu to select one of the eight sequences. When the DEFINE function is complete, execute it.

Finally we're ready to align the sequences. You can find the ALIGNMENT-OF function by mousing over the STRINGS/SEQUENCES menu and then the BIOINFORMATIC-TOOLS submenu. Click on ALIGNMENT-OF. Open the argument box of ALIGNMENT-OF and put the set all-sequences in it. The function is complete, so execute it.

The alignment pops up on the screen (attend to Firefox's popup blocker to allow popup windows)… but there's only one sequence! Why?
Scrolling down… finally PpRSL2 shows up, after 1700+ nucleotides of PpRSL1. Scrolling down more… there are the rest. Scrolling to near the end… most end together but a few go on.

This won't do. We need all the sequences to be the same size, otherwise the tree-making program will get confused. We'll need to truncate the long sequences.

Plan: Define new sequences that are truncated versions of the originals.

Bring down a new DEFINE function. First we'll define a modified pprsl1; I'll call it pprsl1x. It will consist of part of the sequence of pprsl1. Select the value box and bring down SEQUENCE-OF. This time we don't want the entire sequence, so mouse over the green Options Arrow and select FROM. From where? Recall the alignment. The coordinates on the left allow us to count over to determine the exact nucleotide at which we want to start the modified sequence. I count 1930.

Now insert a TO keyword from the Options Menu. Count over to reach the end of pprsl1. I count 2130. This sequence needs to be labeled like all the rest.

Now do the same with the other sequence (PpRSL2) that needs truncation (don't worry about the one extra nucleotide of PpIND1). Be sure to get the coordinates right and to Execute.

Now we need to update the definition of the set all-sequences. Click each box that needs modifying, close each, and execute the function again. Then execute ALIGNMENT-OF again.

Much nicer, but it doesn't look anything like the alignment in the Supplemental Material, Fig. S3. That alignment was of amino acids. This one is of nucleotides. Of course, amino acid sequences are translations of nucleotide sequences. It isn't necessary to define translated sequences one by one; it's much easier to translate the entire set all at once. To do this, surround the set with a TRANSLATION-OF function. TRANSLATION-OF may be found on the GENES-SEQUENCES menu, TRANSLATION submenu. Execute the alignment again.

Now the alignment looks right!
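For anyone wondering what FROM, TO, and TRANSLATION-OF amount to, here is a minimal Python sketch of the two operations. It rests on assumptions of mine: that the coordinates are counted from 1 and include both ends, and that the standard genetic code applies. The names subsequence and translate are invented for this illustration and are not BioBIKE functions.

```python
# Standard genetic code, written in the conventional T, C, A, G order:
# the first 16 letters cover codons starting with T, the next 16 with C, etc.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def subsequence(sequence, start, end):
    """Return positions start..end, counted from 1 and including both ends."""
    return sequence[start - 1:end]


def translate(dna):
    """Translate DNA three letters at a time; '*' marks a stop codon.

    Codons containing anything other than A, C, G, T become 'X', and a
    trailing partial codon is ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(subsequence("ACGTACGT", 3, 6))   # GTAC
    print(translate("ATGGGTTCA"))          # MGS
```

With these conventions, a stretch running from 1930 to 2130 holds 201 nucleotides, which is exactly 67 codons, so nothing is left over when it is translated.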
(…except the sequences from Arabidopsis are still missing.) But before moving on, let's compare the two alignments we've gotten.

Comparison of nucleotide and protein alignments

[Figure: the nucleotide alignment above the protein alignment.]

Note that there is a good deal more similarity amongst the first seven sequences as judged by the protein alignment. Why is that? What's the relationship between the nucleotide sequence and the protein sequence?

Take the first three nucleotides as an example. They're uniformly GGG or GGT. And the next three are either AGT or TCA. But there is no variation in the amino acid sequence at the first two positions. Why? Why are protein sequences preferred to DNA sequences to compare genes of different organisms?

Our goal was to make a phylogenetic tree, so let's do it. The TREE-OF function is found in the STRINGS/SEQUENCES menu and the PHYLOGENETIC-TREE submenu. (There are a lot of ways to make a tree! We'll use a simple approach.)

The TREE-OF function asks for two things: the alignment (which we now have) and some name of our choice to give the project. We can get the alignment by copying and pasting what we already did. Mouse over the Action Icon of ALIGNMENT-OF. Select Copy to copy the function and all of its contents. Paste ALIGNMENT-OF into the alignment input box of TREE-OF.

Now we need to name the project. Two warnings:
1. The name must be in quotes (I chose "moss-tree")
2. You can't reuse a previously used name

Of the many ways to make a tree, I choose parsimony, because (delving deep into the paper) that's what Menand et al used. Finally, the tree program will get confused by the line of asterisks, so I get rid of that line.

Execute, and… if all went well, you should have yet another alignment, AND A TREE! Better, the tree looks remarkably like Fig. 3, except that it has no Arabidopsis sequences. We'll finally deal with that.
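What does choosing parsimony mean in practice? Roughly: among candidate trees, prefer the one that needs the fewest changes to explain the aligned sequences. The Python sketch below scores a given tree by Fitch's counting method. It is only an illustration, not the program behind TREE-OF (which isn't described here). The four short sequences a, b, c, d are made up, trees are written as nested pairs, and the names site_cost and parsimony_score are my own.

```python
def site_cost(tree, column):
    """Fitch's method for one alignment column.

    Returns (set of possible states at this node, changes needed below it).
    A tree is either a leaf name (a string) or a pair (left, right).
    """
    if isinstance(tree, str):                 # a leaf: its state is known
        return {column[tree]}, 0
    left_states, left_cost = site_cost(tree[0], column)
    right_states, right_cost = site_cost(tree[1], column)
    shared = left_states & right_states
    if shared:                                # children can agree: no new change
        return shared, left_cost + right_cost
    # Children cannot agree: keep every option and count one change.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, aligned):
    """Total number of changes the tree needs over all alignment columns."""
    length = len(next(iter(aligned.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in aligned.items()}
        total += site_cost(tree, column)[1]
    return total


if __name__ == "__main__":
    # Made-up, already-aligned toy sequences (equal length, no gaps).
    toy = {"a": "GSKA", "b": "GSKA", "c": "GTRA", "d": "GTRA"}
    tree_1 = (("a", "b"), ("c", "d"))
    tree_2 = (("a", "c"), ("b", "d"))
    print(parsimony_score(tree_1, toy), parsimony_score(tree_2, toy))  # 2 4
```

Here the first tree needs 2 changes and the second needs 4, so parsimony prefers the first. A real program also has to search among many possible trees, which this sketch leaves out.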
We'll define the Arabidopsis sequences much as we did the moss sequences (even reusing the same DEFINE box). Give the first sequence the name given in Fig. 3. But where do we get the actual sequence??? First step is to delete the old value, using the red X delete button.

Now to the sequence. We turn to the time-honored (but universally vilified) practice of screenscraping. This is the last resort, done only when an article forgets to tell you where the authors got the sequence. Highlight the AtRHD6 sequence, and (making sure that the Acrobat Select Tool is clicked) copy it to the clipboard. Now select the value box in the DEFINE function, type in "", paste the sequence between the two quotation marks, press Tab or Enter, and execute the completed function.

You can repeat this operation for the remaining six sequences from Arabidopsis, being sure to change in each case both the name of the sequence and the sequence itself.

Now that you have all the Arabidopsis sequences, join them into a set as you did with the moss sequences, called perhaps arabidopsis-set, or whatever. Do this by replacing all the moss sequences with Arabidopsis sequences and deleting the box you don't need. The Arabidopsis sequences should now be in the VARIABLES list. To delete the unwanted box, click the red X delete/clear button, first to clear and then to delete. Finally, change the name of the set and execute the DEFINE function.

Now we can join the moss set and Arabidopsis set into a single set, from which you'll then derive a tree. Bring down a new DEFINE function, give the set a name, select the value box and bring down the JOIN function, found in the LISTS/TABLES menu, LIST-PRODUCTION submenu. Be sure you join the right things: the translated moss sequences (because the originals are DNA sequences) and the unmodified Arabidopsis set (because the originals are protein sequences).

Now to make the final phylogenetic tree.
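Screenscraped text tends to arrive with stray spaces, line breaks, and position numbers. Here is a small Python sketch of tidying a pasted sequence and then joining two sets into one. The helper names clean_sequence and join_sets are made up for this example, and the pasted text is invented; neither is taken from BioBIKE or from the article.

```python
import re


def clean_sequence(pasted):
    """Keep only the letters of a pasted sequence and make them uppercase."""
    return re.sub(r"[^A-Za-z]", "", pasted).upper()


def join_sets(*sets):
    """Join several lists of sequences into one list, keeping their order."""
    joined = []
    for one_set in sets:
        joined.extend(one_set)
    return joined


if __name__ == "__main__":
    # Invented example of what a copy-and-paste from a PDF can look like.
    pasted = "  1 mktl vqa\n 11 GGH\n"
    print(clean_sequence(pasted))            # MKTLVQAGGH
    print(join_sets(["MGS"], ["MKTLVQAGGH"]))
```

Note that clean_sequence throws away everything that isn't a letter, including gap characters and asterisks, so it suits raw sequences rather than finished alignments.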
Click the red clear button to erase the previous sequences that were aligned, and replace them with the set you just defined. Once that's done, there's nothing left but to execute and enjoy!

Congratulations!