Biology 164 Laboratory
Introduction to Bioinformatics and Molecular Genetics
(Based on a lab exercise developed by Henrik Kibak, 2004)

Skills developed in this lab
• Use of National Center for Biotechnology Information (NCBI) databases
• Retrieval of sequences from NCBI
• Alignment of homologous sequences using Clustalx
• Using Clustalx output to prepare phylogenies
• Testing evolutionary hypotheses

Overview
In the Protein Fingerprinting exercise you performed in lab a few weeks ago, we used molecular genetic techniques to develop a phylogeny of related vertebrate species. In that exercise, species were distinguished using the electrophoretic mobilities of their various muscle proteins. In today’s exercise we will distinguish different species by comparing the amino acid sequence of a specific mitochondrial protein. In a subsequent exercise you will distinguish different species by comparing the nucleotide sequence of a particular gene.

Bioinformatics is emerging as a hugely important field affecting all areas of biology. While bioinformatics is formally the application of computer technologies to biological sciences - ranging from automated analysis of microarrays containing thousands of individual experiments to the development of browser tools for looking at whole genomes - students in all areas of biology need to be familiar with software tools developed by bioinformaticians to accomplish routine tasks in biology.

As a demonstration exercise, we will look at the taxonomic position of Euglena using the sequence of amino acids for the mitochondrial protein, Cytochrome C.
After that exercise you will have the skills necessary to answer the question: “Did Darwin’s finches arise from an ancestral species that migrated to the Galapagos Islands directly from the distant mainland, or did they arise from a species inhabiting a more closely located island?” You will answer that question during next week's lab.

Important note: for all of the web-based activities, use the Safari web browser for accessing websites and downloading files.

Taxonomic Position of Euglena
Algae are protists with chloroplasts. However, Euglena is a protist genus in which some species have chloroplasts and others don't. So, is Euglena more closely related to animals or to plants? To answer that question we will use the resources of the National Center for Biotechnology Information (NCBI). First, we will download protein sequence data for Euglena and a small number of plants and animals. We will then use the sequence alignment software, Clustalx, to analyze the similarities and differences among the sequences. Based on this analysis, Clustalx will compute the genetic distances among our group of organisms. Finally, we will plug the genetic distances into the phylogenetic tree-building software, N-J Plot, to develop a phylogeny of the organisms.

Log onto the NCBI homepage using the bookmark established in Safari.

It is impossible to provide a reasonable guide to even a small section of this tremendous resource... you will have to explore it yourself. For example, search “All Databases” for “Alces”. This will bring you to the Entrez search engine, which lets you navigate through the data resources of NCBI. As you can see, there is a vast amount of information cataloged for Alces, a very popular Maine genus.

Try clicking on “PubMed: biomedical literature citations and abstracts”. You will see a diverse array of literature citations relating to the genus Alces. Go back to the “Entrez” page.
Note that in addition to literature searching, Entrez allows you to search a variety of genetic data resources such as nucleotide and protein sequences. To see what is available for Euglena, let's enter that instead of Alces. Go ahead and refine the search a bit by clicking "Protein" and adding the search modifier for "organism" like this:

Euglena [orgn]

That should reduce the number of hits a bit. Adding "cytochrome c" with quotes like this should help a lot:

Euglena [orgn] "cytochrome c"

Finally, if you add the search modifier for "protein" like this:

Euglena [orgn] "cytochrome c" [prot]

...it should knock it down to two hits that include the Cytochrome C sequences for Euglena viridis and Euglena gracilis that were obtained many years ago by direct protein sequencing.

Click on the first accession number for the Euglena viridis sequence. This will bring you to a reference page that documents background information on the origin of the sequence, principal investigators, journal references, etc.

In the window next to the “Display” button, select “FASTA” and then click “Display.” This will bring you to the amino acid sequence of the Euglena viridis Cytochrome C protein in FASTA format. The FASTA format is the primary format for sequence data that is recognized by bioinformatics software.

In the window next to the “Send” button, select “this page to text” and then click “Send”.

Create a New Folder on the desktop named “Seqs” to store your sequence data, and save the Euglena viridis sequence to that folder as a text file called "Cyt_c_Eug_vir.txt".

The letters in the sequence data correspond to the specific amino acids that comprise the Cytochrome C protein. The following single letter code is used to designate the various amino acids.

Return to the Protein search page by backclicking, erase your previous search terms, and try typing in "Cytochrome C" in quotes...
what results do you get when you search? You should see "Page 1" of at least "2,567 pages" of results! A bit more than Alces... To refine the search, try adding [prot] after the "Cytochrome C" - that should get it down to only 23 pages of results (!). Finally, try adding "mammalia" to the search terms as in the example below:

What do you see? You should see that the results have been narrowed to 47 items (2005) on 3 pages. Click on P0007. What organism was this sequence obtained from? If you wanted to download this sequence, you would follow the same steps outlined above for Euglena viridis.

In order to save time I have downloaded five homologous Cyt C sequences from five different organisms for us to use in this exercise. The five sequences have been saved in a file called “all_five.txt”.

Make a copy of this file using the “Duplicate” command of the “File” menu of the Finder, and place the copy in the folder you created named “Seqs”.

The Cytochrome C sequences we will use (in FASTA format):

>Arabidopsis gi|4539007 Cytochrome c [Arabidopsis thaliana]
MASFDEAPPGNPKAGEKIFRTKCAQCHTVEKGAGHKQGPNLNGLFGRQSGTTPGYSYSAA
NKSMAVNWEEKTLYDYLLNPKKYIPGTKMVFPGLKKPQDRADLIAYLKEGTA
>Euglena GI|117985:1-102 Cytochrome c [Euglena viridis]
GDAERGKKLFESRAGQCHSSQKGVNSTGPALYGVYGRTSGTVPGYAYSNANKNAAIVWED
ESLNKFLENPKKYVPGTKMAFAGIKAKKDRLDIIAYMKTLKD
>Hippo gi|65451 Cytochrome c [Hippopotamus amphibius]
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQSPGFSYTDANKNKGITWG
EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKQATNE
>Mosquito gi|31202411|ref|XP_310154.1| [Anopheles gambiae]
MGVPAGDVEKGKKLFVQRCAQCHTVEAGGKHKVGPNLHGLFGRKTGQAAGFSYTDANKAK
GITWNEDTLFEYLENPKKYIPGTKMVFAGLKKPQERGDLIAYLKSATK
>Rice gi|218249 Cytochrome C [Oryza sativa (japonica cultivar-group)]
MASFSEAPPGNPKAGEKIFKTKCAQCHTVDKGAGHKQGPNLNGLFGRQSGTTPGYSYSTA
NKNMAVIWEENTLYDYLLNPKKYIPGTKMVFPGLKKPQERADLISYLKEATS

Preparing sequences for alignment using Clustalx
The Clustalx software runs a mathematical
algorithm that aligns multiple sequences in ways that minimize the differences between them. If you think about the types of changes that occur to genes over time, e.g., point mutations, reading frame shifts, codon transpositions or deletions, etc., you begin to see how proteins can change as well. During the alignment procedure, Clustalx uses a variety of approaches to account for these different types of changes by shifting sequences in relation to one another, and by adding small gaps to make up for deletions that may have occurred (keeping track of these modifications, of course).

Once the sequences are aligned, Clustalx then computes the genetic distance between every possible pair of sequences. Calculation of the genetic distance is fairly complex but, in simple terms, it is a measure of evolutionary divergence between homologous sequences. The greater the genetic distance, the more distantly related the sequences are.

Use MS Word to open your copy of “all_five.txt”. In addition to the list of amino acid residues for each protein, the sequence data downloaded from NCBI contains other descriptive information such as the accession number, common and scientific names of the organism from which the sequence was obtained, etc. You will need to edit out some of the descriptive information now so that it does not clutter up the phylogenetic tree you will build later.

Edit out all of the descriptive information except for the common name. Make sure not to remove the “>” character, since that is how Clustalx knows a sequence is beginning. Save your changes as plain text and close out of Word.

Launch Clustalx and use the File menu to Load your copy of “all_five.txt”.

Use the size box in the lower right-hand corner to stretch out the window so that you can see the entire length of all the sequences. At this point the sequences are in their original state and have not been aligned yet.
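The header editing done by hand in Word above could also be scripted. The following is only a minimal sketch, not part of the original lab: the function name `keep_common_names` is made up for illustration, and it assumes the common name is always the first word after the “>” character, as it is in “all_five.txt”.

```python
def keep_common_names(fasta_text):
    """Reduce every FASTA header to '>' plus its first word (the common name).

    Sequence lines are passed through unchanged; blank lines are dropped.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Fall back to a placeholder if a header has no text at all.
            name = words[0] if words else "unnamed"
            cleaned.append(">" + name)
        else:
            cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# One record copied from the handout's sequence list.
example = (
    ">Hippo gi|65451 Cytochrome c [Hippopotamus amphibius]\n"
    "GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQSPGFSYTDANKNKGITWG\n"
    "EETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKQATNE\n"
)

print(keep_common_names(example))
```

The header line comes back as “>Hippo”, which is the short label that would later appear at the tip of a tree branch.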
If you find the colors distracting, you can set them to Black and White in the Colors menu. The colors correspond to different groups of amino acids. For example, red corresponds to the amino acids with basic R groups, while green corresponds to those bearing non-polar R groups.

Note how the lengths of the sequences differ from one another. Which organism has the shortest sequence? Which have the longest? Which appear to be most similar to one another? Which are most different? If you want to position two sequences adjacent to one another so you can compare them without the other sequences in between, use the cut and paste commands in the File menu to do so.

The gray bars under the ruler indicate how well-conserved amino acids are at each position. The higher the bar, the more conserved the sequences are at that position. Obviously, in the un-aligned state not many bars are very tall. If corresponding conserved regions exist among the homologous sequences, you will see that the gray bars will be much taller after the alignment procedure. So go ahead and align the sequences.

From the “Alignment” menu choose “Do Complete Alignment”.

You should note that three things have occurred during the alignment. First, the gray bars under the ruler have grown in height. Shifting the shorter sequences to the right has revealed stretches within the sequences that are well-conserved. Second, Clustalx has rearranged the positions of the sequences to place those most similar next to one another. Third, symbols have been placed in the gray horizontal bar above the sequences to indicate highlights of the conserved features as follows:

“*” indicates a single fully conserved residue
“:” indicates ‘strong’ amino acid R groups conserved
“.” indicates ‘weak’ amino acid R groups conserved

Before leaving Clustalx you will need to run an algorithm that calculates the genetic distances between the aligned sequences.
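To make the idea of a genetic distance more concrete, here is a minimal sketch of one simple measure: the proportion of aligned positions at which two sequences differ, ignoring positions where either sequence has a gap. This is an illustrative assumption, not a description of the exact calculation Clustalx performs. The function name `p_distance` and the two short fragments are made up for the example and are not the real alignment.

```python
def p_distance(aligned_a, aligned_b, gap="-"):
    """Proportion of differing positions between two aligned sequences.

    Positions where either sequence contains a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    different = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no gap-free positions to compare")
    return different / compared


# Two made-up aligned fragments (hypothetical, for illustration only).
seq_1 = "GDVEKGKK-IF"
seq_2 = "GDAERGKKLLF"

# 10 gap-free positions, 3 of which differ.
print(p_distance(seq_1, seq_2))
```

A distance of 0 would mean the two sequences are identical at every compared position, and larger values mean more divergence.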
From the “Tree” menu choose “Draw N-J Tree”. An actual tree will not be drawn, but a file will be written that will contain the genetic distance data that can be used to construct a phylogeny. We will use another application to actually draw the tree.

Go to the Finder and look inside your “Seqs” folder to see the files that Clustalx has written to it. Your folder should look something like this:

You should be able to recognize the files you originally placed in the folder, but in addition there will be three files that were written by Clustalx. The file with the .aln extension is the sequence alignment file. It can be opened by Clustalx if you would like to see the aligned sequences again. The files with the .dnd and .ph extensions are tree files that record genetic distances as branch lengths, formatted for use by software other than Clustalx. We will use the file with the .ph extension (an abbreviation for PHYLIP, a popular tree-building package) to actually construct our phylogenetic tree.

Building a phylogenetic tree using the application N-J Plot
In 1968, a graduate student in Japan, Masatoshi Nei, was reading a paper about the proportion of different isozymes in sibling species of Drosophila when he realized it could be possible to build a phylogenetic tree based on the similarities and differences (genetic distance) that existed between the isozymes. The mathematical method he developed in conjunction with Naruya Saitou for building such trees came to be known as the ‘Neighbor-joining Method’, hence the name N-J tree.

In the phylogenetic trees you have made using MacClade, the relationship between species is determined by their membership in one or more clades. In the MacClade approach, all branches of the tree are essentially of equal length; only the location of the branch points on the tree differs from one species to the next. With the Neighbor-joining method, genetic distances are used to actually calculate the lengths of the tree branches.
This is a very powerful approach for building phylogenies that was made possible by the advent of molecular genetics.

Launch N-J Plot, and open the genetic distance file created by Clustalx that has the .ph extension. The file should be located in your “Seqs” folder.

Your tree should look something like this:

Notice that most of the branches are not the same length. In fact, there is a scale in the upper right-hand corner that shows how the length of an individual branch corresponds to the genetic distance to its nearest neighbor. In most cases the ‘nearest’ neighbor will be a presumed ancestral species. The genetic distance between any two current species (i.e., the species at the tips of the branches) is found by adding up the lengths of the branches in between. This level of precision was not possible in phylogenetic systematics before the Neighbor-joining method was developed.

Try clicking on the “Branch lengths” box to display the specific genetic distances on the branches of the tree.

You should now be able to answer the question that was originally posed concerning where to place Euglena with respect to the plants and animals used in our study. Is Euglena more closely related to plants or animals? Is Euglena more closely related to a hippo or a mosquito? Do the results surprise you? Remember that this phylogeny is based only on genetic distances calculated from sequence data derived from just one specific protein!

For the second half of lab today you will use the skills you developed in this exercise to evaluate two hypotheses regarding the evolution of Darwin’s finches. However, instead of using protein sequences, you will use finch mitochondrial DNA sequences that have been archived at NCBI.
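Returning to the tree reading described above, the rule that the distance between two tips is the sum of the branch lengths between them can be sketched in a few lines. This is only an illustration under stated assumptions: the tree is stored as a hypothetical dictionary mapping each node to its parent and the length of the branch leading to that parent, and the tip names (A to D), internal node names, and branch lengths are all made up, not output from Clustalx or N-J Plot.

```python
def distances_to_ancestors(node, parent):
    """Map the node and each of its ancestors to the summed branch length
    from the starting node, in order from the node up to the root."""
    distances = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        distances[node] = total
    return distances


def tip_distance(tip_a, tip_b, parent):
    """Sum the branch lengths on the path between two tips."""
    from_a = distances_to_ancestors(tip_a, parent)
    from_b = distances_to_ancestors(tip_b, parent)
    # The first ancestor of tip_a that is also above tip_b is where the
    # two paths meet (dictionaries keep insertion order).
    for node, dist_a in from_a.items():
        if node in from_b:
            return dist_a + from_b[node]
    raise ValueError("tips are not in the same tree")


# Hypothetical tree: child -> (parent, branch length). Values are invented.
toy_tree = {
    "A": ("n1", 0.05),
    "B": ("n1", 0.06),
    "C": ("n2", 0.07),
    "D": ("n2", 0.09),
    "n1": ("root", 0.10),
    "n2": ("root", 0.12),
}

print(round(tip_distance("A", "B", toy_tree), 2))  # two sister tips
print(round(tip_distance("A", "C", toy_tree), 2))  # tips on opposite sides
```

In this toy tree the two sister tips A and B are separated only by their own two branches, while A and C are separated by four branches, so their distance is larger.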