Albert-Ludwigs-Universität Freiburg
Technische Fakultät, Institut für Informatik
Robert Kleinkauf, Björn Grüning, Camorn Smith, Sita Saunders
Prof. Dr. Rolf Backofen
http://www.bioinf.uni-freiburg.de

Practical Exercises Applied Bioinformatics SS 2015
Exercise Sheet 2
Hand-in date: 25.05.2015

All exercise sheets and supplementary material can be found on the group website: http://www.bioinf.uni-freiburg.de/Lehre/Courses/2015_SS/P_Einf_Bioinfo/index.html.

A common question when analysing a set of protein sequences is whether and how they are evolutionarily related and whether they share a common function. Most proteins have active domains, parts of the protein that interact with other molecules to perform a certain function. Often only these domains are conserved throughout a group of functionally related proteins. Over the following two weeks you will apply a few databases and methods used to explore the characteristics of some example proteins. For this you will use:

• Multiple sequence alignments: ClustalΩ (the successor of ClustalW) http://www.ebi.ac.uk/Tools/msa/clustalo/, Muscle http://www.ebi.ac.uk/Tools/msa/muscle/, and T-Coffee http://www.ebi.ac.uk/Tools/msa/tcoffee/
• Phylogeny (trees): http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/
• NCBI website: http://www.ncbi.nlm.nih.gov
• RCSB Protein Data Bank (PDB): http://www.pdb.org/pdb/home/home.do

Exercise 1

First, we want you to apply and practise your knowledge from the last sheet about the identification of homologous sequences. In this exercise, we are interested in finding orthologous protein sequences. Use NCBI Blast for this purpose (database: reference proteins, options: exclude model proteins XM/XP). First identify orthologous proteins of the human TGFB1 gene from the previous exercise. Next, use the human protein that is provided in the FASTA file (sheet2-ex1.fasta). (8 points)

a. What is the name of the protein in sheet2-ex1.fasta? Describe in a few words its general function. (1 point)

b. How long is the nucleotide coding sequence of the unknown protein sequence? (1 point)

c. Go to the full report page of the human TGFB1 in the Entrez Gene database. Briefly describe two ways to get the protein sequence of TGFB1 from there. (1 point)

d. For both TGFB1 and the unknown protein, select those orthologs that have an accession number beginning with ‘NP_’ (i.e. they are contained in the RefSeq database and are thus of high quality). Please make sure that the annotation of each gene matches the original gene annotation exactly and that you do not select a highly similar variant. This ensures that you are (most likely) selecting orthologs instead of paralogs. Download the protein sequences of the selected orthologs (Hint: use only the output and reformatting options of the website to get the FASTA file). Save the multiple-sequence FASTA file as genename_orthologs.fa. Reformat the headers so that they match the following example: >Homo sapiens NP_000001.1 [Human], where the square brackets contain the common name of the organism (instead of the Latin species name). Please hand in your FASTA files and list the following in your answer sheet in table format: (1) accession, (2) species name, (3) common organism name, (4) E-value, (5) coverage and (6) percent identity. Items (4) to (6) are taken from the BLAST output page. To check whether your results are correct: you should have 14 proteins for TGFB1 and 6 proteins for the unknown protein, including the original protein sequence. (4 points)

e. Why are the E-values lower for TGFB1 than for the unknown protein, even with the same coverage and identity values? (1 point)

Exercise 2

A widely used method to analyse a set of given proteins is multiple sequence alignment (MSA), which of course requires that the protein sequences are available. In the following, you will learn the general method underlying MSA, get to know some common programs and their differences, and apply them to your selected sequences from Exercise 1. (12 points)

a. Describe the general process commonly used to create a multiple sequence alignment (see Feng and Doolittle). (1 point)

b. Three common MSA approaches are called ClustalΩ, Muscle, and T-Coffee. In a few sentences, describe the main algorithmic differences between these approaches. (3 points)

c. Do these methods calculate the optimal MSA? Explain your answer. (1 point)

d. What is meant by “once a gap, always a gap”? (1 point)

e. Align both orthologous protein sets with all three algorithms. Download and hand in all alignments in ClustalW (default) format. Are there any differences between the alignments? Why do you see differences, and where or why do you not see any? When do you have to choose your method carefully? (2 points)

f. For TGFB1 only, use your computational skills to find out (1) the lengths of the alignments and (2) the degree of conservation detected. For (2), count the number of ‘*’, ‘:’ and ‘.’ symbols, multiply the count of ‘*’ by 3, of ‘:’ by 2 and of ‘.’ by 1, and add these values together to obtain one conservation score per alignment. Do not count by hand, because it will take you too long. Please also report the counts of the individual characters as well as the final conservation score. (2 points)

g. In addition to conservation, count the number of proposed insertion or deletion events (called indels) in the different alignments. You find indels by looking at the gaps (symbol ‘-’). A run of consecutive ‘-’ characters represents either an insertion (residues have been inserted) or a deletion (residues have been deleted). When blocks of gaps (i.e. runs of consecutive ‘-’ characters) in different sequences lie on top of each other, for example a block of three consecutive gaps above a block of two consecutive gaps, they represent two independent indel events (as proposed by the alignment). Considering the number of indels, is the conservation score a good measure for an MSA? Why/why not? (2 points)

Exercise 3

Given a set of orthologous proteins, the analysis of phylogenetic trees is a common task in biological sequence analysis. Once the ClustalΩ alignment is finished, use “JalView” for further analysis. Look at various methods of tree construction. Create phylogenetic trees. If there are differences between the alignments, create a tree for each and look at the differences. (8 points)

a. Explain in a few sentences the clustering methods “Neighbour Joining”, “UPGMA” and “WPGMA”. (3 points)

b. What is the relation between a guide tree and a phylogenetic tree? (1 point)

c. Are homologous, orthologous or paralogous genes used in phylogenetic analyses? Why? (1 point)

d. Using the 14 orthologs of TGFB1 from Exercise 1 and the Muscle alignment from Exercise 2, generate phylogenetic trees with both the neighbour joining and the UPGMA algorithms. Note: use the phylogeny link provided above and change the method type in the options. What are the main differences? Do you think these trees reflect the real evolutionary path? Explain your answer. Which species is closest to the zebrafish? (2 points)

e. What is wrong with this sentence: “Protein X (Homo sapiens Myoglobin) and protein Y (Mus musculus Myoglobin) are 84% homologous”? (1 point)

Exercise 4

Now you will look at the structural aspects of proteins. The structure is what makes a protein functional; it is therefore important to understand the fundamental rules and building blocks that underlie protein structure. (10 points)

a. Draw a general representation of an amino acid and label the Cα atom, the side chain (R group), the carboxyl end and the amino end. (0.5 points)

b. Draw a peptide bond. (0.5 points)

c. What are the primary, secondary, tertiary and quaternary structures of a protein? What are the basic elements of the primary and secondary structure? (1 point)

d. The 20 amino acids can be grouped according to their main chemical properties. Find three reasonable groups and state which amino acids belong to each. (1 point)

e. Which amino acids can form covalent bonds in the tertiary structure? (1 point)

f. How is the function of Hemoglobin different from that of Myoglobin? (1 point)

g. Look for both of these structures in the PDB. First search for “Myoglobin” and “Hemoglobin” and note down how many hits you get for each. From looking at the results you should see that it is sometimes difficult to find the structure you are looking for, so for the following questions use 2MM1 for Myoglobin (note: this entry is obsolete and has been superseded by 3RGK) and 2HBE for Hemoglobin. (0 points)

h. Which SCOP and CATH families does each belong to? (1 point)

i. For both Myoglobin and Hemoglobin, list the names and numbers of their secondary structure elements. (2 points)

j. Which methods were used to solve the protein structures and how accurate are they in general? How is accuracy measured? What are the difficulties, advantages and disadvantages of these methods? (2 points)

Exercise 5

Proteins are often composed of different subunits or domains with particular functions. Through different events during evolution, new proteins with new functions have emerged from existing proteins and domains. In this exercise your task is to identify the largest common domain of the provided protein sequences (sheet2-ex5.fasta). For the alignment, use a gap open penalty of 10 and a gap extension penalty of 5. (8 points)

Alignment: http://www.ebi.ac.uk/Tools/psa/
Expasy: http://expasy.org/tools/scanprosite/

a. Use your knowledge to identify the accession numbers and gene names of both provided proteins in sheet2-ex5.fasta. (1 point)

b. Write down and compare the sequence identity and sequence similarity for the global (Needleman-Wunsch) and local (Smith-Waterman) alignments. (1 point)

c. Which alignment method is better suited for identifying common domains? (1 point)

d. What is the name of the conserved region identified by the local alignment? To find this out, extract the identified domain region of the larger protein from your local alignment and delete all gaps. Search the Expasy database with this sequence snippet. (1 point)

e. Check both fragments: is there a difference in the domain features? (1 point)

f. Now use the complete sequences of the two proteins and search the Expasy database again with each. Which domains are found in both provided proteins? State their names and their locations within the original sequences. (1 point)

g. Three main databases exist for protein domains: SCOP, CATH, and Pfam. State whether each database is based on structure or on sequence conservation. Which idea is more suitable for the classification into functional families? (2 points)

Exercise 6

Programming in Python (5 points)

a. Write a function that gets a list of numbers and returns the sum over all list items. Check the results of your sum function against the result of Python's sum() function. (1 point)

b. Write a function that gets a list of numbers and a boolean called ‘odd’. If ‘odd’ is True, the function should return all odd numbers in the given list; otherwise the function should return all even numbers. (1 point)

c. Write a function that returns all prime numbers in a given range. (3 points)

Good luck!
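As orientation for Exercise 6, here is a rough sketch of what the three functions might look like. The names my_sum, filter_parity and primes_in_range are hypothetical and not prescribed by the sheet. The sketch assumes that the lists contain integers and that the range in 6c includes both end points. Other readings of the task are equally valid.

```python
def my_sum(numbers):
    """Return the sum of all items in numbers (Exercise 6a)."""
    total = 0
    for value in numbers:
        total += value
    return total


def filter_parity(numbers, odd):
    """Return the odd items if odd is True, otherwise the even items (6b)."""
    wanted = 1 if odd else 0
    # In Python, n % 2 is 0 or 1 for every integer, including negative ones.
    return [n for n in numbers if n % 2 == wanted]


def primes_in_range(start, stop):
    """Return all primes p with start <= p <= stop (6c).

    Assumption: both bounds are inclusive; the sheet does not specify this.
    """
    primes = []
    for candidate in range(max(start, 2), stop + 1):
        is_prime = True
        divisor = 2
        # Trial division: testing divisors up to sqrt(candidate) is enough.
        while divisor * divisor <= candidate:
            if candidate % divisor == 0:
                is_prime = False
                break
            divisor += 1
        if is_prime:
            primes.append(candidate)
    return primes


if __name__ == "__main__":
    data = [3, 8, -5, 12, 7]
    # 6a asks for a comparison with the built-in sum().
    assert my_sum(data) == sum(data)
    print(my_sum(data))
    print(filter_parity(data, True))
    print(filter_parity(data, False))
    print(primes_in_range(1, 30))
```

Trial division is used here because it is short and easy to verify; for large ranges a sieve of Eratosthenes would usually be the faster choice.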