Albert-Ludwigs-Universität Freiburg
Technische Fakultät
Institut für Informatik
Robert Kleinkauf
Björn Grüning
Camorn Smith
Sita Saunders
Prof. Dr. Rolf Backofen
http://www.bioinf.uni-freiburg.de
Practical Exercises
Applied Bioinformatics
SS 2015
Exercise Sheet 2
Hand-in date: 25.05.2015
All exercise sheets and supplementary material can be found on the group website:
http://www.bioinf.uni-freiburg.de/Lehre/Courses/2015_SS/P_Einf_Bioinfo/index.html.
A common question when analysing a set of protein sequences is whether and how they are evolutionarily
related and whether they share a common function. Most proteins have active domains, parts
of the protein that interact with other molecules to perform a certain function. Often only these
domains are conserved throughout a group of functionally related proteins. Over the following two
weeks you will use a few databases and methods to explore the characteristics of some
example proteins.
For this you will use:
• Multiple sequence alignments: ClustalΩ (the successor of ClustalW) http://www.ebi.ac.uk/Tools/msa/clustalo/,
Muscle http://www.ebi.ac.uk/Tools/msa/muscle/, and T-Coffee
http://www.ebi.ac.uk/Tools/msa/tcoffee/.
• Phylogeny (trees) http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/.
• NCBI website http://www.ncbi.nlm.nih.gov
• RCSB Protein Data Bank (PDB) http://www.pdb.org/pdb/home/home.do
Exercise 1 First, we want you to apply and practise your knowledge from the last sheet about
the identification of homologous sequences. In this exercise, we are interested in finding orthologous
protein sequences. Use NCBI Blast for this purpose (database: reference proteins, options: exclude
model proteins XM/XP). First identify orthologous proteins of the human TGFB1 gene from the
previous exercise. Next, do the same for the human protein provided in the FASTA file (sheet2-ex1.fasta).
(8 points)
a. What is the name of the protein in sheet2-ex1.fasta? Describe in a few words its general function.
(1 point)
b. How long is the nucleotide coding sequence of the unknown protein sequence? (1 point)
c. Go to the full report page of the human TGFB1 in the Entrez Gene database. Briefly describe
two ways to get the protein sequence of TGFB1 from there. (1 point)
d. For both TGFB1 and the unknown protein, select those orthologs that have an accession number
beginning with ‘NP_’ (i.e. are contained in the RefSeq database and are thus of high quality).
Please make sure the annotations of the gene fit exactly to the original gene annotation and that
you do not select a highly similar variant. This ensures you are (most likely) selecting orthologs
instead of paralogs. Download the protein sequences of the selected orthologs (Hint: use only the
output and reformatting options of the website to get the fasta file). Save the multiple sequence
FASTA file as genename_orthologs.fa. Reformat the headers such that they are equivalent to the
following example: >Homo sapiens NP_000001.1 [Human], where in square brackets you state
the organism name used in general speech (instead of the Latin species name). Please hand in your
FASTA files as well as listing the following in your handed-in answer sheet: (1) accession, (2)
species name, (3) general organism name, (4) E-value, (5) coverage and (6) percent identity in
table format. (4)–(6) are derived from the BLAST output page. To check whether your results
are correct, you should have 14 proteins for TGFB1 and 6 proteins for the unknown protein,
including the original protein sequence. (4 points)
e. Why are the E-values lower for TGFB1 in comparison to the unknown protein, even with the same
coverage and identity values? (1 point)
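The header reformatting asked for in task d can also be scripted. The following is a minimal sketch, assuming the downloaded headers follow the usual NCBI layout (accession first, Latin species name in square brackets at the end). The function names, the COMMON_NAMES lookup and the example header in the usage note are hypothetical and only serve as illustration.

```python
import re

# Hypothetical lookup from Latin species name to everyday name; extend as needed.
COMMON_NAMES = {
    "Homo sapiens": "Human",
    "Mus musculus": "Mouse",
    "Danio rerio": "Zebrafish",
}

# Assumed NCBI-style header: ">ACCESSION free-text description [Species name]"
HEADER = re.compile(r"^>(?P<acc>\S+)\s.*\[(?P<species>[^\]]+)\]\s*$")


def reformat_header(line, common_names=COMMON_NAMES):
    """Rewrite one FASTA header as '>Species ACCESSION [Common name]'."""
    match = HEADER.match(line)
    if match is None:
        # Leave headers that do not fit the assumed layout untouched.
        return line.rstrip("\n")
    species = match.group("species")
    common = common_names.get(species, "unknown")
    return f">{species} {match.group('acc')} [{common}]"


def reformat_fasta(lines, common_names=COMMON_NAMES):
    """Reformat header lines and pass sequence lines through unchanged."""
    return [
        reformat_header(line, common_names) if line.startswith(">") else line.rstrip("\n")
        for line in lines
    ]
```

For example, reformat_header(">NP_000001.1 example protein [Homo sapiens]") yields ">Homo sapiens NP_000001.1 [Human]". Species missing from the lookup are marked "unknown" so that they are easy to spot and fix by hand.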
Exercise 2 A widely used method to analyse a set of given proteins is multiple sequence
alignment (MSA), provided of course that the protein sequences are available. In the following you will
learn the general method underlying MSA, some common programs and their differences, and
apply them to your selected sequences from Exercise 1. (12 points)
a. Describe the general process commonly used to create a multiple sequence alignment (see
Feng-Doolittle). (1 point)
b. Three common MSA approaches are called ClustalΩ, Muscle, and T-Coffee. In a few sentences
describe the main algorithmic differences between these approaches. (3 points)
c. Do these methods calculate the optimal MSA? Explain your answer. (1 point)
d. What is meant by “once a gap, always a gap”? (1 point)
e. Align both orthologous protein sets with all three algorithms. Download and hand in all
alignments in ClustalW (default) format. Are there any differences in the alignments? Why do you
see differences and where or why do you not see any differences? When do you have to choose
your method carefully? (2 points)
f. For TGFB1 only, use your computational skills to find out (1) the lengths of the alignments
and (2) the degree of conservation detected. For (2) count the number of ‘*’, ‘:’ and ‘.’ symbols,
multiply * by 3, : by 2, and . by 1, and add these scores together to obtain one conservation score
per alignment. Do not count this by hand because it will take you too long. Please also report
the counts of the individual symbols as well as the final conservation score. (2 points)
g. In addition to conservation, count the number of proposed insertion or deletion events (called
indels) calculated by the different alignments. You find indels by looking at the gaps (symbol
‘-’). Consecutive ‘-’ characters represent either an insertion (residues have been inserted) or
a deletion (residues have been deleted). When aligning blocks of gaps (i.e. consecutive ‘-’
chars), if there are three consecutive gaps on top of two consecutive blocks, then these represent
two independent indel events (as proposed by the alignment). Regarding the number of indels,
is the score for the conservation a good measure for an MSA? Why/why not? (2 points)
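For tasks f and g, the counting can be automated along the following lines. This is a sketch under assumptions, not a reference solution: it expects ClustalW-style text (one header line, then blocks of name/sequence lines, each followed by an indented conservation line), and it counts every distinct gap run, identified by its start and end column, as one proposed indel event. Both the parsing rules and that counting rule are assumptions to check against your own files, and the toy alignment is made up.

```python
import re
from collections import Counter

# Made-up toy alignment in ClustalW-style layout (not real data).
EXAMPLE = """CLUSTAL O(1.2.4) multiple sequence alignment

seqA      MKT-AL
seqB      MKTQAL
seqC      MRT--L
          *:*  *

seqA      GG
seqB      GA
seqC      GG
          *.
"""


def parse_clustal(text):
    """Return (dict of name -> aligned sequence, all conservation symbols)."""
    sequences = {}
    symbols = []
    # Skip the first line, which is assumed to be the program header.
    for line in text.strip().splitlines()[1:]:
        if not line.strip():
            continue  # blank separator between blocks
        if line[0].isspace():
            symbols.append(line)  # indented line = conservation symbols
            continue
        name, segment = line.split()[:2]
        sequences[name] = sequences.get(name, "") + segment
    return sequences, "".join(symbols)


def conservation_score(symbols):
    """Count '*', ':' and '.' and weight them 3, 2 and 1."""
    counts = Counter(symbols)
    stars, colons, dots = counts["*"], counts[":"], counts["."]
    return stars, colons, dots, 3 * stars + 2 * colons + dots


def gap_runs(sequences):
    """Collect distinct (start, end) column spans of consecutive '-' characters."""
    runs = set()
    for seq in sequences.values():
        for match in re.finditer(r"-+", seq):
            runs.add(match.span())
    return runs
```

After sequences, symbols = parse_clustal(text), the alignment length is the length of any one of the sequences, and len(gap_runs(sequences)) is the number of indel events under the stated counting rule. Whether leading and trailing gap runs should count as indels is left open here.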
Exercise 3 Given a set of orthologous proteins, the analysis of phylogenetic trees is a common
task in biological sequence analysis. Once the ClustalΩ alignment is finished, use “Jalview” for
further analysis.
Look at various methods of tree construction. Create phylogenetic trees. If there are differences in
the alignments, create trees and look at differences. (8 points)
a. Explain in a few sentences the clustering methods “Neighbour Joining”, “UPGMA” and “WPGMA”.
(3 points)
b. What is the relation between a guide tree and a phylogenetic tree? (1 point)
c. Which are used in phylogenetic analyses: homologous, orthologous or paralogous genes? Why? (1 point)
d. Now use the 14 orthologs of TGFB1 from Exercise 1 and the Muscle alignment from Exercise 2
to generate phylogenetic trees with both the neighbour joining and the UPGMA algorithms.
Note: use the phylogeny link provided above and change the method type in the options. What
are the main differences? Do you think these trees reflect the real evolutionary path? Explain
your answer. Which organism is closest to the zebrafish? (2 points)
e. What is wrong in this sentence?: “Protein X (Homo sapiens Myoglobin) and protein Y (Mus
musculus Myoglobin) are 84% homologous”. (1 point)
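To make the difference between UPGMA and WPGMA in task a concrete, here is a minimal sketch of both as agglomerative clustering over a table of pairwise distances. It only returns the tree topology as a Newick-like string and ignores branch lengths. The function name, the weighted flag and any distances you feed it are made up for illustration. Neighbour joining uses a different merge criterion and is not covered by this sketch.

```python
def cluster(names, distances, weighted=False):
    """Average-linkage clustering; UPGMA by default, WPGMA if weighted=True.

    distances maps pairs of names, e.g. ("A", "B"), to a distance.
    Returns the topology as a nested, Newick-like string without branch lengths.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    dist = {frozenset(pair): value for pair, value in distances.items()}
    while len(sizes) > 1:
        # Merge the two clusters that are currently closest to each other.
        a, b = sorted(min(dist, key=dist.get))
        merged = f"({a},{b})"
        for other in sizes:
            if other in (a, b):
                continue
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            if weighted:
                # WPGMA: plain mean of the two cluster distances.
                new = (d_a + d_b) / 2
            else:
                # UPGMA: mean weighted by the number of leaves in each cluster.
                new = (sizes[a] * d_a + sizes[b] * d_b) / (sizes[a] + sizes[b])
            dist[frozenset((merged, other))] = new
        # Drop all distances that still refer to the two merged clusters.
        dist = {pair: v for pair, v in dist.items() if a not in pair and b not in pair}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))
```

The two variants differ only in the line that computes the distance from the merged cluster to every other cluster, so they start to diverge once clusters of unequal size are merged.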
Exercise 4 Now you are to look at the structural aspects of proteins. The structure is what
makes a protein functional, so it is important to understand the fundamental
rules and building blocks that underlie protein structure. (10 points)
a. Draw a general representation of an amino acid and label the Cα atom, the side chain (R group), the
carboxyl end and the amino end. (0.5 points)
b. Draw a peptide bond. (0.5 points)
c. What is the primary, secondary, tertiary and quaternary structure of a protein? What are the
basic elements of the primary and secondary structure? (1 point)
d. The 20 amino acids can be grouped according to their main chemical properties. Find three
reasonable groups and state which amino acids belong to each. (1 point)
e. Which amino acids can form covalent bonds in the tertiary structure? (1 point)
f. How is the function of Hemoglobin different from Myoglobin? (1 point)
g. Look for both of these structures in the PDB. First search for “Myoglobin” and “Hemoglobin”
and note down how many hits you get for each. From looking at the results you should see that
it is sometimes difficult to find the structure you are looking for, so for the following questions
use 2MM1 for Myoglobin (obsolete -> 3RGK) and 2HBE for Hemoglobin. (0 points)
h. Which SCOP and CATH families does each belong to? (1 point)
i. For both Myoglobin and Hemoglobin, list the names and numbers of their secondary structures.
(2 points)
j. Which methods were used to solve the protein structures and how accurate are they in general?
How is accuracy measured? What are the difficulties, advantages and disadvantages of these
methods? (2 points)
Exercise 5 Proteins are often composed of different subunits or domains with particular functions.
Due to different events during evolution, new proteins with new functions have emerged from
existing proteins and domains. In this exercise your task is to identify the largest common domain
of the provided protein sequences (sheet2-ex5.fasta). For the alignment, use a gap open penalty of
10 and a gap extension penalty of 5. (8 points)
Alignment: http://www.ebi.ac.uk/Tools/psa/
Expasy: http://expasy.org/tools/scanprosite/
a. Use your knowledge to identify the accession numbers and gene names of both provided proteins
in sheet2-ex5.fasta. (1 point)
b. Write down and compare the sequence identity and sequence similarity for the global
(Needleman-Wunsch) and local (Smith-Waterman) alignments. (1 point)
c. Which alignment method is more suited for identifying common domains? (1 point)
d. What is the name of the conserved region identified by the local alignment? To find this out,
extract from your local alignment the identified domain region from the larger protein and delete
all gaps. Search with this sequence snippet against the Expasy database. (1 point)
e. Check both fragments: is there a difference in the domain features? (1 point)
f. Now use the complete sequences of the two proteins and search each again against Expasy.
Which domains are found in both provided proteins? State their names and their locations
within the original sequences. (1 point)
g. Three main databases exist for protein domains: SCOP, CATH, and Pfam. State whether the
databases are based on structure or sequence conservation. Which idea is more suitable for the
classification into functional families? (2 points)
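For tasks b and d, two small helpers may be handy: one removes the gap characters from an aligned fragment before it is pasted into a search form, the other recomputes percent identity from a pairwise alignment. This is a sketch; it assumes identity is reported as the number of identical columns divided by the alignment length including gap columns, which you should compare with the numbers in your own alignment report. The function names are arbitrary.

```python
def ungap(aligned_fragment):
    """Remove all gap characters ('-') from an aligned sequence fragment."""
    return aligned_fragment.replace("-", "")


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows.

    Assumption: gap columns count towards the alignment length but never
    count as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * identical / len(aligned_a)
```

For instance, ungap("AC--GT") gives "ACGT". Sequence similarity would additionally need the substitution matrix used by the alignment tool, so it is not recomputed here.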
Exercise 6 Programming in Python (5 points)
a. Write a function that takes a list of numbers and returns the sum over all list items. Check
the results of your sum function against the result of Python’s sum() function. (1 point)
b. Write a function that takes a list of numbers and a boolean called ‘odd’. If ‘odd’ is True, the
function should return all odd numbers in the given list, otherwise the function should return
all even numbers. (1 point)
c. Write a function that returns all prime numbers in a given range. (3 points)
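One possible shape for the three functions is sketched below. The names my_sum, filter_by_parity and primes_in_range are arbitrary, and the sketch assumes integer inputs and a half-open range in task c (start included, stop excluded, like Python's range()); other readings of the tasks are equally valid.

```python
import math


def my_sum(numbers):
    """Add up all items of a list, starting from 0."""
    total = 0
    for number in numbers:
        total += number
    return total


def filter_by_parity(numbers, odd):
    """Return the odd numbers if odd is True, otherwise the even numbers."""
    return [n for n in numbers if (n % 2 == 1) == odd]


def primes_in_range(start, stop):
    """Return all primes p with start <= p < stop, using trial division."""
    primes = []
    for candidate in range(max(start, 2), stop):
        # A number is prime if no integer from 2 up to its square root divides it.
        if all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            primes.append(candidate)
    return primes
```

A quick check for task a is to compare my_sum(values) with sum(values) on a few lists, including the empty list.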
Good luck!