Name:__________________________________ ID: ____________________________________

ECS 129: Structural Bioinformatics
March 15, 2016

Notes:
1) The final exam is open book, open notes.
2) The final is divided into 2 parts, and graded over 100 points.
3) You can answer directly on these sheets (preferred), or on loose paper.
4) Please write your name at least on the front page!
5) Please check your work! If possible, show your work when multiple steps are involved.

Part I (15 questions, each 4 points; total 60 points)
(These questions are multiple choice; in each case, find the most plausible answer.)

1) Two homologous genes:
A) Would be expected to have very similar sequences in related organisms
B) Would be expected to be more similar in distantly related organisms than in organisms that are closely related
C) May have become similar to each other by random mutations
D) Cannot be found on the same genome
E) All of these

Homologous means the two sequences are related, often very similar.

2) In the dynamic programming matrix below, what is the score in the cell identified with a question mark (?)? Assume that the score for a perfect match is set to 10, the score of a mismatch is set to 0, and gap penalties are ignored.

      A    T    W    C    Y    T
      0    0    0
A    10    0    0
T     0   20   10   10   10   20
C     0   10   20   30   20   20

(The label of the first row and several cells of the first two rows, including the cell marked "?", are missing from the matrix.)

A) 20
B) 10
C) 30
D) 40
E) 0

3) The figure below shows a non-standard nucleotide base pair; identify it (note that dX indicates a deoxyribonucleotide, as contained in a DNA molecule, while rX refers to a ribonucleotide, as found in an RNA molecule).
A) dG-dC
B) rG-rC
C) dG-rC
D) rG-dC
E) rC-dG

The nucleotide on the left is a ribonucleotide (its C2’ carries an O); the nucleotide on the right is a deoxyribonucleotide (no O on C2’); the bases are G on the left and C on the right. The pair is therefore rG-dC.

4) The figure below shows a small peptide of six amino acids; give its sequence (hint: there is one charged amino acid at physiological pH, i.e. from pH 5.5 to pH 8.0):
A) AHYWPEF
B) AHFWPEY
C) AHFWPQY
D) AHFWPEF
E) AHFWPEW

5) Given the DNA sequence S = 5’-GAATTC-3’, what does the dot plot between S and its complementary strand, cS, look like?

S = 5’-GAATTC-3’ and cS = 5’-GAATTC-3’ (remember that if nothing is said, sequences are always assumed to be written 5’ to 3’); the two sequences are the same and therefore the corresponding dot plot is A.

6) The figure below shows a small fragment of a protein. From this figure, is it possible to define which extremity is the N-terminal, and which extremity is the C-terminal?
A) Yes: 1 is Nter, 2 is Cter
B) No: there is not enough information
C) Yes: 1 is Cter and 2 is Nter
D) No: Nter and Cter are only defined for nucleic acids
E) No: we would need to know the sequence of this protein fragment

Based on the arrows representing the strands, Nter is at 2, and 1 is Cter.

7) The so-called Rosetta stone for predicting protein-protein interactions is:
A) Gene fusion
B) Gene co-expression
C) Presence of the name of the two proteins concerned in the same scientific paper
D) A very old stone recently found in Gizeh, Egypt, next to the Sphinx, that describes the code for protein-protein interactions in three scripts: hieroglyphic, demotic and Greek
E) A free software with high success rate

Gene fusion is really the so-called Rosetta stone for protein-protein interactions.

8) Which combination of program / substitution matrix will most likely give you the best alignment between two sequences that are highly similar?
A) BLAST / BLOSUM45
B) Dynamic programming / BLOSUM45
C) BLAST / BLOSUM90
D) Dynamic programming / BLOSUM90
E) BLAST / BLOSUM10

BLAST is only a heuristic; it is best to use dynamic programming to get a good alignment. As the two sequences are highly similar, it is best to use a BLOSUM matrix computed from sequences that are very similar, such as BLOSUM90, which is computed from sequences that have at least 90% sequence identity.

9) How many possible alignments, with no internal gaps, can you form when you compare a sequence of length 4 with a sequence of length 8? (Note that an alignment must have at least one letter match between the 2 sequences.)
A) 4
B) 8
C) 9
D) 10
E) 11

The last letter of the sequence of length 4 can face any of the 8 letters of the second sequence, but can also hang after the last letter, up to 3 letters away; therefore the total number is 8 + 3 = 11.

10) Only one of these techniques directly studies the behavior of a molecule as a function of time:
A) Molecular dynamics
B) Monte Carlo sampling techniques
C) Molecular mechanics
D) Energy minimization
E) Simulated annealing techniques

Monte Carlo explores conformational space; molecular mechanics amounts to energy minimization, with no time involved; simulated annealing is just a better sampling technique.

11) We want to find the best alignment(s) between the DNA sequences AGTATCT and AGATGC. The scoring scheme S is defined as follows: S(i,j) = 1 if i = j, and S(i,j) = 0 otherwise. There is a constant gap penalty of -1 (the penalty for the first position counts; see table below). The score Sbest and the number N of optimal alignments are (show your final dynamic programming matrix and the best possible alignment(s) for full credit):

      A    G    T    A    T    C    T
A     1   -1   -1    0   -1   -1   -1
G    -1    2    0    0    0    0    0
A     0   -1    2    2    1    1    1
T    -1    0    2    2    3    1    2
G    -1    1    1    2    2    3    2
C    -1    0    1    1    2    3    3

A) Sbest = 3, N = 2
B) Sbest = 3, N = 1
C) Sbest = 4, N = 1
D) Sbest = 3, N = 3
E) Sbest = 4, N = 3

12) A protein sequence contains one ASP residue. You want to create a new protein sequence, with this ASP being replaced with a TYR. To do this, you first generate the cDNA corresponding to the original protein (with your own choice for the codons you use), then mutate this cDNA to get the sequence corresponding to the new protein. What is the minimum number of mutations needed?
A) 1
B) 2
C) 3
D) 0
E) None of the above

ASP can be represented with the codon GAU; mutating the G to U gives the codon UAU, which codes for TYR (GAT to TAT in the cDNA).

13) The Ramachandran plot of the protein structure 1axc in the PDB databank is given on the right. Which of the models of protein structure given below is most likely the corresponding structure?
A) B) C) D)

The Ramachandran plot shows as many residues in helical structures as in strand conformations. Only structure C corresponds.

14) A single-stranded DNA contains 15% Adenine, as many Guanines as Cytosines, and 40% purines. What is the amount (in percent) of Thymine?
A) 25%
B) 15%
C) 35%
D) 40%
E) Not enough information available

Purines are Adenine and Guanine: with 40% purines and 15% Adenine, there is 25% Guanine. Since there are as many Guanines as Cytosines, there is 25% Cytosine. Thymine makes up the remainder: 100 - 15 - 25 - 25 = 35%.

15) The protein sequence alignment shown below has a total score of 28. Knowing that the score for an exact match is 5 and the score for a mismatch is -4, what is the score used for the (constant, i.e. independent of length) gap penalty?

GCTGGAAG-GCA-T
GC----AGAGCACT

A) -1
B) -2
C) -3
D) -4
E) Undefined (any value would give the same total score)

The alignment has 8 matches, no mismatch and 3 gaps, so the total score is 28 = 5*8 + 3*x, which gives x = (28 - 40)/3 = -4. It was important to notice that the cost of a gap is independent of its length, as said explicitly in the text of the question; there are therefore 3 gaps to consider, and it does not matter that there are 6 residues in total in those gaps.
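
The arithmetic in question 15 can be checked with a short script. This is only a sketch: the helper name `count_columns` is made up for the illustration, and it assumes that a run of consecutive gap characters in the same sequence counts as a single gap, as the question states.

```python
def count_columns(top, bottom):
    """Count matches, mismatches and gaps in a gapped pairwise alignment.

    A run of consecutive '-' characters in the same sequence is counted
    as ONE gap (constant, length-independent gap penalty).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    open_gap = None  # which sequence currently has an open gap, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            if side != open_gap:  # a new gap starts here
                gaps += 1
            open_gap = side
        else:
            open_gap = None
            if a == b:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gaps


TOP = "GCTGGAAG-GCA-T"
BOTTOM = "GC----AGAGCACT"
TOTAL, MATCH, MISMATCH = 28, 5, -4

matches, mismatches, gaps = count_columns(TOP, BOTTOM)
# Solve TOTAL = MATCH*matches + MISMATCH*mismatches + gap_penalty*gaps
gap_penalty = (TOTAL - MATCH * matches - MISMATCH * mismatches) / gaps
print(matches, mismatches, gaps, gap_penalty)
```

Counting gap columns instead of gap runs would give 6 instead of 3 and a penalty of -2, which is why the "independent of length" wording matters.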

16) Docking is the process of predicting the conformation of the complex formed by a receptor and a ligand. Which of these four statements about docking is most likely to be true?
A) Rigid, bound docking is the most difficult situation for predicting the conformation of the complex
B) We only need the conformation of the receptor to perform docking
C) The lock-and-key concept relates to rigid docking
D) Docking can be solved with a simple energy minimization

Lock-and-key assumes that the two partners are rigid, which is the underlying assumption for rigid docking.

17) Dynamic programming, popular for sequence alignment, can also be used for spell checking. Assuming that a match is worth 10, a mismatch is worth 5, and a gap "costs" -5, which of these four words is closest to the word "graffe" typed by a user? Write the score of the optimal alignment next to each word (gaps at the start or at the end do not count).
A) gaff       best score: 40 - 5 = 35
B) graft      best score: 40 + 5 = 45
C) grail      best score: 30 + 5 + 5 = 40
D) giraffe    best score: 60 - 5 = 55

18) Let us consider the Luria and Delbrück experiment. The distribution of the number of mutations that occur during the growth of parallel cultures is a Poisson distribution. If there are no mutants, there were no mutations, and so the mean number of mutations m that occur during the growth of a culture can be calculated from p0, the proportion of cultures with no mutants: m = -ln(p0). Let us consider a bacterium B that is sensitive to a bacteriophage T, unless it carries a mutation M. 50 cultures of the bacterium, each with approximately 3 × 10^7 bacteria, are subjected to the bacteriophage; 40 of those cultures show no resistance, i.e. none of their bacteria carried the mutation. Estimate the mutation rate per bacterium B:
A) 7.4 × 10^(-9)
B) 2.9 × 10^6
C) 0.097
D) 1 × 10^(-9)
E) Not enough information available

The number of mutations per culture is m = -ln(40/50) = 0.22; the mutation rate per bacterium is therefore m / (3 × 10^7) = 7.4 × 10^(-9).

19) You want to design a small peptide that can interact with the TATA box of a specific gene (the TATA box is a small DNA sequence upstream from the gene that serves as transcription initiator). Your constraints are: the peptide should contain a strand (at least predicted to be mostly in extended conformation, based on Chou and Fasman, see Appendix D), and it should contain 12 residues. Which of the following peptides would be a good candidate?
A) MPGCLPQALGLP
B) MPGLEWQLPGLP
C) MLGYTWTTVSVT
D) MVTTVWYVTGT

A and B are unlikely due to the presence of many prolines; D is only 11 residues long.

20) The cDNA corresponding to a small peptide is ATGTATGATCAATGCAGCGGGCCTTTATAG. The corresponding amino acid sequence is Met-Tyr-Asp-Gln-Cys-Ser-Gly-Pro-Leu-Stop. A mutation occurs at the DNA level, with the C at position 15 being substituted with T. What effect do you think this mutation might have on the expression of this gene?
A) It introduces a stop codon and the peptide will be shorter
B) The Cys in position 5 of the protein sequence will be replaced with Trp
C) The Start and Stop codons won’t be in phase anymore and the gene won’t be expressed
D) This is a silent mutation as it will have no impact on the protein sequence

The codon TGC is mutated to TGT; both code for Cysteine, so the mutation is silent.

Part II (2 problems; total 40 points)

Problem 1 (4 questions, each 8 points)

1) The following eukaryotic DNA sequence was given to you:

5’-TAATGGCCTTAGAAGAGGGTCTCGCGAAACACTAAGG-3’

You are told that this sequence, or its complementary strand, codes for one gene. Find the longest "gene", or open reading frame (ORF), corresponding to this DNA sequence; remember that there are 6 possibilities, i.e. 3 possible reading frames for one strand and 3 possible reading frames for its complementary strand. Transcribe this ORF into an RNA sequence.

We don’t know if the sequence given corresponds to the coding strand, so we need to check both this sequence S and its complementary strand C:

5’-CCTTAGTGTTTCGCGAGACCCTCTTCTAAGGCCATTA-3’

The complementary strand C does not contain any ATG (start codon). The initial sequence S contains one ATG, and one TAA (stop codon) in phase with that ATG. Consequently, the longest ORF goes from the ATG to that TAA:

5’ ATG GCC TTA GAA GAG GGT CTC GCG AAA CAC TAA-3’

The corresponding RNA sequence is:

5’ AUG GCC UUA GAA GAG GGU CUC GCG AAA CAC UAA-3’

2) As this is a eukaryotic sequence, it may contain an intron. For simplicity, we will assume that introns always start with GU and end with CA. Identify all possible introns, and explain why their removal would result in the loss of the gene.

There is one GU and one CA in the RNA sequence:

5’ AUG GCC UUA GAA GAG GGU CUC GCG AAA CAC UAA-3’

The only possible intron therefore runs from that GU to that CA: GU CUC GCG AAA CA (13 nucleotides). If we remove it, we get the new RNA sequence:

5’ AUG GCC UUA GAA GAG GCU AA-3’

in which the start and stop codons are no longer in phase (13 is not a multiple of 3); the gene would be lost.

3) Based on question 2 just above, we know that the RNA is not spliced. Find the sequence of the "protein" it encodes.

The mRNA is:

5’ AUG GCC UUA GAA GAG GGU CUC GCG AAA CAC UAA-3’

The protein sequence is obtained directly using the genetic code:

Nter - Met Ala Leu Glu Glu Gly Leu Ala Lys His - Cter

Or, in one-letter code:

Nter-MALEEGLAKH-Cter

4) Predict the secondary structure of this "protein" using the Chou and Fasman method, with the propensities given in Appendix D.

We start by writing the propensities:

             M     A     L     E     E     G     L     A     K     H
P(helix)    1.47  1.29  1.30  1.44  1.44  0.56  1.30  1.29  1.23  1.22
P(strand)   0.97  0.90  1.02  0.75  0.75  0.92  1.02  0.90  0.77  1.08

There are no initiation sites for strands. However, there are multiple possible initiation sites for helices. We can pick the Nter of the sequence: MALEEG. We can then extend the helix one window of four residues at a time:
- EEGL: sum P(alpha) = 1.44 + 1.44 + 0.56 + 1.30 = 4.74 > 4
- EGLA: sum P(alpha) = 1.44 + 0.56 + 1.30 + 1.29 = 4.59 > 4
- GLAK: sum P(alpha) = 0.56 + 1.30 + 1.29 + 1.23 = 4.38 > 4
- LAKH: sum P(alpha) = 1.30 + 1.29 + 1.23 + 1.22 = 5.04 > 4

Finally, we compute the average P(alpha) over the whole peptide:
Sum = 1.47 + 1.29 + 1.30 + 1.44 + 1.44 + 0.56 + 1.30 + 1.29 + 1.23 + 1.22 = 12.54
Average = 12.54 / 10 = 1.254 > 1

The whole peptide is predicted to be helical.

Problem 2 (8 points)

You have isolated an important gene that regulates the size of a newly found frog from the island of Borneo. You have also been able to find the sequence of the protein encoded by this gene. You suspect that sequences similar to this sequence can be found in other organisms, but with circular permutation:

[Figure: the initial sequence and the permuted sequence, in which the block of N amino acids has moved from the end to the beginning]

In a circular permutation, N amino acids (N can take any value between 1 and M-1, where M is the total length of the protein) at the end of the original sequence appear at the beginning of the permuted sequence (i.e. before the remaining M-N amino acids).
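
The definition above can be made concrete with a few lines of code. This is a minimal sketch: the function name `circular_permutations` is made up for the illustration, and the ten-residue peptide from Problem 1 is used only as a stand-in, since the frog protein sequence is not given.

```python
def circular_permutations(sequence):
    """Yield (N, permuted sequence) for every N from 1 to M-1.

    The last N residues of the original sequence are moved to the
    front, ahead of the remaining M-N residues.
    """
    m = len(sequence)
    for n in range(1, m):
        yield n, sequence[-n:] + sequence[:-n]


# Stand-in sequence (the peptide from Problem 1), NOT the frog protein.
perms = dict(circular_permutations("MALEEGLAKH"))
for n, permuted in perms.items():
    print(n, permuted)
```

A protein of length M therefore has M-1 circular permutations, each of the same length M as the original.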
Propose an efficient strategy for detecting all possible permuted sequences of your frog sequence in a large database of protein sequences.

The most efficient strategy is to generate a pseudo-sequence in which the sequence of your frog protein appears twice in a row. This pseudo-sequence will look like (following the drawing of the question):

[Figure: the pseudo-sequence, made of "repeat one" followed by "repeat two"]

Every circular permutation of the frog sequence is a contiguous segment of this pseudo-sequence, so if you search a protein sequence database with it, you will detect all possible permutations!

Appendix A: Amino Acids

[Figure: atom-labelled structures of the amino acids, in three groups.
Hydrophobic amino acids: Gly (G), Ala (A), Val (V), Leu (L), Ile (I), Pro (P), Phe (F), Met (M).
Polar amino acids: Ser (S), Thr (T), Tyr (Y), His (H), Trp (W), Asn (N), Gln (Q).
Polar amino acids (continued): Glu (E), Lys (K), Arg (R), Cys (C), Asp (D).]

Appendix B: Nucleotides

[Figure: structures of the nucleotides, including uracil (U).]

Appendix C: Genetic Code

First    Second base                              Third
base     U            C       A       G           base
U        Phe          Ser     Tyr     Cys         U
         Phe          Ser     Tyr     Cys         C
         Leu          Ser     STOP    STOP        A
         Leu          Ser     STOP    Trp         G
C        Leu          Pro     His     Arg         U
         Leu          Pro     His     Arg         C
         Leu          Pro     Gln     Arg         A
         Leu          Pro     Gln     Arg         G
A        Ile          Thr     Asn     Ser         U
         Ile          Thr     Asn     Ser         C
         Ile          Thr     Lys     Arg         A
         Met/Start    Thr     Lys     Arg         G
G        Val          Ala     Asp     Gly         U
         Val          Ala     Asp     Gly         C
         Val          Ala     Glu     Gly         A
         Val          Ala     Glu     Gly         G

Appendix D: Chou and Fasman Propensities

Amino Acid   Helix   Strand   Turn
Ala          1.29    0.90     0.78
Cys          1.11    0.74     0.80
Leu          1.30    1.02     0.59
Met          1.47    0.97     0.39
Glu          1.44    0.75     1.00
Gln          1.27    0.80     0.97
His          1.22    1.08     0.69
Lys          1.23    0.77     0.96
Val          0.91    1.49     0.47
Ile          0.97    1.45     0.51
Phe          1.07    1.32     0.58
Tyr          0.72    1.25     1.05
Trp          0.99    1.14     0.75
Thr          0.82    1.21     1.03
Gly          0.56    0.92     1.64
Ser          0.82    0.95     1.33
Asp          1.04    0.72     1.41
Asn          0.90    0.76     1.23
Pro          0.52    0.64     1.91
Arg          0.96    0.99     0.88
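
The propensities of Appendix D can be used to re-check the sums in Problem 1, question 4. The sketch below only reproduces the window sums and the whole-peptide averages used in that answer; it is not a full implementation of the Chou and Fasman method (nucleation and extension rules are left out). The dictionary names, the helper functions and the use of one-letter codes are choices made for this illustration.

```python
# Helix and strand propensities from Appendix D, keyed by one-letter code.
HELIX = {"A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44,
         "Q": 1.27, "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97,
         "F": 1.07, "Y": 0.72, "W": 0.99, "T": 0.82, "G": 0.56,
         "S": 0.82, "D": 1.04, "N": 0.90, "P": 0.52, "R": 0.96}
STRAND = {"A": 0.90, "C": 0.74, "L": 1.02, "M": 0.97, "E": 0.75,
          "Q": 0.80, "H": 1.08, "K": 0.77, "V": 1.49, "I": 1.45,
          "F": 1.32, "Y": 1.25, "W": 1.14, "T": 1.21, "G": 0.92,
          "S": 0.95, "D": 0.72, "N": 0.76, "P": 0.64, "R": 0.99}


def window_sums(sequence, table, width=4):
    """Return (window, summed propensity) for every window of `width` residues."""
    return [(sequence[i:i + width],
             round(sum(table[aa] for aa in sequence[i:i + width]), 2))
            for i in range(len(sequence) - width + 1)]


def average(sequence, table):
    """Average propensity over the whole sequence."""
    return sum(table[aa] for aa in sequence) / len(sequence)


PEPTIDE = "MALEEGLAKH"  # the peptide encoded by the ORF of Problem 1

for window, total in window_sums(PEPTIDE, HELIX):
    print(window, total)
print("average helix propensity:", round(average(PEPTIDE, HELIX), 3))
print("average strand propensity:", round(average(PEPTIDE, STRAND), 3))
```

With these tables, every four-residue helix window of the peptide sums to more than 4, and the average helix propensity is above 1 while the average strand propensity is below 1, in line with the helical prediction.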