FINAL EXAM IN KJE-2004

Exam in: KJE-2004 Bioinformatics - An introduction
Date: November 29th, 2011
Time: 09:00-13:00
Place: Åsgårdvegen 9
Approved remedies: None

The exam contains 4 pages including this cover page.

Contact person: Peik Haugen, Tlf. 776 45288 / 95122932

Read the questions carefully. Do not spend too much time on each question. It is better to proceed to the next question and return to time-consuming questions at the end. If not otherwise stated, brief and concise answers are expected. Use sketches and figures to illustrate your answers. If a question appears unclear, then explain how you have interpreted the question. If one answer appears to overlap with another, you may cross-reference them. Answer all questions. Questions may be answered in English or Norwegian.

FACULTY OF SCIENCE AND TECHNOLOGY
University of Tromsø, N-9037 Tromsø, Phone 77 64 40 01, Telefax 77 64 47 65

Question 1. Databases (6 points)

a) (2p) Briefly explain how databases with flat file format are structured. Also briefly discuss advantages/disadvantages of such databases (for example, why do you think flat file format databases are popular among biologists?).

Flat file databases consist of basically one long file/table with many entries that are delimited by special characters (e.g., a vertical bar). No hidden computer instructions are included.
Advantages: can be created and managed by non-experts in databases (e.g., biologists), simple structure, little effort to set up, easy to understand, easy to browse, easy to update.
Disadvantages: unmanageable and not easily searchable if too large. Basically, they are fine as long as they are not too big.

b) (2p) Name and briefly describe the structure of at least one alternative database type (other than flat file format databases).

Relational databases: use sets of tables (instead of a single table). Tables are set in “relation” to each other by sharing features (attributes). Therefore they can be cross-referenced.
Information from different tables can be gathered and put into a single report. Easier to manage, and can produce different reports.
Object-oriented databases (OODs): store data as “objects”. Objects are linked by a set of pointers that define pre-determined relationships between objects. OODs are flexible, but lack the rigorous mathematical foundation of relational databases.

c) (2p) Briefly describe the content of the following databases:
(i) UniProt
(ii) GenBank
(iii) TrEMBL
(iv) Pfam

(i) UniProt: a relatively new protein database that contains information from the three databases Swiss-Prot (manually curated), TrEMBL (automatic annotation) and PIR.
(ii) GenBank: a primary database. Complete collection of available DNA sequences.
(iii) TrEMBL: see (i).
(iv) Pfam: a comprehensive database of conserved protein families, used extensively by experimental, computational, structural and evolutionary biologists.
- Collection of >12,000 families; many families contain >100,000 sequences.
- Uses “seed alignments” (representative sets of sequences that are relatively stable).
- Seed alignments are used to build profile hidden Markov models (HMMs) that can be used to search databases.
- Homologues that score above thresholds are aligned against the profile to make a full alignment.

Question 2. Sequence alignment (6 points)

a) (2p) Explain the terms sequence homology, sequence similarity and sequence identity. In relation to the three terms, does it make a difference if DNA or protein sequences are compared?

Sequence homology: a conclusion about the common ancestry of sequences. The conclusion is based on the similarity between a pair of sequences. There is never a degree of homology.
Sequence similarity: a quantitative measure between two sequences in an alignment. The similarity can be presented as, for example, percentage similarity.
Sequence identity: a quantitative measure between two sequences in an alignment. The identity can be presented as, for example, percentage identity.
For nucleotide sequences, similarity and identity will be the same (positions are either identical or different). For amino acid sequences, however, amino acids can share physicochemical properties, so two sequences can share more similarity than identity. Homology is independent of DNA vs protein.

b) (2p) How would you best describe the genes below (homologs, paralogs, orthologs)?

          ProtA   ProtB1   ProtB2   ProtC
ProtA             9%       7%       61%
ProtB1                     89%      5%
ProtB2                              6%
ProtC

The table shows identity between the protein products of GeneA, GeneB1, GeneB2 and GeneC. All proteins are approximately 250 amino acids in length.

Two aligned amino acid sequences can share 5% identity by chance. Therefore ProtA and ProtC are likely to be unrelated to ProtB1 and ProtB2. ProtA is 61% identical to ProtC over a length of 250 amino acids. This indicates that they are likely to be homologs. Short proteins with 61% identity, on the other hand, are not necessarily homologs even if they share a relatively high degree of identity. ProtB1 and ProtB2 share an even higher degree of identity. They are probably the result of a gene duplication event and are therefore homologs and paralogs.

c) (2p) Two sequences (SEQ1 and SEQ2) are being optimally aligned by dynamic programming. Give a brief and overall description of the individual steps in dynamic programming. (Note that the figure below is only meant as illustration/help.)

Steps in dynamic programming:
- Construct a 2D matrix of the two sequences.
- Scoring is based on a scoring system (e.g., match=1, mismatch=0).
- Scoring is done one row or column at a time.
- Scoring of the second row/column is based on the score for the first row/column. Hence “dynamic”.
- Once scoring is finished, the optimal alignment is found by tracing back through the matrix from the bottom right corner to the upper left.
- The best-matching path gives the optimal alignment.

Question 3. Sequence alignment (4 points)

a) (2p) Describe the concept of profiles (in the context of multiple sequence alignments), and briefly name the difference between profiles and position-specific scoring matrices (PSSMs).

Profiles: a table that describes the probabilities of having specific amino acids/nucleotides at each position in an alignment. Made by:
- Make a raw frequency table.
- Normalize (divide by overall frequency).
- Convert numbers to log2 (values become log odds scores).
Profiles contain gap-penalty information (PSSMs do not).

b) (2p) “Hidden” Markov models (HMMs) include probabilities of unobservable states. Explain briefly what is meant by unobservable states in the context of multiple sequence alignments of DNA. Draw a sketch, if necessary.

In HMMs, matched positions are “observed states”. Gapped positions are “hidden”. Hidden positions are “unobserved states” and can be a result of insertions or deletions. HMMs can therefore take into account probabilities for states that have not yet been observed (based on prior knowledge from the previous position).

Question 4. Gene and promoter prediction (6 points)

With the rapid generation of sequences from “Next Generation Sequencing (NGS)” machines, there is an increasing need to use bioinformatics approaches to accurately predict gene structure.

a) (2p) Name and briefly describe the different elements that a typical prokaryote gene consists of. Use a figure to illustrate.

Transcription start, ribosome binding site, translation start, coding region, translation stop, transcription terminator.
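As a rough illustration of how some of the elements listed above can be located computationally, the following minimal Python sketch scans the forward strand for a translation start (ATG) followed in frame by a coding region and a translation stop (TAA, TAG or TGA). The function name find_orfs, the min_codons parameter and the example sequence are assumptions made for illustration only. A real gene finder would also consider the reverse strand, alternative start codons, ribosome binding sites and transcription signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand open reading frames.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the start codon plus coding codons, excluding the stop.
    (Hypothetical helper, not taken from any gene prediction program.)
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i  # translation start found
            elif codon in STOP_CODONS:
                # translation stop found: keep the ORF if it is long enough
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGAAATTTGGGTAACC"  # made-up sequence
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

Requiring a minimum length is a simple way to discard the many short start-to-stop stretches that occur by chance in any sequence.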
b) (2p) The accuracy of gene prediction programs can be evaluated using parameters such as sensitivity and specificity. Which features are used to evaluate sensitivity and specificity? Use a figure if needed.

True positive, false positive, true negative and false negative.

c) (2p) Gene prediction programs for identifying eukaryotic genes can be categorised based on their algorithms. Briefly describe the different categories of algorithms.

Ab initio based, homology based and consensus based.

Question 5. Molecular phylogenetics (6 points)

a) (2p) Describe briefly what kind of information you can extract from a cladogram and a phylogram.

Cladogram (unscaled tree): the topology between branches defines the evolutionary relationships between taxa/sequences/species. In a cladogram the end nodes line up perfectly, meaning that the lengths of branches are meaningless (not proportional to evolutionary distances).
Phylogram (scaled tree): same as a cladogram except that branch lengths represent the evolutionary divergences/distances.

b) (2p) You have been given twenty homologous protein sequences that you should align as well as possible before performing a phylogenetic analysis. Explain how you would proceed with aligning the sequences and then reconstructing the phylogeny. Name software that you could use.

- Import sequences into BioEdit.
- Automatically align all sequences with ClustalW.
- Manually adjust the alignment.
- Select positions to be used in the phylogenetic analysis by creating a “mask” in BioEdit.
- Export the final alignment to an appropriate file format (e.g., msf, fasta, Mega).
- Perform the phylogenetic analysis using e.g. MEGA and the Neighbor-Joining method (also choose settings).
- Test the phylogeny with Bootstrap analysis (or Bayesian analysis).

c) (2p) Explain how Bootstrap replicates are generated from the original dataset.
Each bootstrap pseudoreplicate dataset is made by randomly choosing positions (with replacement) from the original dataset until the pseudoreplicate is as long as the original dataset. Bootstrap is therefore a statistical analysis method to test the robustness of the phylogeny based on the original dataset.

Question 6. Genomics and proteomics (6 points)

The highest resolution genome map is the genomic DNA sequence, which may be considered as a type of physical map described at a single base-pair level.

a) (2p) There are two major strategies for genome sequencing. Explain the different strategies. Use figures to illustrate the different strategies.

Shotgun vs hierarchical approaches.

b) (2p) Genome annotation involves two steps: gene prediction and functional assignment. Describe the process of functional assignment.

Often employs a combination of theoretical prediction and experimental verification. Predicted proteins are often verified by BLAST searches against different databases and by motif and domain searches (e.g. Pfam and InterPro), and are further compared with experimentally verified proteins/sequences in the published literature.

c) (1p) What is lateral gene transfer (or horizontal gene transfer)?

Transfer of genetic material between different organisms. Vertical transfer is the “normal” mother-to-daughter transmission of genetic material.
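
Supplementary sketch for Question 5c: the resampling step of the bootstrap can be illustrated in a few lines of Python. This is a minimal sketch, not the procedure used by MEGA or any other program. The function name bootstrap_replicate, the toy alignment and the fixed random seed are assumptions made for illustration. Columns (positions) of the alignment are drawn at random with replacement until the pseudoreplicate has as many columns as the original.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Build one pseudoreplicate by sampling alignment columns with replacement.

    alignment maps a sequence name to its aligned sequence; all sequences
    must have the same length. (Hypothetical helper for illustration.)
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Draw as many column indices as the original alignment has columns;
    # the same column may be drawn more than once, others not at all.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Made-up toy alignment of three sequences, eight columns each.
toy_alignment = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTTCGT",
    "seqC": "ACCTACGA",
}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_replicate(toy_alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each pseudoreplicate would then be analysed with the same tree-building method as the original dataset, and the proportion of replicate trees that contain a given grouping is reported as its bootstrap support.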