FINAL EXAM IN KJE-2004
Exam in: KJE-2004 Bioinformatics - An introduction
Date: November 29th, 2011
Time: 09:00-13:00
Place: Åsgårdvegen 9
Approved remedies: None
The exam contains 4 pages including this cover page
Contact person: Peik Haugen, Tel. 776 45288 / 95122932
Read the questions carefully. Do not spend too much time on each question. It is better to proceed
to the next question and return to time consuming questions at the end. If not otherwise stated,
brief and concise answers are expected. Use sketches and figures to illustrate your answers. If a
question appears unclear, then explain how you have interpreted the question. If one answer
appears to overlap with another, you may cross-reference them. Answer all questions. Questions
may be answered in English or Norwegian.
FACULTY OF SCIENCE AND TECHNOLOGY
University of Tromsø, N-9037 Tromsø, Phone 77 64 40 01, Telefax 77 64 47 65
Question 1. Databases (6 points)
a) (2p) Briefly explain how databases with flat file format are structured. Also briefly
discuss advantages/disadvantages with such databases (for example, why do you think flat file
format databases are popular among biologists?).
Flat file databases consist of essentially one long file/table with many entries that are
delimited by special characters (e.g., a vertical bar). No hidden computer instructions are
included.
Advantages: can be created and managed by people who are not database experts (e.g.,
biologists); simple structure; little effort to set up; easy to understand, browse and update.
Disadvantages: unmanageable and not easily searchable when they grow too large. In short,
flat files work well as long as the database stays small.
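As an illustration of the flat file layout described above, the following minimal Python sketch parses a small made-up file in which each line is one entry and the fields are separated by a vertical bar. The accession numbers, field order and records are hypothetical and chosen only for the example.

```python
# Hypothetical flat-file content: one entry per line, fields separated by "|".
FLAT_FILE = """\
X0001|ProtA|Escherichia coli|MKTAYIAKQR
X0002|ProtB1|Vibrio cholerae|MSLNFLDFEQ
"""

# Assumed field order for this made-up file.
FIELDS = ["accession", "name", "organism", "sequence"]


def parse_flat_file(text, delimiter="|"):
    """Return a list of dicts, one per non-empty line of the flat file."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = line.split(delimiter)
        entries.append(dict(zip(FIELDS, values)))
    return entries


def find_by_name(entries, name):
    """Linear scan: every query has to read through all entries."""
    return [entry for entry in entries if entry["name"] == name]


entries = parse_flat_file(FLAT_FILE)
print(find_by_name(entries, "ProtB1")[0]["organism"])
```

The linear scan in `find_by_name` hints at the main disadvantage: without an index, search time grows with the size of the file.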
b) (2p) Name and describe briefly the structure of at least one alternative database type
(other than flat file format databases).
Relational databases: use sets of tables (instead of a single table). Tables are set in “relation”
to each other by sharing features (attributes). Therefore they can be cross-referenced.
Information from different tables can be gathered and put into a single report. They are easier
to manage, and can produce different reports.
Object-oriented databases (OOD): store data as “objects”. Objects are linked by a set of
pointers that define pre-determined relationships between objects. OODs are flexible, but
lack the rigorous mathematical foundation of relational databases.
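The relational idea (several tables linked by a shared attribute and combined into one report) can be sketched with Python's built-in `sqlite3` module. The table names, column names and rows below are hypothetical and serve only as an illustration.

```python
import sqlite3

# In-memory database with two hypothetical tables that share "organism_id".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE protein (
    protein_id  INTEGER PRIMARY KEY,
    name        TEXT,
    organism_id INTEGER REFERENCES organism(organism_id)
);
INSERT INTO organism VALUES (1, 'Escherichia coli'), (2, 'Vibrio cholerae');
INSERT INTO protein VALUES (1, 'ProtA', 1), (2, 'ProtB1', 2), (3, 'ProtB2', 2);
""")

# Cross-reference the two tables through the shared attribute to build a report.
report = conn.execute("""
    SELECT protein.name, organism.name
    FROM protein
    JOIN organism ON protein.organism_id = organism.organism_id
    ORDER BY protein.name
""").fetchall()
print(report)
conn.close()
```

Each organism name is stored once and referenced by its identifier, so a different report only needs a different query rather than a different file.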
c) (2p) Describe briefly the content of the following databases:
(i) UniProt
(ii) GenBank
(iii) TrEMBL
(iv) Pfam
(i) UniProt: a relatively new protein database that contains information from the
three databases Swiss-Prot, TrEMBL and PIR.
(ii) GenBank: a primary database. Complete collection of available DNA sequences.
(iii) TrEMBL: automatically annotated entries, in contrast to Swiss-Prot, which is
manually curated.
(iv) Pfam: a comprehensive database of conserved protein families. It is used
extensively by experimental, computational, structural and evolutionary
biologists. It is a collection of >12,000 families, and many families contain >100,000
sequences. Pfam uses “seed alignments” (representative sets of sequences that are
relatively stable). The seed alignments are used to build profile hidden Markov
models (HMMs) that can be used to search databases. Homologues that score
above thresholds are aligned against the profile to make a full alignment.
Question 2. Sequence alignment (6 points)
a) (2p) Explain the terms sequence homology, sequence similarity and sequence identity. In
relation to the three terms, does it make a difference if DNA or protein sequences
are compared?
Sequence homology: a conclusion about the common ancestry of sequences. The
conclusion is based on the similarity between a pair of sequences. Sequences are either
homologous or not; there is never a degree of homology.
Sequence similarity: a quantitative measure between two sequences in an alignment.
The similarity can be presented as, for example, percentage similarity.
Sequence identity: a quantitative measure of the identical positions between two sequences
in an alignment. The identity can be presented as, for example, percentage identity.
- For nucleotide sequences similarity and identity will be the same (positions are either
identical or different). However, for amino acid sequences different amino acids can
share physicochemical properties, and two sequences can therefore share more similarity
than identity.
- Homology is independent of DNA vs protein.
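The difference between identity and similarity for protein sequences can be made concrete with a short Python sketch. The grouping of amino acids into "similar" classes below is a simplified assumption made for this example (real tools usually derive similarity from a substitution matrix), and counting only ungapped columns is one of several possible conventions.

```python
# Assumed, simplified grouping of amino acids by physicochemical property.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("ST"),     # hydroxyl
    set("NQ"),     # amide
    set("DE"),     # acidic
    set("KRH"),    # basic
]


def percent_identity_and_similarity(seq1, seq2, gap="-"):
    """Return (% identity, % similarity) over the ungapped aligned columns."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # columns containing a gap are not counted here
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


# Two made-up aligned peptides: 2 identical and 3 similar positions out of 5.
print(percent_identity_and_similarity("MKVLD", "MRILE"))
```

For nucleotide sequences there is no comparable grouping, so the two percentages coincide.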
b) (2p) How would you best describe the genes below (homologs, paralogs, orthologs):

          ProtA    ProtB1   ProtB2   ProtC
ProtA              9%       7%       61%
ProtB1                      89%      5%
ProtB2                               6%
ProtC

Table shows identity between protein products of GeneA, GeneB1,
GeneB2 and GeneC. All proteins are approximately 250 amino acids in
length.
Two aligned amino acid sequences can share about 5% identity by chance. The 5-9% identity
between ProtA/ProtC and ProtB1/ProtB2 is therefore no evidence of a relationship, and ProtA
and ProtC are likely to be unrelated to ProtB1 and ProtB2.
ProtA is 61% identical to ProtC over 250 amino acids of length. This indicates that
they are likely to be homologs. Short proteins with 61% identity, on the other hand,
are not necessarily homologs even if they share a relatively high degree of identity.
ProtB1 and ProtB2 share an even higher degree of identity (89%). They are probably the
result of a gene duplication event and are therefore homologs and, more specifically, paralogs.
c) (2p) Two sequences (SEQ1 and SEQ2) are being optimally aligned by dynamic
programming. Give a brief and overall description of the individual steps in dynamic
programming. (Note that figure below is only meant as illustration/help)
Steps in dynamic programming:
- Construct a 2D matrix of the two sequences.
- Scoring is based on a scoring system (e.g., match=1, mismatch=0)
- Scoring is done one row or column at a time.
- Scoring of second row/column is based on the score for first row/column. Hence
“dynamic”.
- Once scoring is finished, then optimal alignment is found by tracing back through the
matrix from bottom right corner to upper left.
- Best matching path will give the optimal alignment.
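The steps above can be sketched as a small global alignment routine in Python. The match and mismatch scores follow the example scoring system in the answer (match=1, mismatch=0). The linear gap penalty of -1 and the function name `global_align` are assumptions made for this sketch.

```python
def global_align(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming; returns (score, aln1, aln2)."""
    n, m = len(seq1), len(seq2)

    # score[i][j] = best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the matrix row by row; each cell depends on already scored cells.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align the two residues
                score[i - 1][j] + gap,    # gap in seq2
                score[i][j - 1] + gap,    # gap in seq1
            )

    # Trace back from the bottom right corner to the upper left corner.
    aln1, aln2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                aln1.append(seq1[i - 1])
                aln2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aln1.append(seq1[i - 1])
            aln2.append("-")
            i -= 1
        else:
            aln1.append("-")
            aln2.append(seq2[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aln1)), "".join(reversed(aln2))


print(global_align("ACGT", "AGT"))
```

When several paths give the same optimal score, this sketch reports only one of them (it prefers the diagonal move during traceback).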
Question 3. Sequence alignment (4 points)
a) (2p) Describe the concept of profiles (in the context of multiple sequence
alignments), and briefly name the difference between profiles and position-specific
scoring matrices (PSSMs).
Profiles: a table that describes the probabilities of having specific amino
acids/nucleotides at each position in an alignment.
Made by:
- Make a raw frequency table.
- Normalize (divide by the overall frequency).
- Convert the numbers to log2 (the values become log odds scores).
Profiles contain gap-penalty information (PSSMs do not).
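The three construction steps (raw frequencies, normalization, log2 conversion) can be sketched for an ungapped DNA alignment as follows. The uniform background frequency and the pseudocount, which avoids taking the logarithm of zero, are assumptions added for this sketch, and the example alignment is made up.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from an ungapped alignment."""
    n_seqs = len(alignment)
    background = 1.0 / len(alphabet)  # assumption: uniform overall frequency
    total = n_seqs + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)  # step 1: raw frequency table
        row = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            # steps 2 and 3: normalize by the background, then take log2
            row[residue] = math.log2(freq / background)
        pssm.append(row)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific log-odds scores for a sequence of equal length."""
    return sum(column[residue] for column, residue in zip(pssm, seq))


alignment = ["ACGT", "ACGA", "ACGT", "ATGT"]  # made-up example alignment
pssm = build_pssm(alignment)
print(score_sequence(pssm, "ACGT"), score_sequence(pssm, "TTTT"))
```

A sequence that resembles the alignment receives a positive score, while an unrelated sequence scores at or below zero. Gap penalties are not modelled here, which matches the PSSM side of the distinction made above.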
b) (2p) “Hidden” Markov models (HMMs) include probabilities of unobservable states.
Explain briefly what is meant with unobservable states in the context of multiple
sequence alignments of DNA. Draw a sketch, if necessary.
In HMMs matched positions are “observed states”. Gapped
positions are “hidden”. Hidden positions are “unobserved states” and can
be a result of insertions or deletions. HMMs can therefore take into account
probabilities for states that have not yet been observed (based on prior knowledge
from the previous position).
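One way to picture the states behind a DNA alignment is to label every position of every sequence as match (M), insert (I) or delete (D), as profile HMMs commonly do. The Python sketch below does this for a made-up alignment. The rule that a column with more than 50% gaps is treated as an insert column is an assumption chosen for the example, not something stated in the answer above.

```python
def state_paths(alignment, gap="-", max_gap_fraction=0.5):
    """Return (match-column flags, one M/I/D state string per sequence)."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])

    # A column is a "match" column unless it consists mostly of gaps.
    is_match = []
    for c in range(n_cols):
        gaps = sum(1 for seq in alignment if seq[c] == gap)
        is_match.append(gaps / n_seqs <= max_gap_fraction)

    paths = []
    for seq in alignment:
        path = []
        for c, residue in enumerate(seq):
            if is_match[c]:
                # residue in a match column = M, gap in a match column = D
                path.append("M" if residue != gap else "D")
            elif residue != gap:
                path.append("I")  # residue in a mostly gapped column
            # a gap in an insert column corresponds to no state at all
        paths.append("".join(path))
    return is_match, paths


alignment = ["ACG-T", "A-GCT", "ACG-T", "ACG-T"]  # made-up DNA alignment
is_match, paths = state_paths(alignment)
print(paths)
```

Only the nucleotides are observed in a new sequence. Which of these states produced each nucleotide is not observed, and the model has to infer it from its probabilities.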
Question 4. Gene and promoter prediction (6 points)
With the rapid generation of sequences from “Next Generation Sequencing (NGS)” machines
there is an increasing need to use bioinformatics approaches to accurately
predict gene structure.
a) (2p) Name and describe briefly the different elements that a typical prokaryote gene
consists of. Use a figure to illustrate.
Transcription start, ribosome binding site, translation start, coding region,
translation stop, transcription terminator.
b) (2p) The accuracy of gene prediction programs can be evaluated using parameters such as
sensitivity and specificity. Which features are used to evaluate sensitivity and
specificity? Use a figure if needed
True positive, false positive, true negative and false negative.
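How the four counts combine into the two measures can be sketched in Python. The formulas are the standard statistical definitions. The gene prediction literature often reports TP / (TP + FP) under the name specificity, so that variant is included as a separate function. The counts in the example are made up.

```python
def sensitivity(tp, fn):
    """Fraction of true features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Statistical specificity: TN / (TN + FP)."""
    return tn / (tn + fp)


def positive_predictive_value(tp, fp):
    """TP / (TP + FP); often called specificity in gene prediction papers."""
    return tp / (tp + fp)


# Made-up nucleotide-level counts for one prediction program.
tp, fp, tn, fn = 80, 10, 890, 20
print(sensitivity(tp, fn))
print(specificity(tn, fp))
print(positive_predictive_value(tp, fp))
```

Because the two "specificity" definitions can differ considerably on the same data, it is worth stating which one is being reported.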
c) (2p) Gene prediction programs for identifying eukaryotic genes can be categorised based
on their algorithms. Describe briefly the different categories of algorithms.
Ab initio based, homology based and consensus based.
Question 5. Molecular phylogenetics (6 points)
a) (2p) Describe briefly what kind of information you can extract from a cladogram and a
phylogram.
Cladogram: (unscaled tree) the topology between branches defines the evolutionary
relationships between taxa/sequences/species. In a cladogram the end nodes line up
perfectly, meaning that lengths of branches are meaningless (not proportional to
evolutionary distances).
Phylogram: (scaled tree) Same as cladogram except that branch lengths represent the
differences in evolutionary divergences/distances.
b) (2p) You have been given twenty homologous protein sequences that you should align as
best as possible before performing a phylogenetic analysis. Explain how you would
proceed with aligning the sequences and then reconstructing the phylogeny. Name
software that you could use.
- Import sequences into BioEdit.
- Automatically align all sequences with ClustalW.
- Manually adjust the alignment.
- Select positions to be used in phylogenetic analysis by creating a “mask” in BioEdit.
- Export final alignment to appropriate file format (e.g., msf, fasta, Mega).
- Perform phylogenetic analysis using e.g. MEGA and the Neighbor-Joining method
(also choose settings).
- Test phylogeny with Bootstrap analysis (or Bayesian analysis).
c) (2p) Explain how Bootstrap replicates are generated from the original dataset.
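Each bootstrap pseudoreplicate dataset is made by randomly choosing positions (alignment
columns) from the original dataset, with replacement, until the pseudoreplicate is as long as
the original dataset. Some positions are therefore chosen several times and others not at all.
A tree is built from each pseudoreplicate. Bootstrap is therefore a statistical analysis method
to test the robustness of the phylogeny based on the original dataset.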
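The resampling step can be sketched in a few lines of Python. The function names, the fixed random seed and the tiny example alignment are assumptions made for this illustration. Building a tree from each pseudoreplicate is left out.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_cols = len(alignment[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same columns are picked for every sequence, so columns stay intact.
    return ["".join(seq[c] for c in chosen) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate n_replicates pseudoreplicate alignments."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


alignment = ["ACGT", "ACGA", "TCGA"]  # made-up aligned sequences
replicates = bootstrap_replicates(alignment, n_replicates=5)
for replicate in replicates:
    print(replicate)
```

The support for a branch is then commonly reported as the percentage of pseudoreplicate trees in which that branch appears.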
Question 6. Genomics and proteomics (6 points)
The highest resolution genome map is the genomic DNA sequence, which may be considered
as a type of physical map described at a single base-pair level.
a) (2p) There are two major strategies for genome sequencing. Explain the different
strategies. Use figures to illustrate the different strategies.
Shotgun vs hierarchical approaches
b) (2p) Genome annotation involves two steps: gene prediction and functional assignment.
Describe the process of functional assignment.
Functional assignment often employs a combination of theoretical prediction and experimental
verification. Predicted proteins are often verified by BLAST searches against different
databases and by motif and domain searches (e.g., Pfam and InterPro), and are further compared
with experimentally verified proteins/sequences in published literature.
c) (1p) What is lateral gene transfer (or horizontal gene transfer)?
Transfer of genetic material between different organisms. Vertical transfer is the
“normal” mother-to-daughter transmission of genetic material.