Algorithms in Computational Biology
(236522)
Spring 2002
Lecturer: Shlomo Moran, Taub 639, tel 4363
Office hours Wednesday 1630-1730
TA: Ydo Wexler, Taub 431, tel 4927
Office hours Monday 1030-1130
Lecture: Tuesday 11:30-13:30, Taub 2
Tutorial: Monday 9:30-10:30, Taub 4
Course Information
Requirements & Grades:
15-25% homework, in five theoretical question sets (to be submitted within two weeks). Homework is obligatory.
75-85% test. The test grade must be above 55 for the homework grade to count.
Exam date: 7.7.04.
Bibliography
Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998
Introduction to Computational Molecular Biology, J. Setubal, J. Meidanis, PWS Publishing Company, 1997
Phylogenetics, C. Semple, M. Steel, Oxford University Press, 2003
url: www.cs.technion.ac.il/~cs236522

Course Prerequisites
Computer Science and Probability Background
• Data Structures 1 (cs234218)
• Algorithms 1 (cs234247)
• Probability (any course)
Some Biology Background
• Formally: None, to allow CS students to take this course.
• Recommended: Molecular Biology 1 (especially for those in the Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).
Studying the algorithms in this course while acquiring enough
biology background is far more rewarding than ignoring the
biological context.
Biological Background
First homework assignment: Read the first chapter (pages 1-30) of Setubal et al., 1997 (a copy is available in the Taub building library, and one for loan at Fishbach).
Solve questions 1-3, p. 30 (to be posted on the course web site).
Due: Tutorial class of 22.3.04 (~2 weeks from today), or earlier in the teaching assistant’s mail slot.
This class has been edited from Nir Friedman’s lecture which is available at
www.cs.huji.ac.il/~nir.
Changes made by Dan Geiger, then Shlomo Moran.
Computational Biology
Computational biology is the application of computational tools
and techniques to (primarily) molecular biology. It enables new
ways of study in life sciences, allowing analytic and predictive
methodologies that support and enhance laboratory work. It is a
multidisciplinary area of study that combines Biology, Computer
Science, and Statistics.
Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting the field to molecular biology only.
Examples of Areas of Interest
• Building evolutionary trees from molecular (and other) data
• Efficiently constructing genomes of various organisms
• Understanding the structure of genomes (SNP, SSR, Genes)
• Understanding function of genes in the cell cycle and disease
• Deciphering structure and function of proteins

SNP: Single Nucleotide Polymorphism
SSR: Simple Sequence Repeat
Exponential growth of biological information:
growth of sequences, structures, and literature.
Course Goals
• Learn about computational tools for (primarily) molecular biology
• Cover computational tasks that are posed by modern molecular biology
• Discuss the biological motivation and setup for these tasks
• Understand the kinds of solutions that exist and what principles justify them
Topics I
Dealing with DNA/Protein sequences:
• Genome projects and how sequences are found
• Finding similar sequences
• Models of sequences: Hidden Markov Models
• Transcription regulation
• Protein Families
• Gene finding
Topics II
Models of genetic change:
• Long term: evolutionary changes among species
• Reconstructing evolutionary trees from sequences
• Short term: genetic variations in a population
• Finding genes by linkage and association
Topics III (if time allows)
Protein World:
• How proteins fold - secondary & tertiary structure
• How to predict protein folds from sequence data
• How to analyze protein changes from raw experimental measurements (MassSpec)
Human Genome
Most human cells contain 46 chromosomes:
• 2 sex chromosomes (X, Y): XY in males, XX in females.
• 22 pairs of chromosomes, named autosomes.
Source: Alberts et al
DNA Organization
Source: Alberts et al
The Double Helix
DNA Components
Four nucleotide types:
• Adenine
• Guanine
• Cytosine
• Thymine
Hydrogen bonds (electrostatic connection):
• A-T
• C-G
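The A-T and C-G pairing rule means that one strand determines the other. A minimal sketch of this, assuming strands are plain strings over A, C, G, T (the names `PAIR`, `complement` and `reverse_complement` are illustrative, not from the lecture):

```python
# Base-pairing rule from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction,
    reflecting that the two strands of the helix run antiparallel."""
    return complement(strand)[::-1]


print(complement("ATCG"))          # TAGC
print(reverse_complement("ATCG"))  # CGAT
```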
Genome Sizes
• E. coli (bacteria): 4.6 x 10^6 bases
• Yeast (simple fungi): 15 x 10^6 bases
• Smallest human chromosome: 50 x 10^6 bases
• Entire human genome: 3 x 10^9 bases
Genetic Information
• Genome – the collection of genetic information.
• Chromosomes – storage units of genes.
• Gene – basic unit of genetic information. Genes determine the inherited characters.
Genes
The DNA strings include:
• Coding regions (“genes”)
  - E. coli has ~4,000 genes
  - Yeast has ~6,000 genes
  - C. elegans has ~13,000 genes
  - Humans have ~32,000 genes
• Control regions
  - These are typically adjacent to the genes
  - They determine when a gene should be “expressed”
• “Junk” DNA (unknown function, ~90% of the DNA in human chromosomes)
The Cell
All cells of an organism contain the same DNA content
(and the same genes) yet there is a variety of cell types.
Example: Tissues in Stomach
How is this variety encoded and expressed?
Central Dogma
Gene → (Transcription) → mRNA → (Translation) → Protein
Cells express different subsets of their genes in different tissues and under different conditions.
Transcription
• Coding sequences can be transcribed to RNA
• RNA nucleotides:
  - Similar to DNA, slightly different backbone
  - Uracil (U) instead of Thymine (T)
Source: Mathews & van Holde
Transcription: RNA Editing
1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists
Exons hold information; they are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules then pass through the nuclear membrane into the cytoplasm.
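The three steps above (transcribe, eliminate introns, splice exons) can be sketched on strings. This is a simplified illustration, not a biological model: transcription is represented by rewriting the coding strand with U in place of T, and the exon coordinates are hypothetical values chosen for the example.

```python
def transcribe(coding_strand: str) -> str:
    """Step 1: write the coding strand as RNA (U instead of T)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Steps 2-3: drop everything outside the exons and join the exons.

    `exons` holds 0-based, half-open (start, end) intervals in
    left-to-right order; whatever lies between them is treated as intron.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical gene: exon "ATG", intron "GTAAGT", exon "CCCTAG".
pre_mrna = transcribe("ATGGTAAGTCCCTAG")
mature = splice(pre_mrna, [(0, 3), (9, 15)])
print(pre_mrna)  # AUGGUAAGUCCCUAG
print(mature)    # AUGCCCUAG
```

Alternative splicing corresponds to calling `splice` with a different choice of exon intervals on the same transcript.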
RNA roles
• Messenger RNA (mRNA)
  - Encodes protein sequences. Each triplet of nucleotides (a codon) translates to one amino acid (the protein building block).
• Transfer RNA (tRNA)
  - Decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.
• Ribosomal RNA (rRNA)
  - Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created.
• ...
Translation
• Translation is mediated by the ribosome
• The ribosome is a complex of protein & rRNA molecules
• The ribosome attaches to the mRNA at a translation initiation site
• The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (polypeptide), which is released and folds into a protein.
Genetic Code
There are 20 amino acids from which proteins are built.
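As a rough sketch of how the code is read, the following assumes the standard genetic code, written compactly as a 64-letter string in U, C, A, G order, with one-letter amino acid symbols and `*` for a stop codon. The function name `translate` and the choice to start reading at position 0 are simplifications made for this example.

```python
BASES = "UCAG"
# Standard genetic code: codons enumerated with the first base varying
# slowest and the third base fastest, each in the order U, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time from position 0 and
    stop at the first stop codon (or at the end of the sequence)."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(set(AMINO) - {"*"}))  # 20 distinct amino acids
print(translate("AUGGCCUAA"))   # MA
```

The 64 codons map onto only 20 amino acids plus the stop signal, so several codons encode the same amino acid.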
Protein Structure
• Proteins are polypeptides of 70-3000 amino acids
• The structure of a protein is (mostly) determined by the sequence of amino acids that make up the protein
Evolution
• Related organisms have similar DNA
  - Similarity in sequences of proteins
  - Similarity in organization of genes along the chromosomes
• Evolution plays a major role in biology
  - Many mechanisms are shared across a wide range of organisms
  - During the course of evolution existing components are adapted for new functions
Evolution
Evolution of new organisms is driven by
• Diversity
  - Different individuals carry different variants of the same basic blueprint
• Mutations
  - The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
• Selection bias
Source: Alberts et al
The Tree of Life
Characters in Species
• A (discrete) character is a property which distinguishes between species (e.g. dental structure, a certain gene).
• A character state is a value of the character (e.g. human dental structure).
• Problem: Given a set of species, specified by their characters, reconstruct their evolutionary tree.
Species ≡ Vertices
States ≡ Colors
Characters ≡ Colorings
Evolutionary tree ≡ A tree with many colorings,
containing the given vertices
Evolutionary trees should avoid reversal transitions
• A species regains a state its direct ancestor has lost.
• Famous examples:
  - Teeth in birds.
  - Legs in snakes.
Evolutionary trees should avoid convergence transitions
• Two species possess the same state while their least common ancestor possesses a different state.
• Famous example: The marsupials.
Common Assumption:
Characters with reversal or convergence transitions are highly unlikely in the evolutionary tree.
A character that exhibits neither reversals nor convergence is called homoplasy free.
A character is Homoplasy Free
↕
The corresponding coloring is convex
(each color induces a block)
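Reading "block" as a connected subtree, convexity of a total coloring on a given tree can be checked directly: for every color, the vertices carrying it must be connected using only vertices of that color. A minimal sketch under that reading, with the tree given as an adjacency dictionary (the representation and the name `is_convex` are assumptions made for this example):

```python
from collections import deque


def is_convex(tree: dict, coloring: dict) -> bool:
    """Return True if every color class induces a connected subtree.

    tree:     vertex -> list of neighbouring vertices (an undirected tree)
    coloring: vertex -> color, defined for every vertex (a total coloring)
    """
    for color in set(coloring.values()):
        block = {v for v, c in coloring.items() if c == color}
        # Breadth-first search restricted to vertices of this color.
        start = next(iter(block))
        seen = {start}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in tree[v]:
                if w in block and w not in seen:
                    seen.add(w)
                    queue.append(w)
        if seen != block:
            return False  # this color is split into several pieces
    return True


# A path a - b - c - d with two colorings.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(is_convex(path, {"a": "R", "b": "R", "c": "B", "d": "B"}))  # True
print(is_convex(path, {"a": "R", "b": "B", "c": "R", "d": "B"}))  # False
```

In the second coloring the color R reappears after being lost along the path, which is the kind of transition a homoplasy-free character excludes.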
A partial coloring is convex if it can be
completed to a (total) convex coloring
The Perfect Phylogeny Problem
• Input: a set of species, and many characters, each assigning states (colors) to the species.
• Question: is there a tree T containing the species as vertices, in which all the characters (colorings) are convex?
The Perfect Phylogeny Problem
(combinatorial setting)
Input: Some colorings (C1,…,Ck) of a set of vertices (in the example: 3 colorings, left, center, and right, each using the same two colors, written R and B).
RRB
BBR
RRR
RBR
Problem: Is there a tree T which includes these vertices, such that (T,Ci) is convex for i=1,…,k?
NP-hard in general; in P for some special cases.
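One well-known polynomial special case is when every character has at most two states. A classical result then says that a perfect phylogeny exists exactly when every pair of characters is compatible, meaning the species never show all four state combinations for that pair. The sketch below applies this pairwise test to the example above. It is offered as an illustration of that special case, not necessarily the one the lecture has in mind, and the function names are made up for the example.

```python
from itertools import combinations


def pair_compatible(char1, char2) -> bool:
    """Two two-state characters are compatible if the species do not
    exhibit all four combinations of their states."""
    return len(set(zip(char1, char2))) < 4


def binary_perfect_phylogeny_exists(species_rows) -> bool:
    """species_rows[i][j] is the state of species i under character j.
    Valid only when each character takes at most two states."""
    characters = list(zip(*species_rows))
    if any(len(set(ch)) > 2 for ch in characters):
        raise ValueError("pairwise test applies to two-state characters only")
    return all(pair_compatible(a, b) for a, b in combinations(characters, 2))


# The four vertices of the example, each labelled by its three colors.
rows = ["RRB", "BBR", "RRR", "RBR"]
print(binary_perfect_phylogeny_exists(rows))  # True
```

With k characters over n species the test makes O(k^2) pairwise comparisons of length n, so it runs in polynomial time.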