Bioinformatics and modelling
Biological sequences
&
Challenges for the Bioinformatician
The tree of life
All living organisms (on Earth) use 3 major macromolecules:
DNA
RNA
proteins
DNA - RNA - proteins
DNA: CCTGAGCCAACTATTGATGAA
transcription (transcriptional regulation)
mRNA: CCUGAGCCAACUAUUGAUGAA
translation
Protein: PEPTIDE
The main role of DNA is information storage and transmission
The main role of protein is to direct processes on which life depends
(energy metabolism, biosynthesis, inter-cellular communication)
DNA - RNA - proteins
A glance at the chain of command in the cell reveals DNA as
the boss, running the show by its coded instructions, getting
RNA to do all the fetching and carrying work, and telling the
ribosomes what proteins to make next. The proteins have a
completely servile role, but are the real workers.
P. Davies (2003) The origin of life, Penguin, UK, p. 105
DNA
DNA (deoxyribonucleic acid)
A DNA sequence is a linear biopolymer of nucleotides.
A nucleotide is composed of 3 main parts: a base, a
pentose sugar, and a phosphate group (fig. A).
There are 4 bases, separated into 2 groups: purines
(adenine, guanine) and pyrimidines (cytosine, thymine)
(fig. B).
In DNA, nucleotides are linked by phosphodiester
bonds (fig. C).
DNA structure
X-ray diffraction of DNA
Rosalind Franklin
James Watson &
Francis Crick
(Nobel Prize, 1962)
Watson & Crick (1953)
Nature. 171:737-8.
DNA structure
In cells, DNA exists mainly as a two-strand coiled structure (double helix).
The two strands are held together by hydrogen bonds between the bases. The bases are
located inside the helix, the phosphate-linked sugars forming a backbone on the outside.
The base pairing is complementary: A can only bind T: A = T (2 hydrogen bonds), and G
can only bind C: G ≡ C (3 hydrogen bonds).
Chargaff's rules
Chargaff studied the composition in nucleic acids of
each strand of the DNA and found two rules:
Chargaff's first rule
The 2 strands of DNA are sometimes called Watson (w) and Crick (c):
Ac = Tw , Aw = Tc , Cc = Gw , Cw = Gc
(the letters represent the molar fraction of a base on one strand).
Chargaff's first rule expresses the fact that double-stranded DNA
obeys Watson-Crick base pairing.
Chargaff's second rule
The second rule is much less understood:
Ac ≈ Tc , Aw ≈ Tw , Cc ≈ Gc , Cw ≈ Gw
This parity is observed in most genomes (eukaryotic chromosomes,
bacteria, viruses, but not in mitochondria and plasmids).
Chargaff (1951) Some recent studies on the composition and
structure of nucleic acids. J Cell Physiol Suppl 38:41-59
Erwin Chargaff
(1905-2002)
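Chargaff's second rule can be checked on any single strand simply by counting bases. The following is a minimal Python sketch; the function name `base_fractions` and the toy sequence are illustrative assumptions, not part of the original course material. The toy strand was built by joining a short sequence to its own reverse complement, so it satisfies the parity exactly; a real genome would only satisfy it approximately.

```python
from collections import Counter

def base_fractions(seq):
    """Return the molar fraction of A, C, G and T on one strand."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

# Toy strand (illustrative only, not real genomic data):
# ATGCGCGGATG followed by its reverse complement CATCCGCGCAT.
strand = "ATGCGCGGATG" + "CATCCGCGCAT"
fractions = base_fractions(strand)

# Second rule, within one strand: A ~ T and C ~ G.
print(fractions)
```

On real data one would compare `fractions["A"]` with `fractions["T"]` and `fractions["C"]` with `fractions["G"]` and expect small, non-zero differences.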
DNA packing
In cells, DNA is packed into a compact structure thanks
to specialized proteins called histones. "Chromatin"
usually refers to the DNA-histone complex.
The fundamental packing unit is known as a nucleosome. Each
nucleosome is about 11 nm in diameter. The DNA double helix wraps
around a central core of 8 histone protein molecules (an octamer) to form a
single nucleosome. Additional histone proteins fasten the DNA to the
nucleosome core. Nucleosomes are usually packed together, with the aid
of yet other histones, to form a 30 nm wide fiber. As a 30 nm fiber, the
typical human chromosome would be about 0.1 cm in length and would
span the nucleus 100 times. This suggests higher orders of packaging, to
give a chromosome its compact structure.
DNA, chromosomes, and genome
The main role of DNA is information storage. It is transmitted from
generation to generation: all the information required to make and maintain
a new organism is stored in its DNA.
The information required to reproduce even very complex organisms is
stored on a relatively small number of DNA molecules (the chromosomes).
This set of molecules is called the organism's genome.
- In humans, there are 46 DNA molecules in each cell, organised into chromosomes.
- In bacteria, there is often a single, circular chromosome, but also
nonchromosomal DNA molecules called plasmids.
Genome size
Organism                     Year   Size (Mb)
Mycoplasma genitalium        1995   0.6
Haemophilus influenzae       1995   1.8
Escherichia coli             1997   4.6
Saccharomyces cerevisiae     1996   12
Schizosaccharomyces pombe    2002   14
Caenorhabditis elegans       1998   97
Arabidopsis thaliana         2001   120
Oryza sativa                 2002   5 000
Drosophila melanogaster      2000   180
Gallus gallus                2004   1 200
Rattus norvegicus            2004   2 900
Mus musculus                 2002   3 400
Homo sapiens                 2001   3 400
1 Mb = 1 000 000 bases
Sources: Jacques van Helden + GenBank
See also: Database Of Genome Sizes
http://www.cbs.dtu.dk/databases/DOGS/
http://www.genomesize.com/
Genome size
Comparison of genome size
Genome size (bp)
Source: Cann (1997) Principles of Molecular Virology, Academic Press
Genome size
How many books/CDs do I need to "write" the human genome?
If the sequence obtained were to be stored in book form, and if each page
contained 1000 base pairs and each book contained 1000 pages, then
3300 such books would be needed in order to store the complete genome.
However, if expressed in units of
computer data storage, 3.3 billion base pairs recorded at 2 bits per pair would
equal 786 megabytes of raw data. This
is comparable to a fully loaded data CD.
Source:
http://en.wikipedia.org/wiki/Human_Genome_Project
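The two figures above can be reproduced with a few lines of arithmetic. This is a small sketch using only the numbers quoted in the text; the variable names are illustrative. Note that the figure of 786 is obtained when a "megabyte" is taken as 2^20 bytes.

```python
GENOME_BP = 3_300_000_000   # 3.3 billion base pairs, as quoted above
BP_PER_PAGE = 1000
PAGES_PER_BOOK = 1000
BITS_PER_BP = 2             # 4 possible bases fit in 2 bits

books = GENOME_BP / (BP_PER_PAGE * PAGES_PER_BOOK)
total_bytes = GENOME_BP * BITS_PER_BP / 8
mebibytes = total_bytes / 2**20   # binary megabytes (2**20 bytes each)

print(f"{books:.0f} books, {mebibytes:.1f} MiB")
```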
The first printout of the human
genome to be presented as a series
of books, displayed at the Wellcome
Collection, London
Challenge for the bioinformatician
Is the nucleotide sequence random? Or are there some
"preferences" in the choice and in the order of the nucleotides?
Are those "preferences" related to biological functions or structure?
If yes, could we predict biological function or structure of DNA on
the basis of the sequence?
Sequence comparison (pairwise alignments, multiple alignments,
database search, motif search, etc) and sequence statistics
(based on symmetries, GC content, motif occurrence statistics,
entropy, correlation, etc) will help the bioinformaticians to answer
these questions.
DNA representation
The complementarity between base pairs (A = T and G ≡ C) implies that if you know
one sequence you can deduce the complementary sequence.
It is common to represent DNA sequences by 4-letter strings:
TGCTAATGCCGCTACTCTATCTGC
By convention, we write sequences from 5' to 3' end.
5'- TGCTAATGCCGCTACTCTATCTGC - 3'
Source: Jacques van Helden
DNA representation
Don't forget the second strand!
When we analyze a DNA sequence represented by
ATGCGCGGATG
we should keep in mind that the corresponding molecule is a double strand helix with
the following base pairs:
5' - ATGCGCGGATG - 3'   (upper strand)
     |||||||||||
3' - TACGCGCCTAC - 5'   (lower strand)
Note that the concepts of upper strand and lower strand are purely artificial. A DNA
molecule is a 3D structure and there is no reason to prefer one strand over
the other.
This does not mean however that the two strands are functionally equivalent: in
coding regions for example, only one strand will serve as a template for the synthesis
of RNA.
Source: Jacques van Helden
DNA representation
Reverse complementarity
Reverse complementary sequences represent the two strands of the same DNA
molecule.
The reverse complement is obtained by transposing each nucleotide into its
complementary nucleotide (A → T, T → A, C → G, G → C), and then reversing the
string.
For example the sequences ATGCGCGGATG and CATCCGCGCAT are mutually reverse
complementary. These strings describe the two strands of the same DNA molecule.
Consequently, the two following double-strand schemes represent the same
molecule:
5'- ATGCGCGGATG - 3'        5'- CATCCGCGCAT - 3'
    |||||||||||                 |||||||||||
3'- TACGCGCCTAC - 5'        3'- GTAGGCGCGTA - 5'
Source: Jacques van Helden
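The reverse complement operation described on this slide (complement each nucleotide, then reverse the string) can be sketched in Python as follows. The function name `reverse_complement` is an illustrative choice; the sketch handles only the four letters A, C, G, T (in either case) and leaves any other character unchanged.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCGCGGATG"))  # CATCCGCGCAT
```

Applying the function twice gives back the original sequence, which reflects the fact that both strings describe the same molecule.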
Symmetries in DNA sequences
Tandem repeat
GATAAGATAAGATAAGATAA = 2 x GATAAGATAA = 4 x GATAA
GATAAGATAAatgtagGATAAGATAA = 2 x GATAAGATAA separated by
a non repeated sequence.
Tandem repeats are presumed to occur frequently in genomic sequences,
comprising perhaps 10% or more of the human genome (Benson, NAR
27:573,1999).
Tandem repeats are sometimes associated with a repeated structure in a
protein (e.g. some ABC transporters have been shown to contain a tandem
repeat of six transmembrane helices, Tusnady et al, FEBS Lett 402:1-3,1997).
In recent years, short tandem repeat polymorphisms have been found to be
involved in various diseases (e.g. cancer, Huntington's disease, Parkinson's disease, ..., Zhang &
Yu, Eur J Surg Oncol,33:529-34,2007).
ATP-binding
Cassette (ABC)
transporters
Source: Jacques van Helden
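The decomposition shown in the first example (4 x GATAA) can be found by testing every possible unit length. Below is a minimal Python sketch, assuming exact repeats only; the function name `smallest_repeat_unit` is hypothetical, and real genomic tandem repeats are often approximate, which this sketch does not handle.

```python
def smallest_repeat_unit(seq):
    """Return (unit, copies) for the shortest unit whose exact tandem
    repetition gives seq; returns (seq, 1) if seq is not a tandem repeat."""
    n = len(seq)
    for size in range(1, n + 1):
        # A unit must divide the length and rebuild the whole sequence.
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size], n // size

unit, copies = smallest_repeat_unit("GATAAGATAAGATAAGATAA")
print(unit, copies)  # GATAA 4
```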
Symmetries in DNA sequences
Textual palindromes
ATGGCCGGTA = ATGGC|CGGTA
Note that the corresponding DNA molecule does not contain any axis of
symmetry since in 3D space a nucleotide cannot be superimposed on its
own image. Therefore, searching for palindromes is not relevant for
detecting biological features.
5'- ATGGC - 3' ≠ 5'- CGGTA - 3'
Source: Jacques van Helden
Symmetries in DNA sequences
Reverse complementary palindromes
A reverse complementary palindrome is a sequence identical to its reverse complement.
Example: ATGGGGCCCCAT
Reverse complementary palindromes correspond to 3D symmetries in DNA molecules. In the
following 2-strand representation, a 180° rotation around the center would swap the two
strands, and each letter would take place of an identical letter on the complementary strand.
5'- ATGGGG CCCCAT - 3'
    ||||||.||||||
3'- TACCCC GGGGTA - 5'
Note that this sequence is not a textual palindrome.
Note that reverse complementary palindromes can be separated by a stretch of nonsymmetrical nucleotides.
Source: Jacques van Helden
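The difference between a textual palindrome and a reverse complementary palindrome is easy to test in code. The following Python sketch uses the two example sequences from the slides; the function names are illustrative assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every nucleotide, then reverse the string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_textual_palindrome(seq):
    """True if the string reads the same in both directions."""
    return seq.upper() == seq.upper()[::-1]

def is_rc_palindrome(seq):
    """True if the sequence is identical to its reverse complement."""
    return seq.upper() == reverse_complement(seq)

for seq in ("ATGGCCGGTA", "ATGGGGCCCCAT"):
    print(seq, is_textual_palindrome(seq), is_rc_palindrome(seq))
```

The first sequence is a textual palindrome only, and the second is a reverse complementary palindrome only.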
Symmetries in DNA sequences
Reverse complementary motifs play important roles in biological mechanisms.
Example 1: some classes of transcription factors
(e.g. helix-turn-helix) typically form homodimers
whose tridimensional structure is symmetrical.
These protein complexes specifically recognize
reverse complementary motifs in gene
promoters.
cAMP Receptor Protein (CRP)
TGTGA-N6-TCACA
Example 2: In bacteria, hexamers with reverse
complementary palindromic structure also play
an essential role as recognition sites for
restriction enzymes.
---AAGCTT---
---TTCGAA---
The restriction enzyme HindIII specifically cuts DNA at
instances of AAGCTT.
Symmetries in DNA sequences
Reverse complementary motifs play important roles in biological mechanisms.
Example 3: Reverse complementary motifs
separated by a stretch are frequent in RNA,
where they mediate the pairing between distant
segments of the molecules.
5'- UCGGGcucauaaCCCGA - 3'
[Figure: folding into a stem-loop structure. The stem is formed by the base pairs U-A, C-G, G-C, G-C, G-C between 5'-UCGGG and CCCGA-3'; the loop is cucauaa.]
Source: Jacques van Helden
RNA
RNA (ribonucleic acid)
An RNA sequence is also a linear biopolymer of nucleotides, but its chemical
composition differs from that of DNA by 2 features: (1) the sugar (ribose)
carries one additional hydroxyl (OH) group, (2) RNA contains the base uracil instead of thymine.
RNA (ribonucleic acid)
RNA is synthesized using DNA as template, with a one-to-one correspondence.
DNA (template)   RNA (synthesized)
A                U
C                G
G                C
T                A
Thus, it is possible to deduce the RNA sequence (that will be synthesized)
from the (template) DNA sequence:
DNA (template):     CTGCTAGCAAGATCTG
RNA (synthesized):  GACGAUCGUUCUAGAC
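The one-to-one correspondence above translates directly into a lookup table. Here is a minimal Python sketch; the function name `transcribe_template` is an illustrative assumption. It follows the layout of the slide, pairing each template base with the RNA base written beneath it, without reversing the string.

```python
# Template DNA base -> RNA base synthesized opposite to it.
TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")

def transcribe_template(dna_template):
    """Return the RNA complementary to the template, base by base."""
    return dna_template.upper().translate(TEMPLATE_TO_RNA)

print(transcribe_template("CTGCTAGCAAGATCTG"))  # GACGAUCGUUCUAGAC
```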
Roles of RNA
RNA molecules have multiple roles, mainly related to the transfer
of information from DNA to protein.
They can be classified into several types:
- messenger RNA (mRNA): Their role is to mediate the synthesis of proteins
from the DNA (genes). They are synthesized during transcription and are
used during translation as a template to build proteins.
- transfer RNA (tRNA): Amino acids do not recognise RNA codons directly.
The role of tRNA is to transfer the right amino acid to the growing
polypeptide chain.
- ribosomal RNA (rRNA): They form the structural and catalytic core of ribosomes.
- microRNA (miRNA): Recently discovered, miRNAs have been found to have
multiple roles, including the regulation of gene expression.
RNA structure
Generally, RNA does not form a double helix, but contains base pairing and
loops (we refer to these structures as secondary RNA structure, the primary
structure being its ribonucleic sequence).
Tetrahymena ribozyme
Challenge for the bioinformatician
One of the challenges for the bioinformatician is to predict the secondary
structure of RNA on the basis of the primary structure of RNA.
One approach is based on primary sequence analysis. The idea is to find
which parts of the sequence are complementary and would therefore be
able to pair.
Another approach relies on minimum energy computation.
Note that this topic will not be covered in this course. For more details, see Mount
(2004) Bioinformatics: Sequence and Genome analysis (Chapter 8 - prediction of
RNA secondary structure)
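As a toy illustration of the first approach (looking for complementary segments that could pair), the Python sketch below scans an RNA for two segments that are reverse complementary and separated by a loop. It is only a naive sketch under strong assumptions: exact Watson-Crick pairs only (no G-U pairs, no mismatches), a fixed stem length, and no energy model. The names `find_stems`, `stem_len` and `min_loop` are hypothetical.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def rna_reverse_complement(seq):
    """Complement every ribonucleotide, then reverse the string."""
    return seq.translate(RNA_COMPLEMENT)[::-1]

def find_stems(rna, stem_len=5, min_loop=3):
    """List (i, j, segment) where rna[i:i+stem_len] can pair with
    rna[j:j+stem_len], with at least min_loop unpaired bases between."""
    rna = rna.upper()
    stems = []
    for i in range(len(rna) - 2 * stem_len - min_loop + 1):
        left = rna[i:i + stem_len]
        partner = rna_reverse_complement(left)
        # The partner segment must start after the stem and the loop.
        for j in range(i + stem_len + min_loop, len(rna) - stem_len + 1):
            if rna[j:j + stem_len] == partner:
                stems.append((i, j, left))
    return stems

stems = find_stems("UCGGGcucauaaCCCGA")
print(stems)  # [(0, 12, 'UCGGG')]
```

On the stem-loop example seen earlier, the sketch recovers the 5-base stem (positions 0 and 12, counting from 0) around the 7-base loop.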
PROTEINS
Proteins
A protein (or, more generally, a polypeptide) is a biopolymer of
amino acids (aa).
An amino acid contains an amine group (NH2), a carboxyl group (COOH), and a side
chain (usually denoted by R for "residue"). Amino acids differ by their side chain. In
natural proteins, 20 different amino acids are commonly found.
Amino acids are linked by a peptide bond between the amine and the carboxyl
groups.
Proteins
Here are the 20 amino acids.
The residues are highlighted in red.
Name             1-letter   3-letter
Glycine          G          Gly
Alanine          A          Ala
Valine           V          Val
Leucine          L          Leu
Methionine       M          Met
Isoleucine       I          Ile
Serine           S          Ser
Threonine        T          Thr
Cysteine         C          Cys
Proline          P          Pro
Asparagine       N          Asn
Glutamine        Q          Gln
Phenylalanine    F          Phe
Tyrosine         Y          Tyr
Tryptophan       W          Trp
Lysine           K          Lys
Arginine         R          Arg
Histidine        H          His
Aspartate        D          Asp
Glutamate        E          Glu
Proteins
The amino acids can be
classified into different
groups:
Charged:
(+) Arg, His, Lys
(-) Asp, Glu
Polar (uncharged):
Ser, Thr, Asn, Gln, Tyr
Nonpolar (hydrophobic):
Ala, Ile, Leu, Met, Phe,
Trp, Val
Others:
Cys, Gly, Pro
Roles of proteins
Proteins have multiple roles:
- They catalyze most of the biochemical reactions (enzymes).
- They regulate gene expression (transcription factors).
- They play important roles in cellular structure and motion
(cytoskeleton, channels in membranes).
- They are involved in signalling pathways (hormones, receptors).
- They are involved in the immune system (antibodies).
- They act as transporters (hemoglobin, myoglobin).
- ...
Protein structure
Primary structure: amino acid sequence.
Secondary structure: alpha helix and beta sheets
Tertiary structure: 3D structure of a single protein molecule (one chain)
Quaternary structure: complex of several protein molecules or polypeptide
chains (called protein subunits).
myoglobin
GLSDGEWQLVLNVWGKVEADIP
GHGQEVLIRLFKGHPETLEKFD
KFKHLKSEDEMKASEDLKKHGA
TVLTALGGILKKKGHHEAEIKP
LAQSHATKHKIPVKYLEFISEC
IIQVLQSKHPGDFGADAQGAMN
KALELFRKDMASNYKELGFQG
amino acid sequence
(primary structure )
α helix
(secondary structure )
3D folding
(tertiary structure)
Protein structure
Chain B of Protein Kinase C
Quaternary structure of
Protein Kinase C
Protein structure
The two main types of
secondary structure are
α helices and β sheets
anti-parallel β sheet
amino acid
subunits
α helix
Protein structure
Turns, hairpins and loops
A third type of secondary structure is the β turn. These are short regions where
the protein chain makes a 180° change in direction, doubling back on itself. Such
hairpin turns are found for example between two adjacent β strands.
The side chain R3
is usually H
(glycine)
The remainder of the protein structure has much less order, and can be viewed
as simply the connecting pieces (loops) that allow the α helices and β sheets to
pack.
Protein folding
Protein chains themselves rarely have a
biological function. It is only when the chain
has folded into a three-dimensional structure
that the protein has functional activity. In the
folded form, some distant residues can come
close to each other.
Many proteins fold up into several discrete
structural units, each of which is termed a
protein domain. Each domain is associated
with a specific biochemical or binding function.
Challenge for the bioinformatician
One challenge for the bioinformatician will be to predict protein
structure, function and organisation (protein-protein interactions,
protein complexes) on the basis, for example, of the amino acid sequence
(or, sometimes, of the coding DNA sequence!). Sequence analysis is an
alternative to 3D modeling to predict secondary structure and to detect
functional domains.
Due to the various properties of the amino acid side chains, certain residues are found
more often in one or the other structural unit. Some residues have been classified, for
example, as α-helix breakers. Proline for example is a poor helix former due to the fact
that its backbone N atom is already bound to its own side chain and cannot form
H-bonds within the helix. Good α-helix formers are Ala, Glu, Leu, and Met, whereas
good β-strand formers are Val, Ile, Tyr, and Cys. These types of preferences have
been used to predict secondary structure on the basis of amino acid composition.
A second approach is to make use of evolutionary relations: proteins that have a
common ancestor are said to be homologous. Sequence alignment and database
searching can identify homologous proteins. Such homologs often (but not always)
share a common structure and function. Structure and function can thus be inferred
from known proteins.
DNA - RNA - PROTEIN
Central dogma
In 1958, Francis Crick formulated his
famous “central dogma”:
The central dogma states that once
"information" has passed into protein it
cannot get out again. In more detail, the
transfer of information from nucleic acid to
nucleic acid, or from nucleic acid to protein
may be possible, but transfer from protein
to protein, or from protein to nucleic acid is
impossible. Information means here the
precise determination of sequence, either
of bases in the nucleic acid or of amino
acid residues in the protein.
Quoted from Crick (1958).
Crick FH (1958) On protein synthesis. Symp Soc
Exp Biol 12:138-63.
Crick F (1970) Central dogma of molecular biology.
Nature, 227:561-3.
Source: Jacques van Helden
Central dogma
- This “central dogma” is often summarized by
the following sentence:
DNA makes RNA makes protein
- Note that what Crick called the “central
dogma” is not a dogma at all. On the
contrary, what he proposed fully deserves to
be called a “scientific theory”.
- This formulation is admirably clear and seems
to be as valid today as on the day it
was formulated. Even the prion is no
exception to this dogma, since Crick defines
information as “the precise determination
of sequence”, and not the conformation that a
protein might take under particular conditions.
- The central idea of Crick’s 1958 paper has
been so often misunderstood that Crick
himself felt it necessary to write a clarification
12 years later (Crick, 1970).
Source: Jacques van Helden
Central dogma
DNA is transcribed into
mRNA, which is in turn
translated into protein
molecules.
Knowing the DNA sequence, it
is in principle possible to
deduce the mRNA sequence
and the amino acid sequences
of the corresponding protein.
However, because the genetic
code is degenerate, the
reverse is not possible: we
cannot deduce the actual DNA
sequence coding for a given
protein.
Replication
Replication: when the cell divides, the whole genome needs to be
replicated (each daughter cell must receive the full DNA content).
In addition to DNA polymerase (which
synthesizes new DNA by polymerization),
several additional enzymes are needed:
- DNA topoisomerase: to untwist the DNA
- DNA helicase: to separate the 2 strands
- DNA ligase: to join the newly synthesized DNA fragments
Replication fork
DNA is first untwisted and the 2 separated strands then
serve as templates for the synthesis of new double-stranded
DNA.
Each DNA strand is read from 3' to 5' and the
complementary strand is synthesized from 5' to 3'.
This leads to an asymmetry because on one strand
(the leading strand) the complementary strand is
synthesized continuously in the direction of the DNA
opening, while on the other strand (the lagging
strand) the complementary strand is synthesized
segment by segment (Okazaki fragments).
Gene
A gene is a portion of DNA that codes for a protein. In practice, the gene is
sometimes considered also to include surrounding regions of non-coding DNA
that act as control regions.
[Figure: a gene on the DNA with its control region and the reverse complement strand; the sense of RNA synthesis; the mRNA (5' to 3') and the protein.]
Transcription
From DNA to RNA: transcription
One strand of the DNA is
involved in the synthesis of
RNA. Note that the RNA
synthesized is complementary
to the template DNA.
DNA transcription is carried out by
RNA polymerase. RNA polymerase is a
large protein complex which reads DNA,
recruits the correct ribonucleotides, and
links them together.
Transcription
From DNA to RNA: transcription
Remark:
RNA polymerase reads the
DNA template strand (from
3' to 5'), which is
complementary to the
coding strand.
RNA is thus synthesized
from 5' to 3'. Its sequence
is complementary to the
template and identical to
the coding strand (except
that it is composed of
ribonucleotides and that
thymine is replaced by
uracil).
mRNA organisation
5' cap: The 5' cap is a modified guanine nucleotide added to the "front" (5' end) of the pre-mRNA
using a 5'-5' triphosphate linkage. This modification is critical for recognition and proper attachment
of mRNA to the ribosome, as well as protection from 5' exonucleases.
Coding regions (CDS): Coding regions are composed of codons, which are decoded and translated
into proteins by the ribosome. Coding regions begin with the start codon (see later) and end with one
of the three possible stop codons. In addition to coding for protein, portions of coding regions may also
serve as regulatory sequences.
Untranslated regions (UTR): Untranslated regions are sections of the RNA before the start codon
and after the stop codon that are not translated, termed the 5' untranslated region (5' UTR) and 3'
untranslated region (3' UTR). Several roles in gene expression have been attributed to the
untranslated regions, including mRNA stability, mRNA localization, and translational efficiency.
3' poly(A) tail (polyadenylation): The 3' poly(A) tail is a long sequence of adenine nucleotides
(often several hundred) added to the 3' end of the pre-mRNA through the action of an enzyme,
polyadenylate polymerase. Interestingly, in higher eukaryotes, the poly(A) tail is added onto
transcripts that contain a specific sequence, the AAUAAA signal.
Translation
From RNA to protein: translation
RNA translation is carried out by the
ribosomes. A ribosome is an RNA-protein
complex (composed of a large subunit and
a small subunit) which reads the
messenger RNA (mRNA), recruits the
correct amino acids, and links them
together.
Genetic code
Gamow's diamond code
George Gamow
(one of the proponents of
the Big Bang theory!)
In Gamow's proposal (1954), which he called the diamond code,
double-stranded DNA acted directly as a template for assembling
amino acids into proteins. As Gamow saw it, the various
combinations of bases along one of the grooves in the double helix
could form distinctively shaped cavities into which the side chains of
amino acids might fit. Each cavity would attract a specific amino acid;
when all the amino acids were lined up in the correct order along the
groove, an enzyme would come along to polymerize them.
Each of Gamow's cavities was bounded by the bases at the four
corners of a diamond. If the DNA helix is oriented vertically, the
bases at the top and bottom corners of a diamond are on the same
strand and are separated by a single intervening base; the left and
right corners of the diamond are defined by that intervening base and
by its complementary partner on the opposite strand.
Source: http://www.americanscientist.org
Genetic code
The forgotten code cracker...
Marshall Nirenberg
(Nobel prize 1968)
Nirenberg M, Leder P. (1964) RNA codewords and protein
synthesis: the effect of trinucleotides upon the binding of sRNA to
ribosomes. Science. 145: 1399–1407.
Nirenberg M, Leder P, Bernfield M, Brimacombe R, Trupin J,
Rottman F, O'Neal C (1965) RNA codewords and protein synthesis,
VII. On the general nature of the RNA code. Proc Natl Acad Sci
USA. 53: 1161–1168.
Source: E. Regis, Sci. Am., nov. 2007
Genetic code
Three successive nucleotides (called a codon) code for one amino acid.
The correspondence between the codons and the amino acids constitutes
the genetic code.
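The correspondence between codons and amino acids is a simple lookup table, which makes translation easy to sketch in Python. The block below builds the standard genetic code from a compact 64-letter string (codons enumerated with bases in the order U, C, A, G; `*` marks a stop codon) and translates the mRNA shown at the start of the course. The names `CODON_TABLE` and `translate` are illustrative, and the sketch translates the whole string in frame 1 without looking for start or stop signals.

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the standard genetic code, for codons UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate an mRNA codon by codon (T is read as U; '*' marks a stop)."""
    mrna = mrna.upper().replace("T", "U")
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

print(translate("CCUGAGCCAACUAUUGAUGAA"))  # PEPTIDE
```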
Reading frame
The codons are not overlapping. On one strand of DNA, there are thus three
possible reading frames. Each frame would code for a different amino acid
sequence.
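The three reading frames of one strand can be listed by shifting the starting position by 0, 1 or 2 nucleotides. A minimal Python sketch follows; the function name `reading_frames` is an illustrative assumption, and the example reuses the mRNA from the beginning of the course.

```python
def reading_frames(seq):
    """Return the list of complete codons for each of the 3 frames of one strand."""
    frames = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames

frames = reading_frames("CCUGAGCCAACUAUUGAUGAA")
for number, codons in enumerate(frames, start=1):
    print(number, " ".join(codons))
```

Taking the reverse complement strand into account in the same way gives six possible frames for a double-stranded DNA molecule.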
START and STOP signals
Start and end of translation are marked by specific start (AUG) and stop
(UAA, UAG, UGA) codons, but that's not all: transcription has its own signals.
Prokaryotes
START consensus sequence:
at position -10: TATAAT
at position -35: TTGACA
STOP signal:
2 short stretches of complementary sequence that can base-pair to form an
RNA double helix, usually involving several CG base pairs.
Eukaryotes
START consensus sequence:
at position -25: TATA box (recognized by a TATA-binding protein, TBP)
STOP signal:
The AAUAAA sequence results in the cleavage of the 3' end of the transcript
some 10-30 bases after the signal.
Open reading frame (ORF)
An open reading frame (ORF) is a part of DNA which contains a sequence that could
potentially code for a protein. It is usually a long portion of DNA sequence starting with a
start codon and not interrupted by a stop codon.
The detection of long ORFs is usually a good indication of the presence of a gene, but
additional information might also be used in order to support the prediction, such as the
codon bias. Since the start and stop of the ORF are not equivalent to the ends of the
mRNA, a typical ORF finder will employ algorithms based on existing genetic codes and
codon usage and all possible reading frames.
Additional difficulties may arise in eukaryotes where long parts of the DNA within an ORF
are not translated into the protein (introns).
Short ORFs can also occur by chance outside of a gene. Usually such ORFs are not very
long and terminate after a few codons.
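A very naive ORF finder can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it scans the three frames of the given strand only (a fuller scan would also process the reverse complement), it uses the standard start and stop codons, and it ignores introns and codon usage. The names `find_orfs` and `min_codons`, and the toy sequence, are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch found
    in the three forward frames; coordinates are 0-based, end excluded."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close it at the stop codon
    return orfs

# Toy sequence (illustrative only).
orfs = find_orfs("AAATGCCTGAGCCAACTTAAGG")
print(orfs)  # [(2, 20, 'ATGCCTGAGCCAACTTAA')]
```

Raising `min_codons` is the simplest way to discard the short ORFs that occur by chance.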
In a "random" sequence, on
average, we will find 3 stop codons
every 64 codons (about 1 every 21 codons).
A typical human gene is 10-15 kb long. Thus, we do not expect to
find a stop codon in such a 3,300-5,000
codon long sequence. However...
Introns and exons
Exons are parts of DNA which will be translated into the protein. Introns are parts of
the DNA that will not be translated into the protein. Introns are nevertheless present
in the primary transcript (pre-mRNA) but are subsequently removed by a process called splicing.
Typically, the introns are spliced out by a two-stage process: (1) the pre-mRNA forms a
loop structure (called a lariat) involving an adenine base, (2) the two exons are then
joined and the intron is released.
Gene length
Human genes vary enormously in size and exon content. Exon content is
shown as a percentage of the lengths of indicated genes.
Source: Strachan & Read, Human molecular genetics, BIOS 1999
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg
Overlapping genes
Overlapping genes are defined as a pair of adjacent genes whose coding regions are
partially overlapping. In other words, a single stretch of DNA codes for portions of two
separate proteins. Such an arrangement of genetic code is ubiquitous. Many overlapping
genes have been identified in the genomes of prokaryotes, eukaryotes, mitochondria, and
viruses.
One consequence of overlapping genes is to reduce the tolerance for mutation. It was shown that
overlapping reduces the probability of accumulating so-called neutral mutations in a gene (mutations
that have no effect). Neutral mutations are unlikely with overlapping genes, because the mutation
must have no effect on two genes with different reading frames.
Veeramachaneni et al (2004) identified 1316 pairs of overlapping genes in humans and mice.
Example of overlap between 3 human genes: MUTH, FLJ13949, and TESK2.
Dark green boxes represent
coding sequences.
Light green boxes represent
untranslated regions.
Source: Veeramachaneni et al (2004) Mammalian overlapping genes:
the comparative perspective. Genome Res 14:280-286.
Operons
In prokaryotes, several genes can be under the control of a single promoter. The
different genes have nevertheless each a start codon and a ribosome binding site.
Such an organisation is called an operon.
Examples (from E. coli):
The lac operon: promoter - lacZ - lacY - lacA
The trp operon: promoter - trpE - trpD - trpC - trpB - trpA
Coding vs non-coding DNA
Most of the DNA does not code for protein...
Organism                   Year   Genome      Number     Gene           Coding   Non-coding
                                  size (Mb)   of genes   spacing (kb)   (%)      (%)
Mycoplasma genitalium      1995   0.6         481        1.2            90       10
Haemophilus influenzae     1995   1.8         1 717      1.0            86       14
Escherichia coli           1997   4.6         4 289      1.1            87       13
Saccharomyces cerevisiae   1996   12          6 286      1.9            72       28
Arabidopsis thaliana       2001   120         27 000     4.4            30       70
Caenorhabditis elegans     1998   97          19 000     5.1            27       73
Drosophila melanogaster    2000   165         16 000     10.3           15       85
Mus musculus               2002   3 400
Homo sapiens               2001   3 400       30 000     103.2          3        97
Although non-coding DNA sequences are sometimes referred to as junk DNA, they
contain many signals that allow proper regulation of gene expression.
Also, note that part of the DNA is transcribed into RNA that is subsequently not translated
into protein (cf. tRNA, rRNA, siRNA, but also introns).
Challenge of the bioinformatician
One challenge for the bioinformatician is to predict coding and non-coding genes from raw (genomic) sequences. This question has become
particularly important since the advent of large-scale sequencing.
Gene prediction is not a trivial task. A long ORF does not always mean
that a gene is encoded. Long ORFs might occur by chance. On the other
hand, short ORFs can be exons. To support the gene predictions,
additional criteria (based for example on nucleotide composition) are
needed. The predictions can also be validated by comparing the
sequence with genes identified in other organisms.
Another challenge is to identify the regulatory region (promoter) of a
gene. For this purpose, we can make use of various signals (TATA
box, etc.).
Similarly, an mRNA sequence can be analysed in order to
detect the coding sequence (CDS).
Control of transcription - Gene regulation
The promoter of a gene does not code for a protein. It is a regulatory region
of DNA usually located upstream of a gene that contains binding sites for
transcription factors. Those transcription factors can be activators (if they
activate the transcription) or repressors (if they repress the transcription).
Such control is crucial for the regulation of the gene. This is why the
binding sites are often referred to as regulatory sequences.
Control of transcription - Gene regulation
Transcription is tightly controlled by specific proteins called transcription
factors. These factors bind DNA sequences in the promoters of genes and
interact with the RNA polymerase. They can either activate transcription
(activators) or repress it (repressors).
Colored lines are binding sites: DNA sequence patterns.
Blobs are factors (proteins) that recognize binding sites.
Control of transcription - Gene regulation
Transcription factors recognise and bind specific DNA patterns (motifs)
called binding sites. Such sites are also called regulatory sequences (or
regulatory elements).
For example, the transcription factor Pho4p
specifically recognises the sequence
CACGTG in the promoters of genes in yeast.
The cAMP Receptor Protein (CRP)
specifically recognises the pattern
TGTGA-N6-TCACA (N6: any six nucleotides).
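Locating known motifs such as these two in a sequence can be sketched with regular expressions. The helper `find_sites` and the example sequences are hypothetical and only serve as illustration. The sketch scans the given strand only, and a lookahead is used so that overlapping occurrences are also reported.

```python
import re

# Motifs from the examples above, written as regular expressions.
PHO4_MOTIF = "CACGTG"
CRP_MOTIF = "TGTGA[ACGT]{6}TCACA"  # N6 = any six nucleotides


def find_sites(motif_regex, sequence):
    """Return (position, matched site) pairs, 0-based, on the given strand."""
    pattern = re.compile(f"(?=({motif_regex}))")  # lookahead: allows overlaps
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Made-up sequences, for illustration only.
print(find_sites(PHO4_MOTIF, "ttCACGTGaa"))           # [(2, 'CACGTG')]
print(find_sites(CRP_MOTIF, "AATGTGAGTTAGCTCACATT"))  # [(2, 'TGTGAGTTAGCTCACA')]
```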
Exercise: assuming equal probabilities for each nucleotide, calculate the probability of
finding the sequence CACGTG at a particular position. Assuming that the genome is
6,000,000 bp long, calculate the expected number of occurrences of this pattern.
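One way to check the arithmetic of this exercise is the short sketch below. It assumes independent positions, a probability of 1/4 for each nucleotide, and a single strand. The function names are hypothetical.

```python
def site_probability(motif, base_prob=0.25):
    """Probability that a given position starts an exact copy of the motif."""
    return base_prob ** len(motif)


def expected_occurrences(motif, genome_length, base_prob=0.25):
    """Expected number of occurrences on one strand of a random sequence."""
    positions = genome_length - len(motif) + 1  # possible start positions
    return positions * site_probability(motif, base_prob)


p = site_probability("CACGTG")                 # (1/4)**6 = 1/4096
n = expected_occurrences("CACGTG", 6_000_000)
print(f"p = {p:.6f}, expected occurrences = {n:.0f}")
# p = 0.000244, expected occurrences = 1465
```

Since CACGTG is identical to its own reverse complement, every occurrence on one strand is also an occurrence on the other, so counting both strands would find the same sites.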
Challenge of the bioinformatician
One challenge for the bioinformatician will be to predict the regulation
of genes on the basis of regulatory elements found in their
promoter.
Two questions can be addressed by the bioinformatician:
(1) You already know a regulatory element and you want to find the
genes whose promoter has this regulatory element. This is referred to as
pattern matching.
(2) You do not know the regulatory element, but you have, for example, a
set of genes which are co-regulated. You can then search for a regulatory
element that they share. This approach is called pattern discovery.
The difficulty will be to distinguish real patterns from patterns occurring by
chance, and to estimate the probability that a pattern found is indeed a
binding sequence.
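A very naive form of pattern discovery can be sketched as follows: list the words of length k that occur in every promoter of a co-regulated set. The function names and the sequences are hypothetical. The sketch does not assess whether a shared word is statistically surprising, which is precisely the difficulty mentioned above.

```python
from collections import Counter


def kmer_sequence_counts(sequences, k=6):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)  # a set, so each sequence counts once per k-mer
    return counts


def shared_kmers(sequences, k=6, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the sequences."""
    counts = kmer_sequence_counts(sequences, k)
    threshold = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in counts.items() if n >= threshold)


# Made-up "promoters" of co-regulated genes, for illustration only.
promoters = ["AACACGTGTT", "GGCACGTGCC", "TTTCACGTGA"]
print(shared_kmers(promoters, k=6))  # ['CACGTG']
```

With real promoters of several hundred base pairs, many short words are shared by chance, so the observed counts would need to be compared with what a background model predicts.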
DNA and evolution
Organisms are linked together in
evolutionary history, all having evolved
from one or a very few ancient
ancestral life forms. This process of
evolution, still in action, involves
changes in the genome that are
passed to subsequent generations.
These changes can alter the protein
and RNA molecules encoded, and
thus change the organism, making its
survival more (or less) likely in the
circumstances in which it lives. In this
way the forces of evolution are
inextricably linked to the genomic DNA
molecules.
Challenge for the bioinformatician
Finding out the evolutionary links between genes/genomes would
greatly help to refine the tree of life. This is the major goal of
phylogenetics.
Conversely, when two organisms are known to be closely related
from an evolutionary perspective, comparing them might help to
predict the function of an unknown gene/protein in one organism when its
homolog has been characterized in the other. This approach
is called comparative genomics.
Tree of life
Because organisms are evolutionarily
related, many things can be inferred by
comparing genes and genomes...
Tree of life
... but not everything!
Human intervention
While computer-based analysis has the
benefit of being easily carried out (large
memory, fast computation) in an objective way,
it cannot guarantee biologically
relevant results.
Manual checking (i.e. interpreting the results
using biological knowledge) remains essential!
Ultimately, only experiments will
validate (or not) the bioinformatic
predictions. Bioinformatic predictions
can be used to reduce the number
of possibilities.
References
- Zvelebil and Baum (2007) Understanding
Bioinformatics, Garland Science.
- Mount DW (2004) Bioinformatics: Sequence
and Genome Analysis, Cold Spring Harbor
Laboratory Press, New York.
- Alberts, Bray, Lewis, Raff, Roberts, Watson
(2002) Molecular Biology of the Cell,
Garland Science.
- Lewin B (1997) Genes VI, Oxford Univ Press