CENTRE FOR INTEGRATIVE BIOINFORMATICS VU
Introduction to bioinformatics
Lecture 2
Genes and Genomes
Organisational
• Course website: http://ibi.vu.nl/teaching/mnw_2year/mnw2_2008.php
or click on http://ibi.vu.nl (>teaching >Introduction to Bioinformatics)
• Course book:
– Bioinformatics and Molecular Evolution by Paul G. Higgs and Teresa K. Attwood (Blackwell Publishing), 2005, ISBN (Pbk) 1-4051-0683-2
– Essential Bioinformatics by Jin Xiong, Cambridge University Press, 2006, ISBN 0521840988
• Lots of information about Bioinformatics can be found on the web.
DNA sequence
.....acctc
tcccagatgg
ggcccaggac
ttggtgacac
caaatcttgt
gagcccaaat
gcccagagcc
ccggtgccca
ttcctcttcc
cccggacccc
ccacgaagac
ggcgtggagg
agcagtacaa
cgtcctgcac
tgcaaggtct
cctggtcaaa
tgggagagca
cgcctcccat
cagcaagctc
aacatcttct
accgctacac
ctgtgcaaga
gtcctgtccc
tggggaagcc
aactcacaca
gacacacctc
cttgtgacac
caaatcttgt
gcacctgaac
ccccaaaacc
tgaggtcacg
ccnnnngtcc
tgcataatgc
cagcacgttc
caggactggc
ccaacaaagc
ggcttctacc
atgggcagcc
gctggactcc
accgtggaca
catgctccgt
gcagaagagc
acatgaaaca
aggtgcacct
tccagagctc
tgcccacggt
ccccgtgccc
acctccccca
gacacacctc
tcttgggagg
caaggatacc
tgcgtggtgg
agttcaagtg
caagacaaag
cgtgtggtca
tgaacggcaa
aaccaagtca
ccagcgacat
ggagaacaac
gacggctcct
agagcaggtg
gatgcatgag
ctctc.....
nctgtggttc
gcaggagtcg
aaaaccccac
gcccagagcc
acggtgccca
tgcccacggt
ccccgtgccc
accgtcagtc
cttatgattt
tggacgtgag
gtacgtggac
ctgcgggagg
gcgtcctcac
ggagtacaag
gcctgacctg
cgccgtggag
tacaacacca
tcttcctcta
gcagcagggg
gctctgcaca
Genome size
Organism                  Number of base pairs
φX-174 virus              5,386
Epstein-Barr virus        172,282
Mycoplasma genitalium     580,000
Haemophilus influenzae    1.8 × 10^6
Yeast (S. cerevisiae)     12.1 × 10^6
Human                     3.2 × 10^9
Wheat                     16 × 10^9
Lilium longiflorum        90 × 10^9
Salamander                100 × 10^9
Amoeba dubia              670 × 10^9
Four DNA nucleotide building
blocks
G-C is more strongly hydrogen-bonded than A-T
A gene codes for a protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
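The DNA → mRNA → protein example above can be reproduced in a few lines. This is a minimal sketch, assuming the DNA shown is the coding strand (so transcription amounts to replacing T with U) and using the standard genetic code; the function names `transcribe` and `translate` are illustrative, not part of the lecture.

```python
# Standard genetic code, built compactly with the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA (assumption: simply replace T with U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate codon by codon from position 0; '*' marks a stop codon."""
    dna = mrna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

Codons containing characters other than A, C, G, T/U are rendered as `X` here; that is a design choice of this sketch, not a convention taken from the slides.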
Central Dogma of
Molecular Biology
Replication
DNA
Transcription
mRNA
Translation
Protein
Transcription is carried out by RNA polymerase (II)
Translation is performed on ribosomes
Replication is carried out by DNA polymerase
Reverse transcriptase copies RNA into DNA
Transcription + Translation = Expression
But DNA can also be transcribed into non-coding RNA …
tRNA (transfer): transfers amino acids to the ribosome during protein synthesis.
rRNA (ribosomal): essential component of the ribosomes (in complex with rProteins).
snRNA (small nuclear): mainly involved in RNA splicing (removal of introns); found in snRNPs.
snoRNA (small nucleolar): involved in chemical modifications of ribosomal RNAs and other RNAs; found in snoRNPs.
SRP RNA (signal recognition particle): forms an RNA-protein complex involved in protein secretion.
Further: microRNA, eRNA, gRNA, tmRNA etc.
Eukaryotes have spliced genes …





Promoter: involved in transcription initiation (TF/RNApol-binding sites)
TSS: transcription start site
UTRs: un-translated regions (important for translational control)
Exons will be spliced together by removal of the Introns
Poly-adenylation site important for transcription termination
(but also: mRNA stability, export mRNA from nucleus etc.)
DNA makes mRNA makes Protein
DNA makes RNA makes Protein
… yet another picture to appreciate the above statement
Transcription Factors
Transcription Factor (TF): protein that binds to DNA and to a polymerase (Pol II)
Polymerase: complex protein that transcribes DNA into mRNA
[Figure: a TF bound to its TF binding site (TFBS) upstream of the TATA box; Pol II transcription produces mRNA]
Transcription factor-polymerase interaction sets off gene transcription…
[Figure: TF binding site (closed) versus TF binding site (open), TATA box, mRNA transcription]
Nucleosomes (chromatin structures composed of histones) are structures round which DNA coils. This blocks access of TFs.
… many TFBSs are possible upstream of a gene
Some facts about human genes
• There are about 20,000-25,000 genes in the human genome (~3% of the genome)
• Average gene length is ~8,000 bp
• Average of 5-6 exons per gene
• Average exon length is ~200 bp
• Average intron length is ~2,000 bp
• 8% of the genes have a single exon
• Some exons can be as small as 1 or 3 bp
DMD: the largest known human gene
• The largest known human gene is DMD, which stands for “Dystrophin (muscular dystrophy, Duchenne and Becker types)”
• The gene encodes the protein dystrophin: the gene’s size is ~2.4 million bp over 79 exons
• X-linked recessive disease (affects boys)
• Two variants: Duchenne-type (DMD) and Becker-type (BMD)
• Duchenne-type: more severe, frameshift mutations
• Becker-type: milder phenotype, “in-frame” mutations
Posture changes during progression of Duchenne muscular dystrophy
Nucleic acid basics
• Nucleic acids are polymers
[Figure labels: nucleotide, nucleoside]
• Each monomer consists of 3 moieties (base, sugar and phosphate)
Nucleic acid basics (2)
• A base can be one of 5 rings (A, G, C, T or U)
• Purines and pyrimidines can base-pair (Watson-Crick pairs)
Watson and Crick, 1953
Nucleic acid as hetero-polymers
• Nucleosides, nucleotides
• DNA and RNA strands
[Figure labels: ribose sugar (RNA precursor); 2’-deoxy ribose sugar (DNA precursor); 2’-deoxy thymidine triphosphate (nucleotide)]
REMEMBER:
• DNA = deoxyribonucleotides; RNA = ribonucleotides (OH-groups at the 2’ position)
• Note the directionality of DNA (5’-3’ & 3’-5’) or RNA (5’-3’)
• DNA = A, G, C, T ; RNA = A, G, C, U
So …
DNA
RNA
Stability of base-pairing
• C-G base pairing is more stable than A-T (A-U) base pairing (why?)
• 3rd codon position has freedom to evolve (synonymous mutations)
• Species can therefore optimise their G-C content (e.g. thermophiles are GC-rich) (consequences for codon use?)
Thermocrinis ruber, heat-loving bacteria
Amino acid       Single-letter code   DNA codons
Isoleucine       I                    ATT, ATC, ATA
Leucine          L                    CTT, CTC, CTA, CTG, TTA, TTG
Valine           V                    GTT, GTC, GTA, GTG
Phenylalanine    F                    TTT, TTC
Methionine       M (Start)            ATG
Cysteine         C                    TGT, TGC
Alanine          A                    GCT, GCC, GCA, GCG
Glycine          G                    GGT, GGC, GGA, GGG
Proline          P                    CCT, CCC, CCA, CCG
Threonine        T                    ACT, ACC, ACA, ACG
Serine           S                    TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine         Y                    TAT, TAC
Tryptophan       W                    TGG
Glutamine        Q                    CAA, CAG
Asparagine       N                    AAT, AAC
Histidine        H                    CAT, CAC
Glutamic acid    E                    GAA, GAG
Aspartic acid    D                    GAT, GAC
Lysine           K                    AAA, AAG
Arginine         R                    CGT, CGC, CGA, CGG, AGA, AGG
Stop codons      Stop                 TAA, TAG, TGA
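The codon table can be checked for completeness, and the claim that the 3rd codon position has the most freedom to evolve can be made concrete by counting synonymous single-base changes per position. The sketch below encodes the table as listed above; the helper `synonymous_fraction` is an illustrative name, and counting stop-to-stop changes as synonymous is an assumption of this sketch.

```python
# The codon table from the slide, keyed by single-letter code ('*' = stop).
CODONS_BY_AMINO_ACID = {
    "I": "ATT ATC ATA",
    "L": "CTT CTC CTA CTG TTA TTG",
    "V": "GTT GTC GTA GTG",
    "F": "TTT TTC",
    "M": "ATG",
    "C": "TGT TGC",
    "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG",
    "P": "CCT CCC CCA CCG",
    "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC",
    "Y": "TAT TAC",
    "W": "TGG",
    "Q": "CAA CAG",
    "N": "AAT AAC",
    "H": "CAT CAC",
    "E": "GAA GAG",
    "D": "GAT GAC",
    "K": "AAA AAG",
    "R": "CGT CGC CGA CGG AGA AGG",
    "*": "TAA TAG TGA",
}

# Invert the table: codon -> amino acid.
CODON_TO_AA = {
    codon: aa
    for aa, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def synonymous_fraction(position):
    """Fraction of single-base changes at `position` (0, 1 or 2) that
    leave the encoded amino acid (or stop signal) unchanged."""
    same = total = 0
    for codon, aa in CODON_TO_AA.items():
        for base in "ACGT":
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TO_AA[mutant] == aa
    return same / total


print(len(CODON_TO_AA))  # 64: every possible triplet is covered
for position in range(3):
    print(position + 1, round(synonymous_fraction(position), 3))
```

The third position gives by far the largest synonymous fraction, which is what allows G-C content to drift without changing the protein.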
DNA compositional biases
• Base compositions of genomes: G+C (and therefore also A+T) content varies between different genomes
• The GC-content is sometimes used to classify organisms in taxonomy
• High G+C content bacteria: Actinobacteria, e.g. in Streptomyces coelicolor it is 72%
• Low G+C content: Plasmodium falciparum (~20%)
• Other examples:
Organism                            G+C content
Saccharomyces cerevisiae (yeast)    38%
Arabidopsis thaliana (plant)        36%
Escherichia coli (bacteria)         50%
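G+C content is simply the percentage of G and C among the bases of a sequence. A minimal sketch follows; the function name `gc_content` and the choice to skip ambiguous bases such as `n` are assumptions made for illustration.

```python
def gc_content(sequence):
    """Return the G+C percentage of a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted, so runs of 'n'
    do not distort the result. Returns 0.0 for a sequence without
    any countable base.
    """
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


# A 10-mer from the DNA sequence slide that contains ambiguous bases.
print(round(gc_content("ccnnnngtcc"), 1))
# The top strand of the double-stranded example used later in the lecture.
print(round(gc_content("attcgttggcaaatcgcccctatccggc"), 1))
```

On a whole genome the same calculation is usually done in windows, because G+C content also varies along a chromosome.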
Genetic diseases: cystic fibrosis
• Known since very early on (“Celtic gene”)
• Autosomal, recessive, hereditary disease (Chr. 7)
• Symptoms:
– Exocrine glands (which produce sweat and mucus)
– Abnormal secretions
– Respiratory problems
– Reduced fertility and (male) anatomical anomalies
[Figure labels: 3,000; 30,000; 20,000]
cystic fibrosis (2)
• Gene product: CFTR (cystic fibrosis transmembrane conductance regulator)
• CFTR is an ABC (ATP-binding cassette) transporter or traffic ATPase.
• These proteins transport molecules such as sugars, peptides, inorganic phosphate, chloride, and metal cations across the cellular membrane.
• CFTR transports chloride ions (Cl-) across the membranes of cells in the lungs, liver, pancreas, digestive tract, reproductive tract, and skin.
cystic fibrosis (3)
• The CF gene CFTR has a 3-bp deletion leading to ΔF508 (deletion of Phe508) in the 1480-aa protein (epithelial Cl- channel)
• The protein is degraded in the endoplasmic reticulum (ER) instead of being inserted into the cell membrane
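Why a 3-bp deletion removes a single amino acid while a 1-bp deletion scrambles everything downstream (the in-frame versus frameshift distinction also seen in Becker- and Duchenne-type dystrophy) can be shown on a toy sequence. The sequence below is an assumption chosen for illustration, not the CFTR reference sequence, and `translate` is a hypothetical helper using the standard genetic code.

```python
# Standard genetic code, built compactly with the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate from position 0; a trailing partial codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete(dna, start, length):
    """Remove `length` bases beginning at 0-based index `start`."""
    return dna[:start] + dna[start + length:]


wild_type = "ATGATCATCTTTGGTGTTTAA"     # toy gene (assumed): M I I F G V stop
in_frame = delete(wild_type, 8, 3)      # 3-bp deletion spanning two codons
frameshift = delete(wild_type, 8, 1)    # 1-bp deletion at the same place

print(translate(wild_type))   # MIIFGV*
print(translate(in_frame))    # MIIGV*  (only F is lost)
print(translate(frameshift))  # MIILVF  (downstream residues change, stop is lost)
```

In the in-frame case the reading frame is restored immediately after the deletion, so the rest of the protein is unchanged; in the frameshift case every downstream codon is read differently.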
Diagram depicting the five domains of the
CFTR membrane protein (Sheppard 1999).
Theoretical Model of NBD1. PDB
identifier 1NBD as viewed in Protein
Explorer http://proteinexplorer.org
Let’s return to DNA and RNA structure …
• Unlike the three-dimensional structures of proteins, DNA molecules assume simple double-helical structures independent of their sequences.
• Three kinds of double helix have been observed in DNA: type A, type B, and type Z, which differ in their geometries.
• RNA, on the other hand, can have structures as diverse as those of proteins, as well as a simple double helix of type A.
• The ability to be both informational and structurally diverse suggests that RNA was the prebiotic molecule that could function in both replication and catalysis (the RNA World Hypothesis).
• In fact, some viruses store their genetic material as RNA (e.g. retroviruses).
Three dimensional structures of
double helices
Side view: A-DNA, B-DNA, Z-DNA
Space-filling models of A, B and Z- DNA
Top view: A-DNA, B-DNA, Z-DNA
Major and minor grooves
Forces that stabilize nucleic acid double helix
• There are two major forces that contribute to stability of helix formation:
– Hydrogen bonding in base-pairing
– Hydrophobic interactions in base stacking
[Figure: antiparallel strands (5’-3’ and 3’-5’) showing same-strand stacking and cross-strand stacking]
Types of DNA double helix
Type     Occurrence                                        Handedness           Shape
Type A   major conformation RNA, minor conformation DNA    Right-handed helix   Short and broad
Type B   major conformation DNA                            Right-handed helix   Long and thin
Type Z   minor conformation DNA                            Left-handed helix    Longer and thinner
Secondary structures of Nucleic acids
• DNA is primarily in duplex form
• RNA is normally single-stranded and can adopt diverse secondary structures other than a duplex
• RNA can form duplexes by folding back onto itself
• DNA duplex is mostly in the B-form, RNA duplex regions in the A-form
• DNA is more stable than RNA
Non B-DNA Secondary structures
• Cruciform DNA
• Slipped DNA
• Triple helical DNA
Hoogsteen basepairs
Source: Van Dongen et al. (1999), Nature Structural Biology 6, 854-859
RNA Secondary structures
• RNA pseudoknots
• Cloverleaf rRNA structure
16S rRNA Secondary Structure Based on Phylogenetic Data
Source: Cornelis W. A. Pleij in Gesteland, R. F. and Atkins, J. F. (1993) THE RNA WORLD. Cold Spring Harbor Laboratory Press.
3D structures of RNA: transfer-RNA structures
• Secondary structure of tRNA (cloverleaf)
• Tertiary structure of tRNA
3D structures of RNA: ribosomal-RNA structures
• Secondary structure of large rRNA (16S)
• Tertiary structure of large rRNA subunit
Ban et al., Science 289 (905-920), 2000
3D structures of RNA: Catalytic RNA
• Secondary structure of self-splicing RNA
• Tertiary structure of self-splicing RNA
Some structural rules …
• Base-pairing is stabilizing
• Un-paired sections (loops) destabilize
• 3D conformation with interactions makes up for this
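The first rule (base-pairing is stabilizing) is the idea behind the simplest RNA secondary-structure predictors. The sketch below uses a Nussinov-style dynamic programme, which is not taken from the lecture: it only maximises the number of nested base pairs and ignores loop penalties, stacking energies and pseudoknots. Allowing G-U wobble pairs and requiring at least 3 unpaired bases in a hairpin loop are assumptions of this sketch.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (wobble is an assumption here).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in `rna` (Nussinov-style).

    best[i][j] holds the maximum number of pairs within rna[i..j].
    A pair (k, j) is allowed only if at least `min_loop` bases lie
    between k and j.
    """
    seq = rna.upper()
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


# A hairpin: three stem pairs closing an AAA loop.
print(max_base_pairs("GGGAAAUCC"))  # 3
```

Real folding programs replace the pair count with experimentally derived free energies, which is where the destabilising effect of loops enters.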
Three main principles
• DNA makes RNA makes Protein
• Structure more conserved than sequence
• Sequence → Structure → Function
How to go from DNA to protein sequence
A piece of double stranded DNA:
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
DNA direction is from 5’ to 3’
How to go from DNA to protein sequence
6-frame conceptual translation using the codon table:
5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’
So, there are six possibilities to make a protein from an unknown
piece of DNA, only one of which might be a natural protein
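The six possibilities are the three reading frames of the top strand plus the three reading frames of the bottom strand, each read 5’ to 3’. A minimal sketch of this conceptual translation, using the standard genetic code; the function names and the frame labels `+1` … `-3` are illustrative choices.

```python
# Standard genetic code, built compactly with the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translation(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to conceptual proteins."""
    frames = {}
    strands = (("+", dna.upper()), ("-", reverse_complement(dna)))
    for sign, strand in strands:
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


top = "attcgttggcaaatcgcccctatccggc"
for label, protein in six_frame_translation(top).items():
    print(label, protein)
```

Frame `+1` of the example reads ATT CGT TGG CAA ATC GCC CCT ATC CGG and therefore gives IRWQIAPIR; a `*` in any frame marks a stop codon.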
Remark
• Identifying (annotating) human genes, i.e. finding what
they are and what they do, is a difficult problem
– First, the gene should be delineated on the genome
• Gene finding methods should be able to tell a gene region from a non-gene region
• Start, stop codons, further compositional differences
– Then, a putative function should be found for the gene located
Evolution and three-dimensional protein structure information
Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)
Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000
Genomic Data Sources
• DNA/protein sequence
• Expression (microarray)
• Proteome (X-ray, NMR, mass spectrometry)
• Metabolome
• Physiome (spatial, temporal)
Integrative bioinformatics
Genomic Data Sources
Vertical Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion: Integrative Bioinformatics & Genomics VU
DNA makes RNA makes Protein
(reminder)
DNA makes RNA makes Protein: Expression data
• More copies of mRNA for a gene lead to more protein
• mRNA can now be measured for all the genes in a cell at once through microarray technology
• Can have 60,000 spots (genes) on a single gene chip
• Colour change gives intensity of gene expression (over- or under-expression)
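Over- and under-expression is usually reported as a log2 ratio of the two channel intensities measured for a spot. The sketch below assumes a two-colour array (sample versus reference channel) and a two-fold threshold; the intensities, gene names and function names are made up for illustration.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """log2 of sample over reference; 0 means no change."""
    return math.log2(sample_intensity / reference_intensity)


def classify(sample_intensity, reference_intensity, threshold=1.0):
    """Label a spot; threshold=1.0 corresponds to a two-fold change."""
    ratio = log2_ratio(sample_intensity, reference_intensity)
    if ratio >= threshold:
        return "over-expressed"
    if ratio <= -threshold:
        return "under-expressed"
    return "unchanged"


# Hypothetical spot intensities: (sample channel, reference channel).
spots = {"geneA": (800.0, 200.0), "geneB": (100.0, 400.0), "geneC": (300.0, 250.0)}
for gene, (sample, reference) in spots.items():
    print(gene, round(log2_ratio(sample, reference), 2), classify(sample, reference))
```

The log scale makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero, which a raw ratio does not.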
Proteomics
• Elucidating all 3D structures of proteins in
the cell
• This is also called Structural Genomics
• Finding out what these proteins do
• This is also called Functional Genomics
Protein-protein interaction networks
Metabolic networks
Glycolysis and Gluconeogenesis, KEGG database (Japan)
High-throughput Biological Data
• Enormous amounts of biological data are
being generated by high-throughput
capabilities; even more are coming
– genomic sequences
– arrayCGH (Comparative Genomic Hybridization)
data, gene expression data
– mass spectrometry data
– protein-protein interaction data
– protein structures
– ......
Protein structural data explosion
Protein Data Bank (PDB): 14500 Structures (6 March 2001)
10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...
Bioinformatics databases grow exponentially…
Dickerson’s formula: equivalent to Moore’s law
n = e^{0.19(y-1960)}, with y the year.
Dickerson predicted that the Protein Data Bank (PDB) of protein three-dimensional structures would grow, starting with the first protein in 1960, as indicated by the above exponential growth function.
On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%, not a bad prediction)!
Sequence versus structural data
• Structural genomics initiatives are now in full swing and growth is still exponential.
• However, growth of sequence data is even more rapid. There are now more than 600 completely sequenced genomes publicly available.
Increasing gap between structural and
sequence data (“Mind the gap”)
Bioinformatics
• Offers an ever more essential input to
– Molecular Biology
– Pharmacology (drug design)
– Agriculture
– Biotechnology
– Clinical medicine
– Anthropology
– Forensic science
– Chemical industries (detergent industries, etc.)
This list is from a molecular biology textbook, so it is not a self-absorbed bioinformatician saying this…