Introductory genetics for veterinary students
and the animal sciences.
Michel Georges
Chapter I: Genes in cells
“Pose ton esprit, Michel!” (“Settle your mind, Michel!”) Catherine Georges-Pirenne, my grandmother.
The haploid mammalian genome consists of a
double-helix of 3x10^9 base pairs of DNA
The multivolume encyclopedia which we inherit
from our parents via the gametes consists of a series
of DNA double-helices totaling ∼3x10^9 base pairs.
The building blocks of deoxyribonucleic acid (DNA)
are the four nucleotides: deoxyadenosine
monophosphate (dAMP), deoxyguanosine
monophosphate (dGMP), deoxythymidine
monophosphate (dTMP) and deoxycytidine
monophosphate (dCMP). Nucleotides comprise
three elements: a purine (adenine or guanine) or
pyrimidine (cytosine or thymine) base, covalently
linked to the 1’ carbon of the pentose
β-D-2-deoxyribose, carrying one (dNMP), two (dNDP) or
three (dNTP) phosphate groups on its 5’ carbon.
Cells also contain nucleotides in which the pentose
is ribose rather than deoxyribose. These are
ATP/ADP/AMP, GTP/GDP/GMP, CTP/CDP/CMP and
UTP/UDP/UMP. The latter is characterized by the
pyrimidine base uracil. These four ribonucleotides
are the building blocks of ribonucleic acid (RNA)
molecules.
Deoxynucleotides assemble into polynucleotide
chains. In these, a phosphodiester bond connects
the 3’ carbon of the previous with the 5’ carbon of
the following nucleotide. Thus, polynucleotide
chains are “polar” with a distinct 5’ and 3’
extremity.
A DNA double-helix is formed by the anti-parallel
juxtaposition (or “hybridization”) of two
complementary polynucleotide chains. In these,
the sugar-phosphate backbones are like the stiles
of a ladder while the bases face each other and
form the rungs. Polynucleotide chains are
complementary if A’s on one chain pair with T’s on
the other, while G’s pair with C’s. This constraint
explains the observation that in DNA G’s and C’s, as
well as A’s and T’s, are always present in equimolar
amounts (Chargaff’s rule), although the ratio
of [G+C]/[A+T] may vary depending on the
organism. DNA double-helices are stabilized by
hydrogen bonds that can form between the
complementary bases: two for A-T pairs and three
for G-C pairs. Thus DNA double-helices rich in G’s
and C’s are in general more stable than double-helices
rich in A’s and T’s. The base-pairs (bp) are
planar and oriented perpendicularly to the stiles,
like stairs.
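The pairing rules and Chargaff's rule can be illustrated with a minimal Python sketch. The function names and the ten-base example sequence are hypothetical, chosen only for illustration.

```python
# Watson-Crick pairing rules from the text: A pairs with T, G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand_5_to_3: str) -> str:
    """Return the complementary chain, also written 5' to 3'.

    The two chains are anti-parallel, so the complement is read in reverse.
    """
    return "".join(PAIRING[base] for base in reversed(strand_5_to_3))


def duplex_base_counts(strand_5_to_3: str) -> dict[str, int]:
    """Count each base over both chains of the double-helix."""
    both = strand_5_to_3 + complementary_strand(strand_5_to_3)
    return {base: both.count(base) for base in "ACGT"}


def gc_ratio(strand_5_to_3: str) -> float:
    """[G+C]/[A+T] ratio of the duplex, which varies between organisms."""
    counts = duplex_base_counts(strand_5_to_3)
    return (counts["G"] + counts["C"]) / (counts["A"] + counts["T"])


example = "ATGCGGTTAC"  # hypothetical sequence
counts = duplex_base_counts(example)
print(complementary_strand(example), counts, gc_ratio(example))
```

Whatever sequence is supplied, the duplex counts of A and T are equal, as are those of G and C, while the [G+C]/[A+T] ratio depends on the sequence.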
Note that the chemical structure of the bases
shown in textbooks corresponds to the most stable
isomer, in which they therefore spend most of their
time. However, all four bases can also adopt
distinct “tautomeric” conformations. At any time,
a proportion of the molecules will explore these
less stable conformations. The A-T and G-C base
pairing rules do not apply to these tautomeric
forms. Hence, a rare imino form of C will
preferentially bind with A, while a rare enol form of
thymine will prefer binding to G. Such tautomeric
shifts may cause errors during DNA replication.
Rather than being flat, the DNA ladder will
spontaneously adopt a helical structure: the
“double-helix”, in which one polynucleotide chain
(or strand) is sometimes referred to as Crick and
the other as Watson in honor of the two famous
scientists that discovered its structure. The DNA
double-helix is a “right” helix, which means that if
you were climbing up the DNA ladder your body
would perform a clock-wise, right-handed rotation.
The helix performs one rotation every 10 base-pairs,
corresponding to 3.4 nm. Thus, 3x10^9 base pairs
amount to ∼1 m of DNA packed in each haploid
gamete, or ∼2 m in each diploid cell. Viewed from
the side the double-helix exhibits a major and a
minor groove, in which distinct chemical groups are
exposed to form sequence-specific surfaces that
can bind specific proteins.
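The length estimate follows directly from the helix geometry given above (10 base pairs per turn, 3.4 nm per turn). A minimal sketch of the arithmetic, with hypothetical constant and function names:

```python
BASE_PAIRS_PER_TURN = 10      # from the text
NM_PER_TURN = 3.4             # from the text
HAPLOID_GENOME_BP = 3e9       # approximate mammalian haploid genome


def dna_length_m(n_base_pairs: float) -> float:
    """Contour length in metres of a double-helix of n_base_pairs."""
    nm_per_bp = NM_PER_TURN / BASE_PAIRS_PER_TURN  # 0.34 nm per base pair
    return n_base_pairs * nm_per_bp * 1e-9         # nm -> m


haploid_m = dna_length_m(HAPLOID_GENOME_BP)        # about 1 m
diploid_m = dna_length_m(2 * HAPLOID_GENOME_BP)    # about 2 m
print(f"haploid: {haploid_m:.2f} m, diploid: {diploid_m:.2f} m")
```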
While the DNA helix is in general right-handed,
specific sequences may in particular conditions
locally adopt a left handed helix configuration.
This will impact the number of helix turns per unit
length, which will itself impact the degree of DNA
supercoiling.
While the basic building blocks of DNA are dAMP,
dGMP, dCMP and dTMP as described before, these
may undergo chemical modifications after their
incorporation in a double-helix. The most
common DNA modification
encountered in the vertebrate genome is the
methylation of cytosines at position 5 of the
pyrimidine ring. Methylated cytosines are nearly
exclusively observed on both strands of the 5’-CpG-3’
dinucleotide palindrome (i.e. the complementary
sequence reads the same).
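What "palindrome" means here can be checked with a short sketch: a sequence is palindromic in this sense when its reverse complement, read 5' to 3', is identical to itself. The helper names and the example sequence are hypothetical.

```python
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(PAIRING[base] for base in reversed(seq))


def is_palindromic(seq: str) -> bool:
    """True if both strands read the same in the 5' to 3' direction."""
    return reverse_complement(seq) == seq


def cpg_positions(seq: str) -> list[int]:
    """0-based positions of the C in every 5'-CG-3' dinucleotide."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


top = "ACGTCGA"  # hypothetical sequence
bottom = reverse_complement(top)
print(is_palindromic("CG"), cpg_positions(top), cpg_positions(bottom))
```

Because CG is its own reverse complement, every CpG on one strand faces a CpG on the other, which is why a cytosine can be methylated on both strands at the same site.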
Crick and Watson immediately recognized (“... it
didn’t escape our attention ...”) that the structure
of the double-helix was compatible with its
function as bearer of the genetic material. First,
the two strands of the double-helix carry
redundant information. Indeed, if you know the
nucleotide sequence of the first you can
unambiguously determine the sequence of the
second based on the complementarity rules. This
feature, they recognized, might be intimately
connected with the mechanisms of DNA replication
needed to double the amount of DNA that could
then be partitioned equitably between daughter
cells. The two DNA strands might each
independently serve as a template for the
formation of two novel DNA double-helices
identical to the parental one. Moreover, they
observed that, while G’s on one strand always face
C’s on the other, and likewise A’s face T’s, the
structure of the double-helix doesn’t impose any
constraints on which base-pairs succeed
each other on the ladder: the structure of the
double-helix is essentially the same whatever the
base-pair sequence. Thus a message can be
encrypted in the sequence in exactly the same way
as successions of characters form words and
sentences in Morse or our written language.
The mammalian genome is distributed over a
species-specific number of chromosomes
With the exception of some viruses, whose genome
is composed of RNA, all living organisms have a
genome composed of DNA. The length of the
genome, however, varies widely. It is of
the order of 10^6 bp in bacteria (e.g. E. coli), 10^7 bp
in unicellular eucaryotes (e.g. the yeast S.
cerevisiae), 10^8 bp in non-vertebrate animals (e.g. the
fruitfly D. melanogaster or nematode C. elegans),
and 10^9 bp in mammals (Table 1). In general, genome
size thus increases with organismal complexity
(although this notion is likely to be very
anthropocentric). Nevertheless, some closely
related organisms (notably amongst plants or
amphibians) may differ considerably in the size of
their genome (C-value paradox).
The 3x10^9 base pair long double-helix in our
gametes doesn’t come as a single molecule. Our
DNA is subdivided over several molecules each
corresponding to one chromosome, just as
encyclopediae come in different volumes. Hence,
the human genome is subdivided over 23
chromosomes: 22 autosomes numbered by
decreasing size, plus one of the sex chromosomes
(X or Y). The zygote thus contains two sets of 23
chromosomes, one originating from the father and
the other from the mother, and so will each one of
the descendant diploid somatic cells.
The number of chromosomes is constant within a
given species, but may extensively vary between
species. Table 2 shows the number of
chromosomes characterizing the genome of the
most important domestic species. There is no
correlation between the number of chromosomes
and the length of the entire genome. All mammals,
including human, have genomes of approximately
the same size (∼3x10^9 bp), but chromosome
numbers may vary widely, even between closely
related species.
Chromosome means “colored body” in Greek.
The name refers to the rods that cytogeneticists
have been able to visualize under the microscope
since the beginning of the 20th century when
examining cell preparations. A fraction of the cells
in the preparation (corresponding to the “mitotic
index”) will be in mitosis when sampled for
analysis. In these, the DNA is found in a highly
condensed state, forming the rods seen under the
microscope. By applying specific staining
procedures, cytogeneticists were able to elicit
banding patterns that allowed them, in
combination with size and position of a
constriction called the centromere, to distinguish
the different chromosomes. For some
chromosomes, referred to as metacentric, the
centromere has an approximate central position,
and hence the short (p) and long (q) arm have
comparable size. On other chromosomes, referred
to as acrocentric, the centromere is close to one
end of the chromosome and hence the p-arm is
much smaller than the q arm. By doing so,
cytogeneticists were able to define the
“caryotypes” of species or individuals within
species. Figure X shows the ordered caryotype of a
man with Down syndrome, characterized by the
trisomy (three copies instead of two) of
chromosome 21 underlying the condition.
Chromosomal DNA is located in the nucleus
complexed with histones
The majority of the cells in a tissue are in
interphase rather than mitosis. At this stage, the
chromosomal DNA is less condensed than in
mitosis and resides within the nucleus. Remember
that every one of our cells contains ∼2 m of DNA,
crammed within a ∼20 μm diameter nucleus. How
cells address this extraordinary organizational
challenge remains a mystery. What is known is
that the DNA in the nucleus is not naked but rather
complexed with proteins, forming so-called
chromatin. The nuclear DNA is wrapped around
beads formed by histones. Each histone bead
comprises eight histone proteins (2xH2A, 2xH2B,
2xH3 and 2xH4). Histone proteins are relatively
small proteins that show extreme evolutionary
conservation: only two amino acids out of 102
differ between the H4 protein of cow and pea. This
indicates that nearly each one of the amino acids is
needed for the histone molecule to properly
function. Approximately one fifth of the amino acids
are positively charged lysines or arginines,
facilitating interaction with the negative charges of
the phosphate groups in the stiles of the double
helix. ∼150 base pairs of DNA make ∼two turns
around the histone core, jointly forming a
nucleosome. Adjacent nucleosomes are connected
by ∼50 bp of linker DNA.
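These figures imply a repeat of roughly 200 bp per nucleosome, which gives an order-of-magnitude estimate of the number of nucleosomes per genome. The sketch below is a back-of-the-envelope calculation under that assumption (uniform spacing over the whole genome), not a measured value; the names are hypothetical.

```python
CORE_DNA_BP = 150        # DNA wrapped around the histone core (from the text)
LINKER_DNA_BP = 50       # linker between adjacent nucleosomes (from the text)
HAPLOID_GENOME_BP = 3_000_000_000


def nucleosome_count(genome_bp: int) -> int:
    """Approximate number of nucleosomes, assuming uniform spacing."""
    return genome_bp // (CORE_DNA_BP + LINKER_DNA_BP)


per_haploid = nucleosome_count(HAPLOID_GENOME_BP)        # 15 million
per_diploid = nucleosome_count(2 * HAPLOID_GENOME_BP)    # 30 million
histones_per_diploid = 8 * per_diploid                   # 8 core histones per bead
print(per_haploid, per_diploid, histones_per_diploid)
```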
This string of nucleosomes forms a tighter structure
called the 30-nm fiber. A fifth class of histones,
called H1, is essential for its formation. There
certainly are higher levels of organization of the
nuclear chromatin, however, these remain poorly
understood. The 30-nm fiber is thought to form
loops of several hundreds of kilobases that are
anchored onto the nuclear scaffold at their base.
Each of these loops may correspond to an
independently regulated transcriptional unit.
Nucleosomes can be seen as little hairy balls as the
amino-terminal tails of the nucleosomal histones
extrude from the surface. Specific enzymes
catalyze a myriad of covalent modifications of
these tails including phosphorylation of serines,
methylation, acetylation and ubiquitylation of
lysines, isomerisations of prolines, etc. By locally
altering the conformation of the histone tails, these
modifications (sometimes referred to as the
“histone code”) have a profound effect on
chromatin structure and functionality.
In addition to the nucleolus, examination of the
nucleus under the electron microscope typically
reveals lighter material in the center, referred to as
euchromatin, and darker material mostly in the
periphery, referred to as heterochromatin.
Heterochromatin is thought to correspond to highly
condensed, transcriptionally inactive parts of the
genome, while the active genes are thought to be
confined to the less condensed euchromatic zones.
In preparation for cell division cells duplicate their
DNA during the S phase of the cell cycle.
As mentioned before, the diploid zygote inherits
one paternal and one maternal genome. The
entire body will then develop by sequential binary
cell division and differentiation. Before dividing,
the mother cell first duplicates its genome by a
process of DNA replication which defines the
S(ynthesis) phase of the cell cycle. After a G(ap)2
phase, the duplicated genome is then partitioned
between the two daughter cells during the
M(itosis) and cytokinesis phases. The daughter
cells then resume the cell cycle with the G1 phase.
Cells may sometimes exit the cell cycle and enter
quiescence or G0.
Progression through the cell cycle is a tightly
controlled process that is orchestrated by
sequentially activated cyclin-dependent kinases
(Cdks). As their name implies, Cdks only become
active in the presence of cell-cycle stage specific
cyclins (e.g. G1/S-, S- and M-cyclins).
The double-helical structure of DNA immediately
suggested semi-conservative DNA replication to
Crick and Watson: each old strand might serve as a
template for the synthesis of a new strand, thereby
generating two identical daughter helices each
comprising one old and one new polynucleotide
chain. This hypothesis was rapidly proven correct.
Synthesis of new DNA occurs at the replication
fork, i.e. the site at which the two old strands
separate from each other exposing single stranded
chains that serve as templates for the synthesis of
the new strands. DNA dependent DNA
polymerases catalyze the addition of
complementary nucleotides at the end of a new
growing polynucleotide chain. An apparently
simple mechanism would be for one of the
polynucleotide chains to grow in the 5’ to 3’
direction, while the other would grow in the 3’ to 5’
direction, both closely following the advance of the
replication fork. However, all known DNA
dependent DNA polymerases catalyze 5’ to 3’
growth only. Phosphodiester bonds are created
between the 5’ α phosphate group of the entering
nucleotide and the 3’ hydroxyl group at the end of
the growing chain, releasing the β and γ
phosphates as a pyrophosphate moiety. As a
consequence, DNA replication proceeds in the
same direction as replication fork advancement for
one strand (the leading strand), but in the opposite
direction for the other strand (the lagging strand).
On the lagging strand, DNA replication has to be
continuously reinitiated to follow the advancement
of the replication fork. DNA dependent DNA
polymerases can extend growing polynucleotide
chains if provided with a template and nucleotide
precursors but cannot initiate DNA replication in
the absence of a primer. Only RNA polymerases
have this ability. An RNA polymerase called primase
indeed fulfills this role on the lagging strand.
The primase generates a complementary primer
composed of RNA rather than DNA. The DNA
polymerase then takes over to complete DNA
replication, filling the single stranded space
between the new and the previous RNA primer.
Thus DNA replication on the lagging strand
generates short fragments comprising an RNA
primer with a DNA extension, known as Okazaki
fragments. The RNA primer will subsequently be
removed thanks to the 5’ to 3’ exonuclease activity
of specific DNA polymerases, and adjacent Okazaki
fragments joined by a DNA ligase. DNA replication
requires many additional enzymatic activities
including a helicase which separates the two
strands of the double-helix at the replication fork,
topoisomerases or gyrases which untangle the
knots created upstream by the unwinding of the
double helix, and single stranded DNA binding
proteins which stabilize the single stranded
templates prior to synthesis. The enzymes that
coordinately fulfill all these tasks assemble in a
large complex known as the replisome.
Why would nature have selected such a
complicated asymmetric replication process
involving a leading and a lagging strand? The
answer probably lies in the need for proofreading
to control replication errors. Once in a while, the
DNA polymerase will introduce a non-complementary
nucleotide. This will for instance
happen if a nucleotide was selected while being in
an unusual tautomeric state. DNA polymerases
participating in DNA replication are endowed with
a proofreading 3’ to 5’ exonuclease activity that
allows them to excise such erroneously introduced
nucleotides. Proofreading is incompatible with 3’ to
5’ growth as excision of the last nucleotide would
eliminate the 5’ triphosphate extremity needed for
further synthesis.
S-phase starts with the activation of origins of DNA
replication. At these poorly defined sites, the two
strands are separated to form a replication bubble.
Mammalian chromosomes encompass multiple
origins of replication of which variable numbers
may be activated depending on the need to
complete the cell cycle more or less rapidly. Each
replication bubble is characterized by two
replication forks. Initiation of DNA replication
requires intervention of an RNA polymerase on
both leading and lagging strands. The cell has
developed mechanisms to ensure that origins of
replication are only fired once per cell cycle.
After DNA replication, CpG dinucleotides that were
initially methylated on both strands become hemi-methylated
as the newly synthesized strand is not.
However, maintenance DNA methyl transferases
(DNMT) recognize these hemi-methylated CpG’s
and restore full methylation by adding a methyl
group at position 5 of the cytosine ring in the newly
synthesized strand. Thanks to this mechanism, the
methylation state of CpG dinucleotides is faithfully
transmitted “epigenetically” from mother to
daughter cells.
Discontinuous replication on the lagging strand
creates a completion problem at the extremities of
the linear chromosomes. To avoid shortening of
the chromosomes at each cell division,
chromosomal extremities are endowed with
specific structures called telomeres. Telomeres are
composed of several hundred tandem repetitions
of the GGGTTA sequence. This structure is
recognized by an enzyme called telomerase which
has the ability to add telomeric repeats at the end
of the chromosomes thereby counteracting
chromosomal shortening due to replication. The
telomerase is a peculiar enzyme that is composed
of both a protein and an RNA subunit. The RNA
subunit is complementary to the telomeric repeats
and, by hybridizing with them, creates a primer-template
structure that allows extension of the
telomeric leading strand by the reverse
transcriptase activity of the telomerase. It has been
postulated that ageing might be related to ceased
telomerase activity thereby leading to progressive
chromosomal shortening and degradation. On the
contrary, many cancers are characterized by
telomerase induction contributing to the
immortality of cancer cells.
At the end of S phase, each chromosome has been
duplicated. However, the two identical double-helices,
referred to as “sister chromatids”, remain
glued together along their entire length, embraced
by large ring-like protein complexes called
cohesins.
The S phase is also characterized by centrosome
duplication. Centrosomes comprise a centriole pair
embedded in pericentriolar matrix. The pair of
centrioles separates, and daughter centrioles form
at the base of each mother centriole.
Mitosis equitably distributes the replicated DNA
amongst daughter cells
After DNA replication during S phase and a short
G2 period, the cell engages in a finely tuned ballet
aimed at distributing the duplicated genetic
material equitably amongst the two daughter cells:
the M-phase.
The M-phase starts with mitosis, typically
subdivided in five phases: prophase,
prometaphase, metaphase, anaphase and
telophase. Prophase is characterized by the
progressive condensation of the replicated
chromosomes in the nucleus, reflecting the
activation of condensin complexes resembling
cohesins. Paired sister chromatids become visible.
Meanwhile the duplicated centrosomes move apart
and initiate the formation of the mitotic spindle.
Projected tubulin-based microtubules can stabilize
in three ways: by interacting with the cell
membrane (astral microtubules), by interacting
with microtubules emanating from the opposite
centrosomes (interpolar microtubules) or by
interacting with the kinetochores, large protein
complexes located at the centromeres of each sister
chromatid (kinetochore microtubules). Centromeres
are characterized by long stretches of tandemly
repeated sequences referred to as satellite DNA.
The latter stabilization can only occur after the
breakdown of the nuclear envelope, which
liberates the chromosomes and marks
prometaphase. Opposing forces applied on either
side cause the sister chromatid pairs to align at the
equator of the spindle, defining metaphase. At
anaphase, sister chromatids are disconnected as
the enzyme separase digests the cohesin
complexes, and segregate towards opposite ends
of the cell as a result of (i) the shortening of the
kinetochore microtubules (anaphase A) and (ii) the
separation of the spindle poles as a result of the
extension and sliding of the interpolar microtubules
and the shortening of the astral microtubules
(anaphase B). Telophase is characterized by the
progressive decondensation of the two sets of
segregated chromatids and their sequestration
within two newly assembled nuclear envelopes.
The M-phase is concluded by the actual formation
of two daughter cells by the binary fission of the
cytoplasm by a contractile ring of actin and myosin:
cytokinesis.
Progression through M-phase is governed by the
activation of M-Cdk, a cyclin dependent kinase that
drives the process by phosphorylating specific
target proteins including condensin subunits,
nuclear lamins, the anaphase-promoting
complex, etc.
The outcome of the M-phase is two diploid
daughter cells having exactly the same genetic
material as their progenitor cell, i.e. two copies of
each of the chromosomes characteristic of the
species of interest.
The central dogma
All the instructions needed for proper functioning
of the cell are encrypted in its chromosomes. But
what is the nature of the message and how is it
read? Our present view is summarized by the
central dogma: RNA copies of the genes are
generated by transcription. These messenger
molecules migrate to the cytoplasm where, by the
process of translation, they guide the formation of
proteins whose tridimensional structure
determines their primarily catalytic functions. By
accelerating specific chemical reactions in the cell
these enzymes act as turnouts guiding cellular
metabolism. The information thus flows from DNA
to RNA to protein, the activity of the latter enabling
manifestation of the phenotype. This basic
underlying principle is thought to be shared by all
living organisms on earth and was therefore
dubbed “the central dogma”.
Note that erecting dogmas is not a welcome
practice in science. Theories and models are mere
ways to summarize the present knowledge and to
guide the design of novel experiments that are
aimed at revealing the shortcomings of present
knowledge. True scientists strive towards
changing, improving the present models. As a
matter of fact this central dogma has had to be
amended already. Reverse transcriptases allow for
an information flow from RNA to DNA, while some
RNA molecules (ribozymes) are endowed with
catalytic activity on their own, without being
translated into protein.
The mammalian genome contains ∼20,000 mostly
split protein coding genes
The complete sequence of the human genome was
obtained in 2001. The next few years saw the
completion of the genome of mouse, chicken, dog,
cow, horse and pig, complemented with shallower
sequencing of the genome of a growing list of other
mammals. One of the major surprises was that the
mammalian genome only contains ∼20,000 protein
encoding genes. These numbers have to be
compared with the ∼14,000 genes found in the
genome of the fruitfly D. melanogaster, ∼19,000 in
the genome of the nematode C. elegans, and
∼25,000 in the genome of the little plant A.
thaliana. Most scientists were predicting that the
organismal complexity of mammals would require
more genes than any other living organism and
textbooks from the late 1990s would typically
cite 100,000 genes or more. If indeed more
complex, mammals don’t derive this complexity
from a higher number of genes. Along similar
humbling lines, the human genome does not
contain more genes than that of other mammals.
One of the unexpected features of the majority of
eucaryotic genes, discovered in 1977, is their “split”
nature: they are subdivided in exons that are
separated by non-coding intervening introns.
Introns are removed by the splicing process, which
occurs after transcription (vide infra). The resulting
messenger RNA (mRNA) comprises a 5’
untranslated region (5’UTR), the protein encoding
open reading frame (ORF), and a 3’UTR.
When examining the genomes of procaryotes,
unicellular eucaryotes or even that of D.
melanogaster or C. elegans, it appears that the
majority of the sequence space is devoted to the
protein-encoding genes, hence emphasizing their
fundamental contribution in guiding development
of the body. When examining the genome of
mammals, on the contrary, what is striking is how the
sequence space devoted to protein-coding capacity
per se is diluted: it only represents 1.5% of the
genome. This reflects the fact that intergenic
sequences have become much larger on average,
combined with the fact that intron size has
considerably increased. The average human gene
spans 27 Kb and encompasses 10.4 exons measuring
143 base pairs on average, separated from
each other by introns of ∼3.5 Kb on average.
The meaning of this inflation of mammalian
genome size (and hence the dilution of the
sequence space devoted to protein-encoding
capacity) remains largely unknown. For some, it
indicates that multicellular organisms, which have
less pressure to divide rapidly, have allowed their
genome to be invaded by useless, “selfish” parasitic
DNA (vide infra). For others, the majority of non-coding
DNA holds the regulatory secret to the
organismal complexity characterizing higher
vertebrates.
Why the majority of genes in metazoans are split
remains a mystery. One hypothesis is that it
facilitates the creation of novel genes by “exon-shuffling”.
Indeed, distant proteins share common
protein domains and, at least in some instances,
exon boundaries coincide with protein domain
boundaries.
Genes may show similarities extending beyond the
sharing of sub-domains. Hence, some genes are
encountered in the genome in multiple, virtually
identical redundant copies. This is typical for genes
that are coding for proteins that are needed in
large quantities that cannot be obtained from just
one gene, such as the histone proteins. Some
genes are not identical but nevertheless clearly
similar over their entire length. Such genes are
thought to derive by duplication from a common
ancestor gene and constitute families of
“paralogous” genes. Amongst the best-known
gene families are the globins: our genome contains
two clusters of α- and β-globin genes
on chromosomes 16 and 11, respectively, which
result from the duplication of a common ancestor
gene some ∼500 million years ago. Subsequent
duplications of these α- and β-globin founder genes
have generated two clusters each comprising
multiple genes. Within each cluster, genes with
slightly distinct protein sequence are activated at
different stages of development, providing the
organism with haemoglobins that are optimally
adapted for each developmental stage.
Gene expression is primarily regulated at the
transcriptional level
As previously mentioned, all somatic cells contain
two full copies of the genetic encyclopedia, yet
they will only transcribe the 30-60% of genes which
they need to function properly.
Gene transcription is performed by RNA
polymerases. Assisted by auxiliary proteins (σ in E.
coli and a number of “general transcription
factors” in eucaryotes), the RNA polymerase will
recognize the promoter sequence, which generally
lies just upstream of the transcription initiation
site. Promoter sequences are often defined by
consensus sequences of which some include a
TATA-box. Recognition of the promoter sequences
results in local denaturation of the double-helix and
initiation of transcription. The polymerase
synthesizes an RNA molecule in the 5’ to 3’
direction complementary to the template strand.
Complementarity rules are as for DNA, except that
uracil replaces thymine. The auxiliary initiation
factors are released, allowing the RNA polymerase
to proceed with elongation. Newly synthesized RNA
and template DNA form a double-helix over only a
short stretch, the original DNA double-helix being
rapidly reformed as transcription proceeds through
the gene. In procaryotes transcription termination
occurs at specific termination signals, while in
eucaryotes the site of termination is somewhat
stochastic.
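The relationship between template strand and transcript can be sketched in a few lines of Python: the RNA is complementary and anti-parallel to the template, with uracil in place of thymine. The function name and the example sequence are hypothetical.

```python
# DNA template base -> RNA base incorporated opposite it (U replaces T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_5_to_3: str) -> str:
    """Return the RNA, written 5' to 3', copied from a DNA template strand.

    The template is given 5' to 3'; the polymerase reads it 3' to 5',
    so the template is traversed in reverse.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_5_to_3))


template = "GTAACCGCAT"       # hypothetical template strand, 5' to 3'
rna = transcribe(template)
print(rna)                    # AUGCGGUUAC
```

The transcript has the same sequence as the non-template strand of the gene (here ATGCGGTTAC), apart from U replacing T.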
Which strand of the DNA double-helix is used as
the template strand depends on the gene:
transcription will proceed from left to right for
some genes, and in the opposite direction for other
genes. While in E. coli one RNA polymerase is
responsible for the transcription of all genes, three
distinct RNA polymerases exist in eucaryotes: RNA
pol I (ribosomal RNAs), pol II (mRNAs and other
RNAs) and pol III (small RNA species).
In eucaryotes, transcription proceeds at ∼20
nucleotides per second. The typical ∼27Kb gene
thus requires approximately 20 minutes to be
transcribed. For some of the larger genes,
completing transcription may take tens of hours.
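The transcription times quoted above follow from dividing gene length by elongation rate. A minimal sketch of the arithmetic; the 2.4 Mb gene length is a hypothetical figure used only to illustrate a very large gene.

```python
ELONGATION_RATE_NT_PER_S = 20   # eucaryotic rate given in the text


def transcription_time_minutes(gene_length_nt: int,
                               rate_nt_per_s: float = ELONGATION_RATE_NT_PER_S) -> float:
    """Time needed for one polymerase to traverse the gene, in minutes."""
    return gene_length_nt / rate_nt_per_s / 60


typical = transcription_time_minutes(27_000)         # typical ~27 Kb gene
very_large = transcription_time_minutes(2_400_000)   # hypothetical 2.4 Mb gene
print(f"{typical:.1f} min, {very_large / 60:.1f} h")
```

With these inputs the typical gene takes 22.5 minutes, in line with the approximate 20 minutes cited, and the hypothetical 2.4 Mb gene takes about 33 hours.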
A gene may be simultaneously transcribed by
several RNA polymerases, if large amounts of gene
product are needed.
The decision on whether to transcribe a gene or
not in a given cell is not made by the RNA
polymerase and its auxiliary initiation factors.
Transcription is regulated by gene switches.
Components that make up gene switches were first
identified in bacteria and phages. Gene switches
comprise cis- and trans-acting elements. The cis-acting
elements are short segments of the double-helix
in the vicinity of the gene they regulate. They
are called “operators” in procaryotes. The
corresponding nucleotide sequence defines a
unique surface in the major groove that can be
specifically recognized by the trans-acting
component of the switch: gene regulatory proteins
with matching DNA reading domains. Regulatory
proteins are classified according to type of DNA
reading domain: helix-turn-helix, zinc finger motifs,
leucine zippers, helix-loop-helix, etc. Bound to the
operator, some regulatory proteins will act as
repressors (precluding access of the basal
transcriptional machinery to the promoter by, for
instance, steric hindrance), and others as activators
of transcription (facilitating the access of the basal
transcriptional machinery to the promoter).
Through regulatory proteins cells have the ability to
adapt to changing environmental conditions: on
binding of a ligand, regulatory proteins undergo an
allosteric transition which will either allow them to
bind or preclude them from binding to the operator. The
combination of target gene, operator, regulatory
proteins and ligand jointly composes an operon.
One of the best understood operons is the lactose
(Lac) operon. The Lac operon encodes proteins
that are required to transport lactose in the
bacterial cell and then catabolize it (i.e. β-galactosidase
and permease). The Lac operon is
both under negative and positive transcriptional
control. In the absence of lactose in the medium
the Lac repressor binds to the operator thereby
preventing transcription. However, when lactose
is present in the medium it acts as an inducer by
binding to the Lac repressor and thereby precluding
it from binding to the operator sequence.
However, if glucose (the preferred carbon source)
is also present in the medium, transcription will not
proceed. Indeed, productive transcription not only
requires release of the Lac repressor from the
operator, but also binding of the catabolite
activator protein (CAP) to a distinct cis-element
lyingjustupstreamofthetranscriptionstartsite.If
glucose is abundant in the medium, intracellular
concentrationsofcyclicAMP(cAMP)willbelow.In
theabsenceofcAMP,CAPcan’tbindtoitscognate
cis-element, hence preventing transcription of the
Lac operon. If glucose levels drop, however, cAMP
levels increase. Binding of cAMP to CAP allows the
latter to bind to its cis-element and to fulfill its
activator function, hence promoting transcription.
For E. coli to transcribe its Lac operon thus
requires lactose without glucose in the medium.
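The dual control described above can be summarized as a small truth table. The following is a minimal sketch in Python, assuming an all-or-nothing switch (real expression is graded and somewhat leaky); the function name and its arguments are hypothetical and only serve the illustration.

```python
def lac_operon_transcribed(lactose_present: bool, glucose_present: bool) -> bool:
    """Simplified on/off model of Lac operon control (illustrative assumption)."""
    # Negative control: without lactose, the repressor sits on the operator.
    repressor_on_operator = not lactose_present
    # Positive control: low glucose -> high cAMP -> CAP binds its cis-element.
    camp_high = not glucose_present
    cap_bound = camp_high
    # Productive transcription needs a free operator AND bound CAP.
    return (not repressor_on_operator) and cap_bound


if __name__ == "__main__":
    for lactose in (False, True):
        for glucose in (False, True):
            state = "ON" if lac_operon_transcribed(lactose, glucose) else "OFF"
            print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {state}")
```

Only the combination "lactose present, glucose absent" switches the operon on in this model, which is the conclusion reached in the text.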
[BOX 1: Genetic definition of the Lac operon
components]
Transcriptional regulation in eucaryotes shares
several of the basic features uncovered in
procaryotes. There are, however, several
specificities:
(i) Cis-acting elements (called enhancers when
activating transcription and silencers when
repressing transcription) may be quite distantly
removed from the promoter sequences which
they control, requiring looping of the
intervening DNA to allow interaction. Cis-acting
elements may be located upstream as
well as downstream from the transcription
start site. Insulator sequences restrict the
effect of gene switches to specific domains.
(ii) Interaction between regulatory proteins and
basal transcriptional machinery is often indirect,
requiring an intervening “Mediator” protein.
(iii) Transcription of eucaryotic genes is usually
controlled by multiple regulatory proteins,
allowing customized expression levels in
different cell types according to needs, as well
as combinatorial gene control. Regulatory
proteins typically control many genes.
(iv) The activity of regulatory proteins can be
modulated in many ways including protein
synthesis, ligand binding, covalent
modifications, addition of subunits, unmasking,
stimulation of nuclear entry or release from the
membrane.
(v) Local chromatin state is an essential
component of eucaryotic transcriptional gene
regulation. By recruiting specialized proteins,
gene regulatory proteins influence chromatin
structure.
Alternative splicing increases coding capacity
Translation of procaryotic mRNAs begins while
transcription is progressing. In eucaryotes, on the
contrary, the nascent pre-mRNA undergoes a
series of processing steps prior to export of
the mature mRNA to the cytoplasm, where it will be
translated.
Maturation of eucaryotic mRNAs involves the
addition of a “cap” structure at the 5’ end, excision
of the intronic sequences and joining of the exons
(splicing), and endonucleolytic trimming of the
3’ end and addition of a poly-A tail
(polyadenylation).
The cap structure corresponds to a GMP that is
added in reversed orientation to the 5’ end of the
mRNA by an unusual 5’-to-5’ triphosphate bridge.
In addition, the guanine is methylated at
position 7.
Splicing of the introns is initiated by the attack of
the 5’ “donor” splice site by an adenine located ∼35
bases upstream of the 3’ “acceptor” splice site.
This attack severs the phosphodiester link between
the upstream exon and the intron and binds the 5’ end
of the intron to the 2’ position of the attacking
adenine, thereby creating a loop in the intron. The
unmasked 3’-OH group of the upstream exon
subsequently attacks the acceptor splice junction,
thereby releasing the intron as a lariat while joining
the exons. Splicing is performed by a complex
machinery known as the spliceosome. Small
nuclear ribonucleoproteins (snRNPs) form the core
of the spliceosome. As their name implies, snRNPs
comprise small nuclear RNA (snRNA) components
which, by virtue of base-pair complementarity,
recognize the splice donor, branching and splice
acceptor sites, each characterized by short
consensus sequences.
Contrary to procaryotes, transcription termination
in eucaryotes doesn’t occur at very specific sites.
Nevertheless, the 3’ ends of mRNAs are neatly
specified. Indeed,
consensus nucleotide sequences (including a
canonical AAUAAA 10-30 nucleotides upstream of
the polyadenylation site) are recognized by two
multisubunit proteins called cleavage stimulation
factor (CstF) and cleavage and polyadenylation
specificity factor (CPSF). These bind to the
transcribed RNA, assisting other proteins in (i)
cleaving the nascent transcript, and (ii) adding ∼200
A residues (= poly-A tail) at the 3’ end produced by
cleavage.
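As an illustration of how the AAUAAA signal positions the 3’ end, the sketch below scans an RNA string for the hexamer and reports the window lying 10 to 30 nucleotides downstream in which cleavage would be expected. It is a simplified sketch: the function name is hypothetical, the distance is assumed to be counted from the end of the hexamer, and the other consensus elements recognized by CstF and CPSF are ignored.

```python
def candidate_cleavage_windows(rna, signal="AAUAAA", min_gap=10, max_gap=30):
    """Return (signal_start, window_start, window_end) tuples, 0-based.

    window_start/window_end delimit the half-open interval of positions,
    min_gap to max_gap nucleotides past the end of the signal, in which
    cleavage would be expected (assumption: distance counted from the
    last nucleotide of the hexamer).
    """
    windows = []
    start = rna.find(signal)
    while start != -1:
        signal_end = start + len(signal)
        window_start = signal_end + min_gap
        window_end = min(signal_end + max_gap, len(rna))
        # Keep the hit only if the transcript extends into the window.
        if window_start < len(rna):
            windows.append((start, window_start, window_end))
        start = rna.find(signal, start + 1)
    return windows


if __name__ == "__main__":
    # Hypothetical 3' region: two leading bases, the signal, then 40 filler bases.
    toy_rna = "GG" + "AAUAAA" + "C" * 40
    print(candidate_cleavage_windows(toy_rna))
```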
These three processing reactions occur
cotranscriptionally: the enzymes performing these
reactions are tethered to the growing RNA chain
by binding to the trailing carboxyterminal domain
of RNA pol II.
We have seen that the evolutionary significance of
the split gene structure might be to promote exon
shuffling. There is also an immediate benefit of
splicing to the organism. Indeed, a given gene may
undergo more than one type of splicing reaction,
resulting in mature mRNAs with distinct exon
selections, thereby encoding proteins differing in
their amino-acid sequence and hence functionality.
This “alternative” splicing is particularly common in
mammals, where an estimated 75% of the genes
are subject to alternative splicing, generating ∼3
distinct mRNAs on average. This phenomenon
considerably increases the coding capacity of the
∼20,000 mammalian genes, hence maybe
compensating for their lower than expected
number.
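A back-of-the-envelope calculation gives a feel for the gain in coding capacity. The sketch below uses the figures quoted above and makes two assumptions that are not in the text: the average of ∼3 mRNAs applies to the alternatively spliced genes only, and every other gene yields a single mRNA. The function name is hypothetical.

```python
def estimated_distinct_mrnas(n_genes=20_000, fraction_alt_spliced=0.75,
                             mrnas_per_alt_gene=3):
    """Rough count of distinct mature mRNAs under the stated assumptions."""
    alt_genes = n_genes * fraction_alt_spliced
    single_isoform_genes = n_genes - alt_genes  # assumed: one mRNA each
    return round(single_isoform_genes + alt_genes * mrnas_per_alt_gene)


if __name__ == "__main__":
    total = estimated_distinct_mrnas()
    print(f"~{total:,} distinct mRNAs from 20,000 genes "
          f"({total / 20_000:.1f} per gene on average)")
```

Under these assumptions the ∼20,000 genes would yield on the order of 50,000 distinct mRNAs, i.e. 2.5 per gene on average.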
Translation exploits a written language, the
genetic code.
The nucleotide sequence of a mature mRNA
dictates the amino-acid sequence of the encoded
protein, in a process referred to as translation.
Because there are only four distinct nucleotides,
while there are 20 different amino-acids, the mRNA
needs to be read in words of three successive
nucleotides. Indeed, words of two nucleotides
(doublets) could only code for 4x4=16 different
amino-acids (i.e. fewer than needed), while
words of three nucleotides (triplets) have
the capacity to code for 4x4x4=64 distinct
amino-acids (i.e. more than actually needed). The
correspondence between specific triplets (or
codons) and the amino-acids they encode is known
as the genetic code. One of the remarkable
features of the genetic code is its universality: it is
virtually identical for all living organisms on earth,
hence an additional testimony of our shared
ancestry. All 64 codons are utilized in the genetic
code: 61 are used to code for amino-acids, while
three correspond to stop codons which cause
termination of translation. As there are more
amino-acid encoding codons than amino-acids, the
genetic code is said to be redundant: the number
of codons per amino-acid ranges from one to six.
Synonymous codons typically differ at the 3’ end.
[Box2:…]
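The counting argument above can be checked directly. The following sketch builds the standard genetic code from its usual compact 64-letter representation (bases ordered U, C, A, G; "*" marks a stop codon) and counts doublets, triplets, stop codons and synonymous codons. The variable names are choices made for this illustration.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

n_doublets = len(BASES) ** 2   # 4x4: too few words for 20 amino-acids
n_triplets = len(CODON_TABLE)  # 4x4x4: more words than needed
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
# Number of synonymous codons for each amino-acid (stops excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

if __name__ == "__main__":
    print("doublets:", n_doublets, "triplets:", n_triplets)
    print("stop codons:", stop_codons)
    print("sense codons:", sum(degeneracy.values()))
    print("codons per amino-acid:", dict(degeneracy))
```

In this table methionine and tryptophan have a single codon each, whereas leucine, serine and arginine have six.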
How is the sequence of codons translated into a
string of the corresponding amino-acids? The
amino-acids themselves can’t “read” their cognate
triplet. Translation requires the successive action
of two adaptor systems. The first adaptors are
small RNA molecules called tRNAs, for transfer
RNAs. In eucaryotes, tRNA genes are transcribed
by RNA polymerase III. The ensuing transcripts fold
locally into L-shaped structures stabilized by
intramolecular base pairing. These ∼80 nt structures are
trimmed from their larger precursor molecules and
undergo a series of modifications, including intron
splicing and the chemical modification of ∼10% of
their residues. As a result, tRNAs are characterized
by nucleotides such as inosine (I), pseudouridine
(ψ), dihydrouridine (D), ribothymidine (T), etc. in
addition to the usual A, C, G and U residues. The D
and T residues give their name to the D and T loops
characterizing the so-called clover leaf
representation of tRNA molecules. The remaining
central leaf corresponds to the anticodon loop,
characterized by three nucleotides which are
complementary to the codons that the
corresponding tRNA will recognize in the mRNA to
be translated. Anticodons may sometimes
recognize more than one codon differing at the
3’ end or wobble position. Hence, a G (respectively
U) at the 5’ end of the anticodon will base-pair with
a C (respectively A) at the 3’ end of the codon but
may also engage in wobble base-pairing with a U
(respectively G) at that position. An I at the 5’ end
of the anticodon will base-pair with either a C, U or
even A at the 3’ end of the codon. Wobble base-pairing
thus accounts for part of the redundancy of
the genetic code, but not all of it. Several tRNAs
are often needed to recognize the different
synonymous codons. The stem of the clover
leaf is characterized by a 3’ overhang of three
nucleotides.
The second set of adaptors, the amino-acyl tRNA
synthetases, attach specific amino-acids to the
3’-OH moiety of the last nucleotide. In general, there
is one amino-acyl tRNA synthetase per amino-acid.
It has the ability to specifically recognize its cognate
amino-acid as well as the different matching tRNAs.
These are recognized by their specific
three-dimensional structure, defined in part by the
specificity of the residue modifications.
The loaded tRNA allows matching of a specific
codon with the cognate amino-acid. However,
linking the successive amino-acids in a protein
string requires the additional action of the
ribosomes. Ribosomes are composed of a small
and a large subunit. In eucaryotes, the small 40S
subunit is made up of one rRNA molecule (18S) and
∼33 proteins, while the large 60S subunit is made
up of three rRNA molecules (28S, 5.8S and 5S) and
∼49 proteins. Ribosome subunits are assembled in
the nucleolus. During translation, small and large
subunits clamp the mRNA, exhibiting the successive
codons at the bottom of three pockets referred to
as the A (amino-acyl), P (peptidyl) and E (exit) sites.
Translation proceeds in three-step cycles. In a first
step, tRNAs matching the corresponding codons
occupy the P and A sites. The rRNA molecules from
the large subunit, acting as ribozymes, catalyze a
peptidyl transferase reaction that transfers the
amino-acid (or peptide) attached to the 3’-OH end
of the tRNA in the P site to the amino group of the
amino-acid on the tRNA in the A site in a so-called
head growth type of polymerization. Peptides thus
grow at their carboxyterminal end. In a second
step, the ribosome moves one codon down the
mRNA. This movement is enabled by the
mechanical work accomplished by two GTPase-type
molecular machines: elongation factors EF1 and
EF2 in eucaryotes. This moves the now unloaded
tRNA from the P to the E site, forcing its exit, while
exposing an empty A site allowing entry of a new
loaded tRNA: step 3.
Each mRNA can be translated in three possible
reading frames of which only one is (usually)
correct. Selection of the correct reading frame
occurs during translational initiation, which is
accomplished by a small ribosomal subunit that is
preloaded with a specific initiator tRNA always
carrying a methionine. In eucaryotes, this complex
recognizes the capped 5’ end of the mRNA and
scans the 5’UTR until it finds the first AUG codon.
Dissociation of initiation factors then allows
assembly with a large subunit and initiation of the
translation steps. In procaryotes, recognition of
the AUG start codon does not require 5’ to 3’
scanning of the 5’UTR, but is mediated by direct
base-pairing of 16S rRNA to a complementary
sequence (Shine-Dalgarno sequence) just upstream
of the start codon. This difference accounts for
the monocistronic nature of most eucaryotic mRNAs
versus the polycistronic character of many
procaryotic mRNAs.
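The eucaryotic scanning mechanism can be mimicked in a few lines: find the first AUG from the 5’ end, which sets the reading frame, then read successive triplets until a stop codon is met. This is a minimal sketch, assuming the standard genetic code and ignoring the sequence context around the AUG; the function name and the toy mRNA are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the order UUU, UUC, UUA, UUG, UCU, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_aug(mrna):
    """Scan 5' to 3' for the first AUG and translate up to the first stop codon.

    Returns the peptide in one-letter code, or "" if no AUG is found.
    Characters other than A, C, G, U raise a KeyError.
    """
    start = mrna.find("AUG")  # the first AUG sets the reading frame
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon exposed in the A site
            break
        peptide.append(amino_acid)
    return "".join(peptide)


if __name__ == "__main__":
    # Toy mRNA: 5'UTR (GGCACC), AUG GCU UGG AAA, stop (UAA), then 3' residue.
    toy_mrna = "GGCACCAUGGCUUGGAAAUAAGC"
    print(translate_from_first_aug(toy_mrna))
```

Shifting the start by one or two nucleotides would yield a completely different string of amino-acids, which is why frame selection at initiation matters.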
When one of the three stop codons is exposed at
the bottom of the A site, a protein known as
translation release factor (eRF1), whose
three-dimensional structure mimics that of a tRNA,
enters the ribosome and causes the peptidyl
transferase activity to add a water molecule,
thereby releasing the synthesized protein and
causing dissociation and release of the ribosome.
mRNAs can be translated simultaneously by multiple
ribosomes in both pro- and eucaryotes, generating
so-called polysomes (or polyribosomes).
The nascent proteins are progressively pushed
through a narrow channel. On their emergence,
protein domains progressively fold
co-translationally, assisted if necessary by different
types of chaperone proteins.
The expanding world of non-coding RNAs
The central dogma focuses on the role of mRNA,
which will be translated into proteins, assisted by
the non-coding rRNAs and tRNAs. In
addition to these three major classes of RNAs, the
cell transcribes a multitude of other types of
non-coding RNA molecules. These include the small
nuclear RNAs (snRNAs) that are part of the small
nuclear ribonucleoproteins (snRNPs) which make
up the spliceosome, or the small nucleolar RNAs
(snoRNAs) that participate in posttranscriptional
modification of rRNAs.
In recent years, a multitude of novel non-coding
RNA genes have been discovered. Some of these
are referred to as long non-coding RNA genes.
They are typically transcribed by RNA pol II, are
usually spliced, often evolutionarily highly
conserved, yet not characterized by an obvious
open reading frame. Long non-coding RNA genes
include the telomerase RNA gene coding for the
RNA component of telomerase, the XIST gene
playing a central role in the inactivation of one of
the X chromosomes in females, and the H19 and
AIR imprinted genes. However, the function of
most of the long non-coding RNA genes remains
poorly understood.
The cell also contains a multitude of small
non-coding RNA genes. Micro RNAs (miRNAs) are
amongst the best understood small non-coding
RNA genes. Most miRNAs are processed in the
nucleus as small hairpin loops (called pre-miRNAs)
cropped from longer RNA pol II transcripts
(including pre-mRNAs) by an enzyme called Drosha.
The ensuing pre-miRNAs are actively exported to
the cytoplasm where they are further processed by
Dicer to generate a short (∼21 bp) double-stranded
RNA characterized by two-residue 3’ overhangs.
One of the two strands is preferentially integrated
in an RNA-induced silencing complex (RISC), which it
guides by base-pair complementarity to the 3’UTR
of specific mRNA targets to be down-regulated.
The down-regulation of the targets involves both
target degradation and translational inhibition. The
mammalian genome contains an estimated ∼1,000
miRNAs, each regulating an estimated ∼200-300
target genes. The majority of mammalian genes
(except housekeeping genes) are subject to
miRNA-mediated regulation.
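These estimates imply that a typical gene is targeted by several miRNAs. The sketch below works through the arithmetic, assuming (for illustration only) that the targets are spread evenly over the ∼20,000 mammalian genes mentioned earlier in this chapter; the function name is hypothetical.

```python
def mirna_targeting_load(n_mirnas=1_000, targets_per_mirna=(200, 300),
                         n_genes=20_000):
    """Return ((min, max) miRNA-target pairs, (min, max) miRNAs per gene)."""
    low, high = targets_per_mirna
    pairs = (n_mirnas * low, n_mirnas * high)
    # Even spread over all genes is an assumption, not a measured value.
    per_gene = (pairs[0] / n_genes, pairs[1] / n_genes)
    return pairs, per_gene


if __name__ == "__main__":
    pairs, per_gene = mirna_targeting_load()
    print(f"miRNA-target pairs: {pairs[0]:,} to {pairs[1]:,}")
    print(f"miRNAs per gene (average): {per_gene[0]:.0f} to {per_gene[1]:.0f}")
```

Under this assumption, each gene would be regulated by some 10 to 15 miRNAs on average, consistent with the combinatorial nature of gene control noted for regulatory proteins.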
Novel genomic methods (Chapter 4) now allow for
a more detailed survey of transcriptional activity
across the genome. The picture that emerges is
one of “pervasive” transcription that is much more
elaborate than expected. Rather than being
confined to genes (excluding intergenic regions),
with clear transcription initiation and termination
sites, transcriptional activity is now detected pretty
much throughout the entire genome, including
intergenic regions. Within genic regions, a panoply
of transcripts is detected, usually from both strands
and often spanning multiple adjacent genes. The
function, if any, of these multiple and diverse
transcripts is a subject of intense research.
Half of the mammalian genome is composed of
interspersed transposons & pseudogenes
One of the most striking findings emerging from
sequencing mammalian genomes is that close to
50% of the sequence space is occupied by
interspersed repetitive sequences.
Approximately half of these are “autonomous”
transposable elements falling in three major
categories: DNA transposons, retrovirus-like LTR
retrotransposons, and non-LTR retrotransposons or
LINEs (Long INterspersed Elements). They are
referred to as autonomous because they typically
contain the genes coding for the enzymes needed
for transposition. DNA transposons move via a
non-replicative cut-and-paste mechanism involving
only DNA. Retrotransposons move via an RNA
intermediate that is reverse transcribed into DNA
prior to reintegration in the genome. The
retrotransposition mechanism of LTR elements is
very similar to that of their close cousins the
retroviruses, and distinct from that of LINEs. Table
x summarizes the number of genomic copies as
well as the genomic space occupied by each one of
these elements.
In addition to these “autonomous” interspersed
repeats, the genome contains supposedly
non-functional copies of many genes, referred to as
pseudogenes. Pseudogenes can be (i)
unprocessed, including promoter sequences, exons
and introns, and resulting from intragenic
duplication events, or (ii) processed pseudogenes
resulting from reverse transcription of a “primary”
transcript (by reverse transcriptases encoded by
autonomous retroposons) and ectopic
reintegration in the genome. Given their origin,
processed pseudogenes are typically devoid of
promoter sequences and introns and may be
characterized by a poly-A tail. Most genes are
characterized by none or a very small number of
pseudogenes in the genome. Some genes,
however, have spawned very large numbers of
processed pseudogenes. This is particularly the
case for small genes such as 7SL-RNA or tRNA
genes. One of the features of RNA pol III genes is
that the promoter sequences are intragenic and
hence carried along during the retrotransposition
process. As a consequence, the processed
pseudogenes can be transcriptionally competent,
hence capable of generating pseudogenes
themselves. Moreover, some of these
pseudogenes have acquired the 3’ end of LINEs (by
integrating into a LINE), thereby becoming
preferred substrates for the reverse transcriptases.
These very abundant pseudogene families are
referred to as SINEs (Short INterspersed Elements),
and may represent as much as x% of the genome
(Table x).
The evolutionary significance of the large
proportion of the vertebrate genome occupied by
interspersed repeats remains somewhat of a
mystery. On the one hand, interspersed repeats are
considered as parasitic selfish DNA. Supporting this
view is the fact that a few species (e.g. the
pufferfish) seem to have largely escaped
transposon invasion while seeming perfectly
adapted to their environment. It is also
increasingly apparent that eucaryotes have
deployed a range of “immune” strategies to control
the proliferation of transposons. There are
footprints of past periods of transposon epidemics
during which specific families of transposable
elements transiently escaped non-adaptive
surveillance mechanisms and spread across the
genome. Some extant species exhibit signs of
considerable transposon activity, and, in these,
insertional inactivation of genes by transposon
integration contributes significantly to the
mutational load. On the other hand, there are clear
examples where transposon sequences have been
coopted in what are now important cellular
functions, pointing towards at least some degree of
symbiosis between transposons and their hosts.
DNA repair (?)
Mitochondrial DNA
A description of our genome would be incomplete
without mentioning the mitochondrial genome.
Indeed, mitochondria have their own genome. In
mammals, the mitochondrial genome is a small
circular molecule of ∼16.5 kb. However, as every
mitochondrion may contain 5 to 10 such molecules
and as a single cell may contain hundreds of
mitochondria, the mitochondrial DNA may account
for as much as 1% of the cellular DNA. The
mitochondrial genome comprises 13
protein-encoding genes (coding for components of the
electron transport chain and ATP synthase), 22
tRNA genes, 2 rRNA genes and the D-loop
encompassing the bidirectional origin of replication
and transcription. Mitochondria have their own
translational machinery (including ribosomes)
operating in the mitochondrial matrix.
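The "as much as 1%" figure can be checked with a short calculation. The sketch below assumes a diploid nucleus of 2 x 3x10^9 base pairs (the haploid size given at the start of this chapter) and takes 100 and 500 mitochondria per cell as purely illustrative stand-ins for "hundreds"; the function name is hypothetical.

```python
def mtdna_fraction(copies_per_mitochondrion, mitochondria_per_cell,
                   mt_genome_bp=16_500, haploid_nuclear_bp=3_000_000_000,
                   ploidy=2):
    """Fraction of total cellular DNA (in bp) contributed by mitochondrial DNA."""
    mt_bp = mt_genome_bp * copies_per_mitochondrion * mitochondria_per_cell
    nuclear_bp = haploid_nuclear_bp * ploidy
    return mt_bp / (mt_bp + nuclear_bp)


if __name__ == "__main__":
    # Illustrative low and high scenarios (assumed values, see lead-in).
    low = mtdna_fraction(copies_per_mitochondrion=5, mitochondria_per_cell=100)
    high = mtdna_fraction(copies_per_mitochondrion=10, mitochondria_per_cell=500)
    print(f"mtDNA share of cellular DNA: {low:.2%} to {high:.2%}")
```

With these inputs the share ranges from roughly 0.1% to a little over 1%, in line with the order of magnitude quoted above.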
The organization and operation of the
mitochondrial genome share many features
with procaryotes. Indeed, mitochondria
are thought to be the remains of aerobic bacteria
that were engulfed by an ancestral anaerobic
eucaryotic cell, initiating a symbiotic relationship.
This is thought to have happened 1.5 billion years
ago, when as a result of aerobic bacterial activity
oxygen started to accumulate in the atmosphere.
Over time, most of the mitochondrial genes were
translocated to the nucleus, requiring the
concomitant development of mechanisms to
transport the corresponding proteins from the
cytoplasm (where they are synthesized) into the
mitochondria.
The small number of genes remaining in the
mitochondrial genome may be trapped there as
mitochondrial genetic codes evolved into a series of
slightly different dialects. In mammals, five codons
have seen their meaning drift away from the
universal code, while the ORFs of the mitochondrial
protein-coding genes coevolved, thereby hampering
their translocation to the nucleus.