Introductory genetics for veterinary students and the animal sciences. Michel Georges

Chapter I: Genes in cells

“Calm your mind, Michel!” Catherine Georges-Pirenne, my grandmother.

The haploid mammalian genome consists of a double-helix of 3x10^9 base pairs of DNA

The multivolume encyclopedia which we inherit from our parents via the gametes consists of a series of DNA double-helices totaling ∼3x10^9 base pairs. The building blocks of deoxyribonucleic acid (DNA) are the four nucleotides: deoxyadenosine monophosphate (dAMP), deoxyguanosine monophosphate (dGMP), deoxythymidine monophosphate (dTMP) and deoxycytidine monophosphate (dCMP). Nucleotides comprise three elements: a purine (adenine or guanine) or pyrimidine (cytosine or thymine) base, covalently linked to the 1’ carbon of the pentose β-D-2-deoxyribose, which carries one (dNMP), two (dNDP) or three (dNTP) phosphate groups on its 5’ carbon.

Cells also contain nucleotides in which the pentose is ribose rather than deoxyribose. These are ATP/ADP/AMP, GTP/GDP/GMP, CTP/CDP/CMP and UTP/UDP/UMP. The latter is characterized by the pyrimidine base uracil. These four ribonucleotides are the building blocks of ribonucleic acid (RNA) molecules.

Deoxynucleotides assemble into polynucleotide chains. In these, a phosphodiester bond connects the 3’ carbon of the previous nucleotide with the 5’ carbon of the following one. Thus, polynucleotide chains are “polar”, with a distinct 5’ and 3’ extremity.

A DNA double-helix is formed by the anti-parallel juxtaposition (or “hybridization”) of two complementary polynucleotide chains. In these, the sugar-phosphate backbones are like the stiles of a ladder, while the bases face each other and form the rungs. Polynucleotide chains are complementary if A’s on one chain pair with T’s on the other, while G’s pair with C’s. This constraint explains the observation that in DNA, G’s and C’s, as well as A’s and T’s, are always present in equimolar amounts (Chargaff’s rule), although the ratio [G+C]/[A+T] may vary depending on the organism.
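The pairing rules and Chargaff’s rule described above can be illustrated with a minimal Python sketch. The function names and the short example sequence are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5'->3'.

    The two strands are anti-parallel, so the complement is read
    in the opposite direction (hence the reversal).
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand: str) -> dict:
    """Count each base over both strands of the double-helix."""
    duplex = strand + reverse_complement(strand)
    return {base: duplex.count(base) for base in "ACGT"}


def gc_to_at_ratio(strand: str) -> float:
    """[G+C]/[A+T] ratio of the duplex; this varies between organisms."""
    counts = duplex_base_counts(strand)
    return (counts["G"] + counts["C"]) / (counts["A"] + counts["T"])


watson = "ATGCGGCTA"                  # arbitrary example, 5'->3'
crick = reverse_complement(watson)    # the complementary strand, 5'->3'
counts = duplex_base_counts(watson)
print(crick, counts, gc_to_at_ratio(watson))
```

Whatever sequence is chosen, the duplex contains as many A’s as T’s and as many G’s as C’s, while the [G+C]/[A+T] ratio depends on the sequence itself.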
DNA double-helices are stabilized by hydrogen bonds that form between the complementary bases: two for A-T pairs and three for G-C pairs. Thus DNA double-helices rich in G’s and C’s are in general more stable than double-helices rich in A’s and T’s. The base-pairs (bp) are planar and oriented perpendicularly to the stiles, like stairs.

Note that the chemical structure of the bases shown in textbooks corresponds to the most stable isomer, in which they therefore spend most of their time. However, all four bases can also adopt distinct “tautomeric” conformations. At any time, a proportion of the molecules will explore these less stable conformations. The A-T and G-C base pairing rules do not apply to these tautomeric forms. Hence, a rare imino form of cytosine will preferentially pair with A, while a rare enol form of thymine will preferentially pair with G. Such tautomeric shifts may cause errors during DNA replication.

Rather than being flat, the DNA ladder spontaneously adopts a helical structure: the “double-helix”, in which one polynucleotide chain (or strand) is sometimes referred to as Crick and the other as Watson, in honor of the two famous scientists who discovered its structure. The DNA double-helix is a “right-handed” helix, which means that if you were climbing up the DNA ladder your body would perform a clockwise, right-handed rotation. The helix performs one rotation every 10 base-pairs, corresponding to 3.4 nm. Thus, 3x10^9 base pairs amount to ∼1 m of DNA packed in each haploid gamete, or ∼2 m in each diploid cell. Viewed from the side, the double-helix exhibits a major and a minor groove, in which distinct chemical groups are exposed to form sequence-specific surfaces that can bind specific proteins.

While the DNA helix is in general right-handed, specific sequences may in particular conditions locally adopt a left-handed helix configuration. This will impact the number of helix turns per unit length, which will itself impact the degree of DNA supercoiling.
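The length estimate quoted earlier in this section (about 1 m of DNA per haploid genome) follows directly from the helix geometry. The short Python sketch below reproduces that arithmetic, using only the figures given in the text (10 bp and 3.4 nm per turn, 3x10^9 bp per haploid genome); the variable names are illustrative.

```python
# Figures taken from the text.
BP_PER_TURN = 10           # base pairs per helical turn
NM_PER_TURN = 3.4          # length of one helical turn, in nanometres
HAPLOID_GENOME_BP = 3e9    # base pairs in a haploid mammalian genome

# Rise per base pair along the helix axis (0.34 nm).
nm_per_bp = NM_PER_TURN / BP_PER_TURN

# Convert nanometres to metres (1 nm = 1e-9 m).
haploid_length_m = HAPLOID_GENOME_BP * nm_per_bp * 1e-9
diploid_length_m = 2 * haploid_length_m

print(f"haploid: {haploid_length_m:.2f} m, diploid: {diploid_length_m:.2f} m")
```

The result is roughly 1 m per haploid gamete and 2 m per diploid cell, in agreement with the rounded values given in the text.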
While the basic building blocks of DNA are dAMP, dGMP, dCMP and dTMP as described before, these may undergo chemical modifications after their incorporation in a double-helix. The most common DNA modification encountered in the vertebrate genome is the methylation of cytosines at position 5 of the pyrimidine ring. Methylated cytosines are nearly exclusively observed on both strands of the 5’-CpG-3’ dinucleotide palindrome (i.e. the complementary sequence reads the same).

Crick and Watson immediately recognized (“It has not escaped our notice ...”) that the structure of the double-helix was compatible with its function as bearer of the genetic material. First, the two strands of the double-helix carry redundant information. Indeed, if you know the nucleotide sequence of the first you can unambiguously determine the sequence of the second based on the complementarity rules. This feature, they recognized, might be intimately connected with the mechanism of DNA replication needed to double the amount of DNA that could then be partitioned equitably between daughter cells. The two DNA strands might each independently serve as a template for the formation of two novel DNA double-helices identical to the parental one. Moreover, they observed that, while G’s on one strand always face C’s on the other, and likewise A’s face T’s, the structure of the double-helix doesn’t impose any constraints on which type of base-pair succeeds another on the ladder: the structure of the double-helix is essentially the same whatever the base-pair sequence. Thus a message can be encoded in the sequence in exactly the same way as successions of characters form words and sentences in Morse code or in our written language.

The mammalian genome is distributed over a species-specific number of chromosomes

With the exception of some viruses, whose genome is composed of RNA, all living organisms have a genome composed of DNA. The length of the genome, however, varies widely.
It is of the order of 10^6 bp in bacteria (e.g. E. coli), 10^7 bp in unicellular eucaryotes (e.g. the yeast S. cerevisiae), 10^8 bp in non-vertebrate animals (e.g. the fruit fly D. melanogaster or the nematode C. elegans), and 10^9 bp in mammals (Table 1). In general, genome size thus increases with organismal complexity (although this notion is likely to be very anthropocentric). Nevertheless, some closely related organisms (notably amongst plants or amphibians) may differ considerably in the size of their genome (C-value paradox).

The 3x10^9 base pair long double-helix in our gametes doesn’t come as a single molecule. Our DNA is subdivided over several molecules, each corresponding to one chromosome, just as encyclopedias come in different volumes. Hence, the human genome is subdivided over 23 chromosomes: 22 autosomes numbered by decreasing size, plus one of the sex chromosomes (X or Y). The zygote thus contains two sets of 23 chromosomes, one originating from the father and the other from the mother, and so will each one of the descendant diploid somatic cells.

The number of chromosomes is constant within a given species, but may vary extensively between species. Table 2 shows the number of chromosomes characterizing the genome of the most important domestic species. There is no correlation between the number of chromosomes and the length of the entire genome. All mammals, including human, have genomes of approximately the same size (∼3x10^9 bp), but chromosome numbers may vary widely, even between closely related species.

Chromosome means “colored body” in Greek. The name refers to the rods that cytogeneticists have been able to visualize under the microscope since the beginning of the 20th century when examining cell preparations. A fraction of the cells in the preparation (corresponding to the “mitotic index”) will be in mitosis when sampled for analysis. In these, the DNA is found in a highly condensed state, forming the rods seen under the microscope.
By applying specific staining procedures, cytogeneticists were able to elicit banding patterns that allowed them, in combination with the size of the chromosome and the position of a constriction called the centromere, to distinguish the different chromosomes. For some chromosomes, referred to as metacentric, the centromere has an approximately central position, and hence the short (p) and long (q) arms have comparable size. On other chromosomes, referred to as acrocentric, the centromere is close to one end of the chromosome and hence the p-arm is much smaller than the q-arm. By doing so, cytogeneticists were able to define the “karyotypes” of species or of individuals within species. Figure X shows the ordered karyotype of a man with Down syndrome, characterized by the trisomy (three copies instead of two) of chromosome 21 underlying the condition.

Chromosomal DNA is located in the nucleus complexed with histones

The majority of the cells in a tissue are in interphase rather than mitosis. At this stage, the chromosomal DNA is less condensed than in mitosis and resides within the nucleus. Remember that every one of our cells contains ∼2 m of DNA, crammed within a ∼20 μm diameter nucleus. How cells address this extraordinary organizational challenge remains a mystery. What is known is that the DNA in the nucleus is not naked but rather complexed with proteins, forming so-called chromatin. The nuclear DNA is wrapped around beads formed by histones. Each histone bead comprises eight histone proteins (2xH2A, 2xH2B, 2xH3 and 2xH4). Histones are relatively small proteins that show extreme evolutionary conservation: only two amino-acids out of 102 differ between the H4 protein of cow and pea. This indicates that nearly every one of the amino-acids is needed for the histone molecule to function properly. Approximately one fifth of the amino-acids are positively charged lysines or arginines, facilitating interaction with the negative charges of the phosphate groups in the stiles of the double-helix.
∼150 base pairs of DNA make ∼two turns around the histone core, jointly forming a nucleosome. Adjacent nucleosomes are connected by ∼50 bp of linker DNA.

This string of nucleosomes forms a tighter structure called the 30-nm fiber. A fifth class of histones, called H1, is essential for its formation. There certainly are higher levels of organization of the nuclear chromatin; however, these remain poorly understood. The 30-nm fiber is thought to form loops of several hundreds of kilobases that are anchored onto the nuclear scaffold at their base. Each of these loops may correspond to an independently regulated transcriptional unit.

Nucleosomes can be seen as little hairy balls, as the amino-terminal tails of the nucleosomal histones protrude from the surface. Specific enzymes catalyze a myriad of covalent modifications of these tails, including phosphorylation of serines, methylation, acetylation and ubiquitylation of lysines, isomerization of prolines, etc. By locally altering the conformation of the histone tails, these modifications (sometimes referred to as the “histone code”) have a profound effect on chromatin structure and functionality.

In addition to the nucleolus, examination of the nucleus under the electron microscope typically reveals lighter material in the center, referred to as euchromatin, and darker material mostly at the periphery, referred to as heterochromatin. Heterochromatin is thought to correspond to highly condensed, transcriptionally inactive parts of the genome, while the active genes are thought to be confined to the less condensed euchromatic zones.

In preparation for cell division cells duplicate their DNA during the S phase of the cell cycle

As mentioned before, the diploid zygote inherits one paternal and one maternal genome. The entire body will then develop by sequential binary cell division and differentiation. Before dividing, the mother cell first duplicates its genome by a process of DNA replication, which defines the S(ynthesis) phase of the cell cycle.
After a G(ap)2 phase, the duplicated genome is then partitioned between the two daughter cells during the M(itosis) and cytokinesis phases. The daughter cells then resume the cell cycle with the G1 phase. Cells may sometimes exit the cell cycle and enter quiescence, or G0.

Progression through the cell cycle is a tightly controlled process that is orchestrated by sequentially activated cyclin-dependent kinases (Cdks). As their name implies, Cdks only become active in the presence of cell-cycle stage specific cyclins (e.g. G1/S-, S- and M-cyclins).

The double-helical structure of DNA immediately suggested semi-conservative DNA replication to Crick and Watson: each old strand might serve as a template for the synthesis of a new strand, thereby generating two identical daughter helices, each comprising one old and one new polynucleotide chain. This hypothesis was rapidly proven correct.

Synthesis of new DNA occurs at the replication fork, i.e. the site at which the two old strands separate from each other, exposing single stranded chains that serve as templates for the synthesis of the new strands. DNA dependent DNA polymerases catalyze the addition of complementary nucleotides at the end of a new, growing polynucleotide chain. An apparently simple mechanism would be for one of the polynucleotide chains to grow in the 5’ to 3’ direction, while the other would grow in the 3’ to 5’ direction, both closely following the advance of the replication fork. However, all known DNA dependent DNA polymerases catalyze 5’ to 3’ growth only. Phosphodiester bonds are created between the 5’ α phosphate group of the entering nucleotide and the 3’ hydroxyl group at the end of the growing chain, releasing the β and γ phosphates as a pyrophosphate moiety. As a consequence, DNA replication proceeds in the same direction as replication fork advancement for one strand (the leading strand), but in the opposite direction for the other strand (the lagging strand).
On the lagging strand, DNA replication has to be continuously reinitiated to follow the advancement of the replication fork. DNA dependent DNA polymerases can extend growing polynucleotide chains if provided with a template and nucleotide precursors, but cannot initiate DNA replication in the absence of a primer. Only RNA polymerases have this ability. An RNA polymerase called primase fulfills this role on the lagging strand. The primase generates a complementary primer composed of RNA rather than DNA. The DNA polymerase then takes over to complete DNA replication, filling the single stranded space between the new and the previous RNA primer. Thus DNA replication on the lagging strand generates short fragments comprising an RNA primer with a DNA extension, known as Okazaki fragments. The RNA primer is subsequently removed thanks to the 5’ to 3’ exonuclease activity of specific DNA polymerases, and adjacent Okazaki fragments are joined by a DNA ligase.

DNA replication requires many additional enzymatic activities, including a helicase which separates the two strands of the double-helix at the replication fork, topoisomerases or gyrases which untangle the knots created ahead of the fork by the unwinding of the double-helix, and single stranded DNA binding proteins which stabilize the single stranded templates prior to synthesis. The enzymes that coordinately fulfill all these tasks assemble in a large complex known as the replisome.

Why would nature have selected such a complicated, asymmetric replication process involving a leading and a lagging strand? The answer probably lies in the need for proofreading to control replication errors. Once in a while, the DNA polymerase will introduce a non-complementary nucleotide. This will for instance happen if a nucleotide was selected while being in an unusual tautomeric state.
DNA polymerases participating in DNA replication are endowed with a proofreading 3’ to 5’ exonuclease activity that allows them to excise such erroneously introduced nucleotides. Proofreading is incompatible with 3’ to 5’ growth, as excision of the last nucleotide would eliminate the 5’ triphosphate extremity needed for further synthesis.

S phase starts with the activation of origins of DNA replication. At these poorly defined sites, the two strands are separated to form a replication bubble. Mammalian chromosomes encompass multiple origins of replication, of which variable numbers may be activated depending on the need to complete the cell cycle more or less rapidly. Each replication bubble is characterized by two replication forks. Initiation of DNA replication requires the intervention of an RNA polymerase on both the leading and the lagging strand. The cell has developed mechanisms to ensure that origins of replication are only fired once per cell cycle.

After DNA replication, CpG dinucleotides that were initially methylated on both strands become hemi-methylated, as the newly synthesized strand is not methylated. However, maintenance DNA methyltransferases (DNMT) recognize these hemi-methylated CpG’s and restore full methylation by adding a methyl group at position 5 of the cytosine ring in the newly synthesized strand. Thanks to this mechanism, the methylation state of CpG dinucleotides is faithfully transmitted “epigenetically” from mother to daughter cells.

Discontinuous replication on the lagging strand creates a completion problem at the extremities of the linear chromosomes. To avoid shortening of the chromosomes at each cell division, chromosomal extremities are endowed with specific structures called telomeres. Telomeres are composed of several hundred tandem repetitions of the GGGTTA sequence. This structure is recognized by an enzyme called telomerase, which has the ability to add telomeric repeats at the end of the chromosomes, thereby counteracting the chromosomal shortening due to replication. Telomerase is a peculiar enzyme that is composed of both a protein and an RNA subunit.
The RNA subunit is complementary to the telomeric repeats and, by hybridizing with them, creates a primer-template structure that allows extension of the telomeric leading strand by the reverse transcriptase activity of the telomerase. It has been postulated that ageing might be related to ceased telomerase activity, leading to progressive chromosomal shortening and degradation. Conversely, many cancers are characterized by telomerase induction, contributing to the immortality of cancer cells.

At the end of S phase, each chromosome has been duplicated. However, the two identical double-helices, referred to as “sister chromatids”, remain glued together along their entire length, embraced by large ring-like protein complexes called cohesins.

The S phase is also characterized by centrosome duplication. Centrosomes comprise a centriole pair embedded in pericentriolar matrix. The pair of centrioles separates, and daughter centrioles form at the base of each mother centriole.

Mitosis equitably distributes the replicated DNA amongst daughter cells

After DNA replication during S phase and a short G2 period, the cell engages in a finely tuned ballet aimed at distributing the duplicated genetic material equitably amongst the two daughter cells: the M-phase.

The M-phase starts with mitosis, typically subdivided in five phases: prophase, prometaphase, metaphase, anaphase and telophase. Prophase is characterized by the progressive condensation of the replicated chromosomes in the nucleus, reflecting the activation of condensin complexes, which resemble cohesins. Paired sister chromatids become visible. Meanwhile the duplicated centrosomes move apart and initiate the formation of the mitotic spindle.
Projected tubulin-based microtubules can stabilize in three ways: by interacting with the cell membrane (astral microtubules), by interacting with microtubules emanating from the opposite centrosome (interpolar microtubules), or by interacting with the kinetochores, large protein complexes located at the centromeres of each sister chromatid (kinetochore microtubules). Centromeres are characterized by long stretches of tandemly repeated sequences referred to as satellite DNA. The latter stabilization can only occur after the breakdown of the nuclear envelope, which liberates the chromosomes and marks prometaphase. Opposing forces applied on either side cause the sister chromatid pairs to align at the equator of the spindle, defining metaphase. At anaphase, sister chromatids are disconnected as the enzyme separase digests the cohesin complexes, and segregate towards opposite ends of the cell as a result of (i) the shortening of the kinetochore microtubules (anaphase A) and (ii) the separation of the spindle poles, driven by the extension and sliding of the interpolar microtubules and the shortening of the astral microtubules (anaphase B). Telophase is characterized by the progressive decondensation of the two sets of segregated chromatids and their sequestration within two newly assembled nuclear envelopes.

The M-phase is concluded by the actual formation of two daughter cells through the binary fission of the cytoplasm by a contractile ring of actin and myosin: cytokinesis.

Progression through M-phase is governed by the activation of M-Cdk, a cyclin-dependent kinase that drives the process by phosphorylating specific target proteins including condensin subunits, nuclear lamins, the anaphase-promoting complex, etc.

The outcome of the M-phase is two diploid daughter cells having exactly the same genetic material as their progenitor cell, i.e. two copies of each of the chromosomes characteristic of the species of interest.
The central dogma

All the instructions needed for proper functioning of the cell are encoded in its chromosomes. But what is the nature of the message and how is it read? Our present view is summarized by the central dogma: RNA copies of the genes are generated by transcription. These messenger molecules migrate to the cytoplasm where, by the process of translation, they guide the formation of proteins whose three-dimensional structure determines their primarily catalytic functions. By accelerating specific chemical reactions in the cell, these enzymes act as turnouts guiding cellular metabolism. The information thus flows from DNA to RNA to protein, the activity of the latter enabling manifestation of the phenotype. This basic underlying principle is thought to be shared by all living organisms on earth and was therefore dubbed “the central dogma”.

Note that erecting dogmas is not a welcome practice in science. Theories and models are mere ways to summarize the present knowledge and to guide the design of novel experiments aimed at revealing the shortcomings of that knowledge. True scientists strive towards changing and improving the present models. As a matter of fact, this central dogma has already had to be amended. Reverse transcriptases allow for an information flow from RNA to DNA, while some RNA molecules (ribozymes) are endowed with catalytic activity on their own, without being translated into protein.

The mammalian genome contains ∼20,000 mostly split protein coding genes

The complete sequence of the human genome was obtained in 2001. The next few years saw the completion of the genomes of mouse, chicken, dog, cow, horse and pig, complemented with shallower sequencing of the genomes of a growing list of other mammals. One of the major surprises was that the mammalian genome only contains ∼20,000 protein encoding genes. This number has to be compared with the ∼14,000 genes found in the genome of the fruit fly D. melanogaster, ∼19,000 in the genome of the nematode C. elegans, and ∼25,000 in the genome of the little plant A. thaliana.
Most scientists were predicting that the organismal complexity of mammals would require more genes than any other living organism, and textbooks from the late 1990s would typically cite 100,000 genes or more. If mammals are indeed more complex, they don’t derive this complexity from a higher number of genes. Along similar humbling lines, the human genome does not contain more genes than that of other mammals.

One of the unexpected features of the majority of eucaryotic genes, discovered in 1977, is their “split” nature: they are subdivided in exons that are separated by non-coding intervening introns. Introns are removed by the splicing process, which occurs after transcription (vide infra). The resulting messenger RNA (mRNA) comprises a 5’ untranslated region (5’UTR), the protein encoding open reading frame (ORF), and a 3’UTR.

When examining the genomes of procaryotes, unicellular eucaryotes or even that of D. melanogaster or C. elegans, it appears that the majority of the sequence space is devoted to the protein-encoding genes, hence emphasizing their fundamental contribution in guiding development of the body. When examining the genome of mammals, on the contrary, what is striking is how diluted the sequence space devoted to protein-coding capacity per se is: it only represents 1.5% of the genome. This reflects the fact that intergenic sequences have become much larger on average, combined with the fact that intron size has considerably increased. The average human gene spans 27 kb and encompasses 10.4 exons measuring 143 base pairs on average, separated from each other by introns of ∼3.5 kb on average.

The meaning of this inflation of mammalian genome size (and hence the dilution of the sequence space devoted to protein-encoding capacity) remains largely unknown. For some, it indicates that multicellular organisms, which are under less pressure to divide rapidly, have allowed their genome to be invaded by useless, “selfish” parasitic DNA (vide infra).
For others, the majority of non-coding DNA holds the regulatory secret to the organismal complexity characterizing higher vertebrates.

Why the majority of genes in metazoans are split remains a mystery. One hypothesis is that it facilitates the creation of novel genes by “exon shuffling”. Indeed, distant proteins share common protein domains and, at least in some instances, exon boundaries coincide with protein domain boundaries.

Genes may show similarities extending beyond the sharing of sub-domains. Hence, some genes are encountered in the genome in multiple, virtually identical redundant copies. This is typical for genes coding for proteins that are needed in quantities too large to be obtained from just one gene, such as the histone proteins. Some genes are not identical but nevertheless clearly similar over their entire length. Such genes are thought to derive by duplication from a common ancestor gene and constitute families of “paralogous” genes. Amongst the best-known gene families are the globins: our genome contains two clusters of α- and β-globin genes, on chromosomes 16 and 11 respectively, which result from the duplication of a common ancestor gene some ∼500 million years ago. Subsequent duplications of these α- and β-globin founder genes have generated two clusters, each comprising multiple genes. Within each cluster, genes with slightly distinct protein sequences are activated at different stages of development, providing the organism with haemoglobins that are optimally adapted for each developmental stage.

Gene expression is primarily regulated at the transcriptional level

As previously mentioned, all somatic cells contain two full copies of the genetic encyclopedia, yet they will only transcribe the 30-60% of genes which they need to function properly.

Gene transcription is performed by RNA polymerases. Assisted by auxiliary proteins (σ in E. coli and a number of “general transcription factors” in eucaryotes), the RNA polymerase will recognize the promoter sequence, which generally lies just upstream of the transcription initiation site. Promoter sequences are often defined by consensus sequences, of which some include a TATA-box. Recognition of the promoter sequence results in local denaturation of the double-helix and initiation of transcription. The polymerase synthesizes an RNA molecule in the 5’ to 3’ direction, complementary to the template strand. Complementarity rules are as for DNA, except that uracil replaces thymine. The auxiliary initiation factors are released, allowing the RNA polymerase to proceed with elongation. Newly synthesized RNA and template DNA form a double-helix over only a short stretch, the original DNA double-helix being rapidly reformed as transcription proceeds through the gene. In procaryotes transcription termination occurs at specific termination signals, while in eucaryotes the site of termination is somewhat stochastic.

Which strand of the DNA double-helix is used as the template strand depends on the gene: transcription will proceed from left to right for some genes, and in the opposite direction for other genes.

While in E. coli one RNA polymerase is responsible for the transcription of all genes, three distinct RNA polymerases exist in eucaryotes: RNA pol I (ribosomal RNAs), pol II (mRNAs and other RNAs) and pol III (small RNA species).

In eucaryotes, transcription proceeds at ∼20 nucleotides per second. The typical ∼27 kb gene thus requires approximately 20 minutes to be transcribed. For some of the larger genes, completing transcription may take tens of hours.

A gene may be simultaneously transcribed by several RNA polymerases if large amounts of gene product are needed.

The decision on whether or not to transcribe a gene in a given cell is not made by the RNA polymerase and its auxiliary initiation factors. Transcription is regulated by gene switches.

Components that make up gene switches were first identified in bacteria and phages.
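As a back-of-the-envelope check on the transcription rules and timings quoted in this section, the Python sketch below applies the RNA complementarity rules (uracil replacing thymine) and computes transcription times at ∼20 nucleotides per second. The function names, the six-base template and the 2 Mb “large gene” are hypothetical and used only for illustration.

```python
# RNA complementarity rules: as for DNA, except that U replaces T in the RNA.
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_5_to_3: str) -> str:
    """Return the RNA (written 5'->3') complementary to a DNA template strand.

    The RNA is anti-parallel to its template, so the template is read
    backwards (3'->5') while the RNA grows 5'->3'.
    """
    return "".join(RNA_COMPLEMENT[base] for base in reversed(template_5_to_3))


def transcription_time_minutes(gene_length_nt: float,
                               rate_nt_per_s: float = 20.0) -> float:
    """Time needed for one polymerase to traverse a gene, in minutes."""
    return gene_length_nt / rate_nt_per_s / 60.0


rna = transcribe("TACGAT")                              # toy template
typical_min = transcription_time_minutes(27_000)         # typical ~27 kb gene
large_hours = transcription_time_minutes(2_000_000) / 60  # assumed 2 Mb gene

print(rna, typical_min, round(large_hours, 1))
```

With these figures the typical 27 kb gene takes 22.5 minutes, consistent with the “approximately 20 minutes” stated above, and a gene of a few megabases takes on the order of a day.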
Gene switches comprise cis- and trans-acting elements. The cis-acting elements are short segments of the double-helix in the vicinity of the gene they regulate. They are called “operators” in procaryotes. The corresponding nucleotide sequence defines a unique surface in the major groove that can be specifically recognized by the trans-acting component of the switch: gene regulatory proteins with matching DNA reading domains. Regulatory proteins are classified according to their type of DNA reading domain: helix-turn-helix, zinc finger motifs, leucine zippers, helix-loop-helix, etc. Bound to the operator, some regulatory proteins will act as repressors (precluding access of the basal transcriptional machinery to the promoter by, for instance, steric hindrance), and others as activators of transcription (facilitating the access of the basal transcriptional machinery to the promoter). Through regulatory proteins, cells have the ability to adapt to changing environmental conditions: on binding of a ligand, regulatory proteins undergo an allosteric transition which will either allow them to bind or preclude them from binding to the operator. The combination of target gene, operator, regulatory proteins and ligand jointly composes an operon.

One of the best understood operons is the lactose (Lac) operon. The Lac operon encodes proteins that are required to transport lactose into the bacterial cell and then catabolize it (i.e. permease and β-galactosidase). The Lac operon is under both negative and positive transcriptional control. In the absence of lactose in the medium, the Lac repressor binds to the operator, thereby preventing transcription. However, when lactose is present in the medium it acts as an inducer by binding to the Lac repressor, thereby precluding it from binding to the operator sequence. However, if glucose (the preferred carbon source) is also present in the medium, transcription will not proceed.
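The dual control of the Lac operon can be summarized as a small truth table. The Python sketch below is a deliberately simplified, all-or-nothing model of the negative control described above combined with the glucose-dependent positive control (acting through cAMP and CAP) that the text goes on to detail; the function and variable names are illustrative, and real regulation is graded rather than boolean.

```python
def lac_operon_transcribed(lactose: bool, glucose: bool) -> bool:
    """Boolean sketch of Lac operon control (simplified, all-or-nothing)."""
    # Negative control: without lactose, the repressor sits on the operator.
    repressor_on_operator = not lactose
    # Positive control: cAMP is high only when glucose is scarce,
    # and the activator CAP binds its cis-element only when loaded with cAMP.
    camp_high = not glucose
    cap_bound = camp_high
    # Productive transcription needs the repressor off AND the activator on.
    return (not repressor_on_operator) and cap_bound


# Print the full truth table for the two inputs.
for lactose in (False, True):
    for glucose in (False, True):
        state = "ON" if lac_operon_transcribed(lactose, glucose) else "off"
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {state}")
```

Only one of the four combinations, lactose present and glucose absent, switches the operon on.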
Indeed, productive transcription not only requires release of the Lac repressor from the operator, but also binding of the catabolite activator protein (CAP) to a distinct cis-element lying just upstream of the transcription start site. If glucose is abundant in the medium, intracellular concentrations of cyclic AMP (cAMP) will be low. In the absence of cAMP, CAP can’t bind to its cognate cis-element, hence preventing transcription of the Lac operon. If glucose levels drop, however, cAMP levels increase. Binding of cAMP to CAP allows the latter to bind to its cis-element and to fulfill its activator role, hence promoting transcription. To transcribe its Lac operon, E. coli thus requires lactose without glucose in the medium.

[BOX 1: Genetic definition of the Lac operon components]

Transcriptional regulation in eucaryotes shares several of the basic features uncovered in procaryotes. There are, however, several specificities:

(i) Cis-acting elements (called enhancers when activating transcription and silencers when repressing transcription) may be quite distant from the promoter sequences which they control, requiring looping of the intervening DNA to allow interaction. Cis-acting elements may be located upstream as well as downstream of the transcription start site. Insulator sequences restrict the effect of gene switches to specific domains.

(ii) Interaction between regulatory proteins and the basal transcriptional machinery is often indirect, requiring an intervening “Mediator” protein.

(iii) Transcription of eucaryotic genes is usually controlled by multiple regulatory proteins, allowing customized expression levels in different cell types according to needs, as well as combinatorial gene control. Regulatory proteins typically control many genes.

(iv) The activity of regulatory proteins can be modulated in many ways, including protein synthesis, ligand binding, covalent modifications, addition of subunits, unmasking, stimulation of nuclear entry or release from the membrane.
(v) Local chromatin state is an essential component of eucaryotic transcriptional gene regulation. By recruiting specialized proteins, gene regulatory proteins influence chromatin structure.

Alternative splicing increases coding capacity

Translation of procaryotic mRNAs begins while transcription is still in progress. In eucaryotes, on the contrary, the nascent pre-mRNA undergoes a series of processing steps prior to export of the mature mRNA to the cytoplasm, where it will be translated.

Maturation of eucaryotic mRNAs involves the addition of a “cap” structure at the 5’ end, excision of the intronic sequences and joining of the exons (splicing), and endonucleolytic trimming of the 3’ end followed by addition of a poly-A tail (polyadenylation).

The cap structure corresponds to a GMP that is added in reversed orientation to the 5’ end of the mRNA via an unusual 5’-to-5’ triphosphate bridge. In addition, the guanine is methylated at position 7.

Splicing of the introns is initiated by the attack of the 5’ “donor” splice site by an adenine located ∼35 bases upstream of the 3’ “acceptor” splice site. This attack severs the phosphodiester link between the upstream exon and the intron and binds the 5’ end of the intron to the 2’ position of the attacking adenine, thereby creating a loop in the intron. The unmasked 3’-OH group of the upstream exon subsequently attacks the acceptor splice junction, thereby releasing the intron as a lariat while joining the exons. Splicing is performed by a complex machinery known as the spliceosome. Small nuclear ribonucleoproteins (snRNPs) form the core of the spliceosome. As their name implies, snRNPs comprise small nuclear RNA (snRNA) components which, by virtue of base pair complementarity, recognize the splice donor, branching and splice acceptor sites, each characterized by short consensus sequences.

Contrary to what happens in procaryotes, transcription termination doesn’t occur at very specific sites.
Nevertheless, the 3’ ends of mRNAs are neatly specified. Indeed, consensus nucleotide sequences (including a canonical AAUAAA 10-30 nucleotides upstream of the polyadenylation site) are recognized by two multisubunit proteins called cleavage stimulation factor (CstF) and cleavage and polyadenylation specificity factor (CPSF). These bind to the transcribed RNA, assisting other proteins in (i) cleaving the nascent transcript, and (ii) adding ∼200 A residues (the poly-A tail) at the 3’ end produced by cleavage.

These three processing reactions (capping, splicing and polyadenylation) occur cotranscriptionally: the enzymes performing these reactions are tethered to the growing RNA chain by binding to the trailing carboxy-terminal domain of RNA pol II.

We have seen that the evolutionary significance of the split gene structure might be to promote exon shuffling. There is also an immediate benefit of splicing to the organism. Indeed, a given gene may undergo more than one type of splicing reaction, resulting in mature mRNAs with distinct exon selections, thereby encoding proteins differing in their amino-acid sequence and hence functionality. This “alternative” splicing is particularly common in mammals, where an estimated 75% of the genes are subject to alternative splicing, generating ∼3 distinct mRNAs on average. This phenomenon considerably increases the coding capacity of the ∼20,000 mammalian genes, hence perhaps compensating for their lower than expected number.

Translation exploits a written language, the genetic code

The nucleotide sequence of a mature mRNA dictates the amino-acid sequence of the encoded protein, in a process referred to as translation. Because there are only four distinct nucleotides while there are 20 different amino-acids, the mRNA needs to be read in words of three successive nucleotides. Indeed, words of two nucleotides (doublets) could only code for 4x4=16 different amino-acids (i.e. fewer than needed), while words of three nucleotides (triplets) have the capacity to code for 4x4x4=64 distinct amino-acids (i.e. more than actually needed).
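
The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the function names (count_words, enumerate_words, minimal_word_length) are introduced here for the example and do not come from the text. It enumerates all words of a given length over the four RNA nucleotides and looks for the shortest word length able to encode 20 amino-acids.

```python
from itertools import product

NUCLEOTIDES = "ACGU"   # the four ribonucleotides found in mRNA
N_AMINO_ACIDS = 20     # number of amino-acids to be encoded


def count_words(length):
    """Number of distinct nucleotide words of the given length (4**length)."""
    return len(NUCLEOTIDES) ** length


def enumerate_words(length):
    """Explicitly list every nucleotide word of the given length."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=length)]


def minimal_word_length(n_amino_acids=N_AMINO_ACIDS):
    """Shortest word length offering at least one word per amino-acid."""
    length = 1
    while count_words(length) < n_amino_acids:
        length += 1
    return length


doublets = enumerate_words(2)
triplets = enumerate_words(3)
print(len(doublets), len(triplets), minimal_word_length())
```

Run as written, this prints 16 64 3: doublets fall short of the 20 amino-acids, while triplets are the shortest words that suffice, leaving 44 words to spare.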
The correspondence between specific triplets (or codons) and the amino-acids they encode is known as the genetic code. One of the remarkable features of the genetic code is its universality: it is virtually identical for all living organisms on earth, an additional testimony of our shared ancestry. All 64 codons are utilized in the genetic code: 61 are used to code for amino-acids, while three correspond to stop codons which cause termination of translation. As there are more amino-acid encoding codons than amino-acids, the genetic code is said to be redundant: the number of codons per amino-acid ranges from one to six. Synonymous codons typically differ at the 3’ end.

[Box 2: …]

How is the sequence of codons translated into a string of the corresponding amino-acids? The amino-acids themselves can’t “read” their cognate triplet. Translation requires the successive action of two adaptor systems.

The first adaptors are small RNA molecules called tRNAs, for transfer RNAs. In eucaryotes, tRNA genes are transcribed by RNA polymerase III. The ensuing transcripts fold locally into L-shaped structures stabilized by intramolecular base pairing. These ∼80 nt structures are trimmed from their larger precursor molecules and undergo a series of modifications, including intron splicing and the chemical modification of ∼10% of their residues. As a result, tRNAs are characterized by nucleotides such as inosine (I), pseudouridine (ψ), dihydrouridine (D), ribothymidine (T), etc. in addition to the usual A, C, G and U residues. The D and T residues give their name to the D and T loops characterizing the so-called clover leaf representation of tRNA molecules. The remaining central leaf corresponds to the anticodon loop, characterized by three nucleotides which are complementary to the codons that the corresponding tRNA will recognize in the mRNA to be translated.
Anticodons may sometimes recognize more than one codon differing at the 3’ end or wobble position. Hence, a G (respectively A) at the 5’ end of the anticodon will base-pair with a C (respectively U) at the 3’ end of the codon but may also engage in wobble base-pairing with a U (respectively C) at that position. An I at the 5’ end of the anticodon will base-pair with either a C, U or even A at the 3’ end of the codon. Wobble base-pairing thus accounts for part of the redundancy of the genetic code, but not all of it. Several tRNAs are often needed to recognize the different synonymous codons. The stem of the clover leaf is characterized by a 3’ overhang of three nucleotides.

The second set of adaptors, the amino-acyl tRNA synthetases, attach specific amino-acids to the 3’OH moiety of the last nucleotide. In general, there is one amino-acyl tRNA synthetase per amino-acid. It has the ability to specifically recognize its cognate amino-acid as well as the different matching tRNAs. These are recognized by their specific three-dimensional structure, defined in part by the specificity of the residue modifications.

The loaded tRNA allows matching of a specific codon with the cognate amino-acid. However, linking the successive amino-acids into a protein chain requires the additional action of the ribosomes. Ribosomes are composed of a small and a large subunit. In eucaryotes, the small 40S subunit is made up of one rRNA molecule (18S) and ∼33 proteins, while the large 60S subunit is made up of three rRNA molecules (28S, 5.8S and 5S) and ∼49 proteins. Ribosome subunits are assembled in the nucleolus. During translation, the small and large subunits clamp the mRNA, exposing the successive codons at the bottom of three pockets referred to as the A (aminoacyl), P (peptidyl) and E (exit) sites.
Translation proceeds in three-step cycles. In a first step, tRNAs matching the corresponding codons occupy the P and A sites. The rRNA molecules of the large subunit, acting as ribozymes, catalyze a peptidyl transferase reaction that transfers the amino-acid (or peptide) attached to the 3’OH end of the tRNA in the P site to the amino group of the amino-acid on the tRNA in the A site, in a so-called head growth type of polymerization. Peptides thus grow at their carboxy-terminal end. In a second step, the ribosome moves one codon down the mRNA. This movement is enabled by the mechanical work accomplished by two GTPase-type molecular machines: elongation factors EF1 and EF2 in eucaryotes. This moves the now unloaded tRNA from the P to the E site, forcing its exit, while exposing an empty A site that allows entry of a new loaded tRNA: step 3.

Each mRNA can be translated in three possible reading frames, of which only one is (usually) correct. Selection of the correct reading frame occurs during translational initiation, which is accomplished by a small ribosomal subunit that is preloaded with a specific initiator tRNA always carrying a methionine. In eucaryotes, this complex recognizes the capped 5’ end of the mRNA and scans the 5’UTR until it finds the first AUG codon. Dissociation of initiation factors then allows assembly with a large subunit and initiation of the translation steps. In procaryotes, recognition of the AUG start codon does not require 5’ to 3’ scanning of the 5’UTR, but is mediated by direct base-pairing of 16S rRNA to a complementary sequence (the Shine-Dalgarno sequence) just upstream of the start codon. This difference accounts for the monocistronic nature of most eucaryotic mRNAs versus the polycistronic character of many procaryotic mRNAs.
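
The eucaryotic rule described above (scan from the 5’ end to the first AUG, then read successive triplets in that frame until a stop codon) can be sketched in Python. This is a deliberately simplified illustration: it uses the standard genetic code, ignores initiation factors and the sequence context of the AUG, and the function names and the example mRNA are hypothetical, chosen only for this sketch.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G ordering;
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_first_aug(mrna):
    """Index of the first AUG met when scanning 5' to 3', or -1 if none."""
    return mrna.find("AUG")


def translate_from(mrna, start):
    """Read triplets from `start` until a stop codon or the end of the mRNA."""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":   # stop codon: translation terminates
            break
        peptide.append(amino_acid)
    return "".join(peptide)


def translate_by_scanning(mrna):
    """Simplified eucaryotic initiation: start at the first AUG, if any."""
    start = find_first_aug(mrna)
    return translate_from(mrna, start) if start >= 0 else ""


# Hypothetical mRNA: 5'UTR, then AUG GCU UUC UGA, then a short 3'UTR.
EXAMPLE_MRNA = "GGCACC" + "AUGGCUUUCUGA" + "AAUAAA"

print("first AUG at index:", find_first_aug(EXAMPLE_MRNA))
print("protein:", translate_by_scanning(EXAMPLE_MRNA))
# The three possible reading frames give three different readings.
for frame in range(3):
    print("frame", frame, "->", translate_from(EXAMPLE_MRNA, frame))
```

For the example sequence, the first AUG sits at index 6 and the encoded peptide is MAF (Met-Ala-Phe). Reading the same mRNA from positions 0, 1 and 2 illustrates why selecting the reading frame at initiation matters.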
When one of the three stop codons is exposed at the bottom of the A site, a protein known as translation release factor (eRF1), whose three-dimensional structure mimics that of a tRNA, enters the ribosome and causes the peptidyl transferase activity to add a water molecule, thereby releasing the synthesized protein and causing dissociation and release of the ribosome.

An mRNA can be translated simultaneously by multiple ribosomes in both pro- and eucaryotes, generating so-called polysomes (or polyribosomes).

The nascent proteins are progressively pushed through a narrow channel. On their emergence, protein domains progressively fold cotranslationally, assisted if necessary by different types of chaperone proteins.

The expanding world of non-coding RNAs

The central dogma focuses on the role of mRNAs, which are translated into proteins with the assistance of the non-coding rRNAs and tRNAs. In addition to these three major classes of RNAs, the cell transcribes a multitude of other types of non-coding RNA molecules. These include the small nuclear RNAs (snRNAs) that are part of the small nuclear ribonucleoproteins (snRNPs) which make up the spliceosome, and the small nucleolar RNAs (snoRNAs) that participate in the posttranscriptional modification of rRNAs.

In recent years, a multitude of novel non-coding RNA genes have been discovered. Some of these are referred to as long non-coding RNA genes. They are typically transcribed by RNA pol II, are usually spliced, are often highly conserved in evolution, yet are not characterized by an obvious open reading frame. Long non-coding RNA genes include the telomerase RNA gene coding for the RNA component of telomerase, the XIST gene playing a central role in the inactivation of one of the X chromosomes in females, and the H19 and AIR imprinted genes. However, the function of most of the long non-coding RNA genes remains poorly understood.

The cell also contains a multitude of small non-coding RNA genes. Micro RNAs (miRNAs) are amongst the best understood small non-coding RNA genes.
Most miRNAs are processed in the nucleus as small hairpin loops (called pre-miRNAs) cropped from longer RNA pol II transcripts (including pre-mRNAs) by an enzyme called Drosha. The ensuing pre-miRNAs are actively exported to the cytoplasm, where they are further processed by Dicer to generate a short (∼21 nt) double-stranded RNA characterized by two-residue 3’ overhangs. One of the two strands is preferentially integrated in an RNA-induced silencing complex (RISC), which it guides by base-pair complementarity to the 3’UTR of specific mRNA targets to be down-regulated. The down-regulation of the targets involves both target degradation and translational inhibition. The mammalian genome contains an estimated ∼1,000 miRNAs, each regulating an estimated ∼200-300 target genes. The majority of mammalian genes (except housekeeping genes) are subject to miRNA-mediated regulation.

Novel genomic methods (Chapter 4) now allow for a more detailed survey of transcriptional activity across the genome. The picture that emerges is one of “pervasive” transcription that is much more elaborate than expected. Rather than being confined to genes, with clear transcription initiation and termination sites, transcriptional activity is now detected pretty much throughout the entire genome, including intergenic regions. Within genic regions, a panoply of transcripts is detected, usually from both strands and often spanning multiple adjacent genes. The function, if any, of these multiple and diverse transcripts is a subject of intense research.

Half of the mammalian genome is composed of interspersed transposons & pseudogenes

One of the most striking findings emerging from the sequencing of mammalian genomes is that close to 50% of the sequence space is occupied by interspersed repetitive sequences.

Approximately half of these are “autonomous” transposable elements falling in three major categories: DNA transposons, retrovirus-like LTR retrotransposons, and non-LTR retrotransposons or LINEs (Long INterspersed Elements).
They are referred to as autonomous because they typically contain the genes coding for the enzymes needed for transposition. DNA transposons move via a non-replicative cut-and-paste mechanism involving only DNA. Retrotransposons move via an RNA intermediate that is reverse transcribed into DNA prior to reintegration in the genome. The retrotransposition mechanism of LTR elements is very similar to that of their close cousins the retroviruses, and distinct from that of LINEs. Table x summarizes the number of genomic copies as well as the genomic space occupied by each of these elements.

In addition to these “autonomous” interspersed repeats, the genome contains supposedly non-functional copies of many genes, referred to as pseudogenes. Pseudogenes can be (i) unprocessed, including promoter sequences, exons and introns, and resulting from intragenic duplication events, or (ii) processed, resulting from reverse transcription of a “primary” transcript (by reverse transcriptases encoded by autonomous retroposons) and ectopic reintegration in the genome. Given their origin, processed pseudogenes are typically devoid of promoter sequences and introns and may be characterized by a poly-A tail. Most genes have no or only a very small number of pseudogenes in the genome. Some genes, however, have spawned very large numbers of processed pseudogenes. This is particularly the case for small genes such as the 7SL-RNA or tRNA genes. One of the features of RNA pol III genes is that the promoter sequences are intragenic and hence carried along during the retrotransposition process. As a consequence, the processed pseudogenes can be transcriptionally competent, hence capable of generating pseudogenes themselves. Moreover, some of these pseudogenes have acquired the 3’ end of LINEs (by integrating into a LINE), thereby becoming preferred substrates for the reverse transcriptases.
These very abundant pseudogene families are referred to as SINEs (Short INterspersed Elements), and may represent as much as x% of the genome (Table x).

The evolutionary significance of the large proportion of the vertebrate genome occupied by interspersed repeats remains somewhat of a mystery. On the one hand, interspersed repeats are considered as parasitic, selfish DNA. Supporting this view is the fact that a few species (e.g. the pufferfish) seem to have largely escaped transposon invasion while seeming perfectly adapted to their environment. It is also increasingly apparent that eucaryotes have deployed a range of “immune” strategies to control the proliferation of transposons. There are footprints of past periods of transposon epidemics during which specific families of transposable elements transiently escaped non-adaptive surveillance mechanisms and spread across the genome. Some extant species exhibit signs of considerable transposon activity and, in these, insertional inactivation of genes by transposon integration contributes significantly to the mutational load. On the other hand, there are clear examples where transposon sequences have been co-opted in what are now important cellular functions, pointing towards at least some degree of symbiosis between transposons and their hosts.

DNA repair (?)

Mitochondrial DNA

A description of our genome would be incomplete without mentioning the mitochondrial genome. Indeed, mitochondria have their own genome. In mammals, the mitochondrial genome is a small circular molecule of ∼16.5 kb. However, as every mitochondrion may contain 5 to 10 such molecules and as a single cell may contain hundreds of mitochondria, the mitochondrial DNA may account for as much as 1% of the cellular DNA. The mitochondrial genome comprises 13 protein-encoding genes (coding for components of the electron transport chain and ATP synthase), 22 tRNA genes, 2 rRNA genes and the D-loop encompassing the bidirectional origin of replication and transcription.
Mitochondria have their own translational machinery (including ribosomes) operating in the mitochondrial matrix.

The organization and operation of the mitochondrial genome have many features in common with those of procaryotes. Indeed, mitochondria are thought to be the remains of aerobic bacteria that were engulfed by an ancestral anaerobic eucaryotic cell, initiating a symbiotic relationship. This is thought to have happened 1.5 billion years ago when, as a result of photosynthetic bacterial activity, oxygen started to accumulate in the atmosphere. Over time, most of the mitochondrial genes were translocated to the nucleus, requiring the concomitant development of mechanisms to transport the corresponding proteins from the cytoplasm (where they are synthesized) into the mitochondria. The small number of genes remaining in the mitochondrial genome may be trapped there because mitochondrial genetic codes evolved into a series of slightly different dialects. In mammals, five codons have seen their meaning drift away from the universal code, while the ORFs of the mitochondrial protein-coding genes coevolved, thereby hampering their translocation to the nucleus.
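
The estimate given earlier in this section, that mitochondrial DNA may account for as much as 1% of cellular DNA, can be checked with back-of-the-envelope arithmetic. The sketch below uses the figures quoted in the chapter (a ∼16.5 kb mitochondrial genome, 5 to 10 copies per mitochondrion, hundreds of mitochondria per cell, a haploid nuclear genome of ∼3x10^9 base pairs). It additionally assumes a diploid somatic cell; the specific mitochondria counts tried are illustrative choices, not values from the text.

```python
MTDNA_SIZE_BP = 16_500                      # ~16.5 kb mitochondrial genome
HAPLOID_NUCLEAR_GENOME_BP = 3_000_000_000   # ~3x10^9 bp haploid genome
PLOIDY = 2                                  # assumption: diploid somatic cell


def mtdna_fraction(n_mitochondria, copies_per_mitochondrion):
    """Fraction of total cellular DNA (in base pairs) that is mitochondrial."""
    mito_bp = n_mitochondria * copies_per_mitochondrion * MTDNA_SIZE_BP
    nuclear_bp = PLOIDY * HAPLOID_NUCLEAR_GENOME_BP
    return mito_bp / (nuclear_bp + mito_bp)


# Illustrative scenarios spanning "hundreds of mitochondria" and 5-10 copies.
for n_mito, copies in [(100, 5), (300, 10), (500, 10)]:
    fraction = mtdna_fraction(n_mito, copies)
    print(f"{n_mito} mitochondria x {copies} copies: {fraction:.2%}")
```

Under these assumptions the mitochondrial share ranges from roughly 0.1% to a little over 1% of cellular DNA, in line with the order of magnitude quoted in the text.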