Essential human genomics
Michel Georges

Table of Contents

1. The reference human genome
   a. Basic anatomy of the human genome
      i. Genome size and composition
      ii. Chromosomes: number, autosomes, sex chromosomes, bands
      iii. Coding and non-coding genes: numbers, types, structure
      iv. Repetitive sequences
      v. Epigenetics
   b. Obtaining a reference sequence of the human genome
      i. From Sanger to Next Generation Sequencing
      ii. Shotgun sequencing of mate-pair libraries
      iii. Linkage maps, radiation hybrid maps and BAC contigs
   c. Annotating the reference sequence
      i. Masking repetitive sequences
      ii. Establishing a gene catalogue
         1. Identifying cis-acting regulatory elements: evolutionary constraints, ChIP-Seq, DNase I hypersensitivity
      iii. 3-D structure
      iv. Interactome
      v. Genome browsers
2. Individual human genomes
   a. Genetic variants: SNPs, indels, CNVs and other structural variants
      i. Types of genetic variants
      ii. Minor allele frequency, nucleotide diversity and number of polymorphic sites
      iii. Germ-line mutations, drift and natural selection
   b. Genomic reconstruction of our evolutionary history
      i. The African Eve
      ii. Admixture with archaic hominins: Neanderthals and Denisovans
   c. The HapMap and 1,000 Genomes projects
      i. Linkage disequilibrium
      ii. Haplotype blocks and recombination hotspots
      iii. The HapMap project
      iv. The 1,000 Genomes project
3. The morbid human genome
   a. Neutral versus functional variants; germline versus somatic mutations
   b. Monogenic diseases
   c. Polygenic or common complex diseases
   d. Somatic versus germline mutations: cancer

1. The Reference Human Genome

A. BASIC ANATOMY OF THE HUMAN GENOME

Genome size and composition. The haploid human genome comprises ∼3×10^9 base pairs. This amounts to approximately 2 m of DNA in every one of the ∼10^13 diploid cells that constitute our body. The human genome is comparable in size to that of all other mammals. It is approximately ten times larger than the genome of the multicellular worm C. elegans and the fly D.
melanogaster, 100 times larger than that of the unicellular eukaryotic yeast S. cerevisiae, and 1,000 times larger than that of the unicellular prokaryote E. coli.

The human genome comprises approximately 40% G-C base pairs and 60% A-T base pairs. This reflects a steady-state equilibrium between the preferential loss of G-C pairs by mutation and the preferential gain of G-C pairs by meiotic recombination. Local fluctuations around this mean depart very significantly from expectations at different scales.

Base-pair composition is characterized by a strong di- and even tri-nucleotide dependence. The 5’-CpG-3’ dinucleotide, in particular, is strongly underrepresented across the genome, except within so-called CpG islands, which mark the 5’ end of a high proportion of genes.

Chromosomes. The haploid human genome is subdivided over 22 linear autosomes, the X and Y linear sex chromosomes, and the small (∼16.5 Kb) circular mitochondrial genome. As most cells contain many mitochondria, the DNA from this organelle may account for ∼1% of cellular DNA despite its small size. The X and Y chromosomes comprise a large X- and Y-specific portion, respectively, as well as two small common segments on either end that align and recombine during male meiosis, known as pseudo-autosomal regions 1 and 2. The fluctuations in G-C content generate chromosomal bands upon staining, which were used by cytogeneticists for chromosome identification.

Coding and non-coding genes. The human genome comprises on the order of 21,000 protein-encoding genes. This is considerably lower than anticipated and possibly less than the gene content of the nematode C. elegans and the plant A. thaliana. The vast majority of human genes are split, meaning that they are composed of several usually small exons (∼150 bp) separated by usually large (∼3,000 bp) introns. This allows for the generation of multiple protein isoforms per gene by alternative splicing, which appears to be used by nearly all human genes. Thus, despite the relatively small number of protein-coding genes in the human genome, they have the capacity to encode a very complex proteome.
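The base-composition statistics discussed above (G-C content and the depletion of the CpG dinucleotide) can be illustrated with a minimal sketch. The function names and the toy sequences are hypothetical. The observed/expected CpG ratio used here, (number of CpG × sequence length) / (number of C × number of G), is one common convention and is not taken from these notes.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).

    A value near 1 means CpG occurs as often as expected from the
    C and G content alone; a value well below 1 indicates depletion.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count sees every occurrence.
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    for name, s in [("CpG-rich toy", "CGCGCGCG"), ("CpG-poor toy", "CCCCGGGG")]:
        print(name, round(gc_fraction(s), 2), round(cpg_obs_exp(s), 2))
```

Applied in sliding windows along a real sequence, the same two numbers would be expected to show a ratio well below 1 over most of the genome and closer to 1 inside CpG islands.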
The gene catalogue has expanded by gene duplication and specialization. Genes that descend from a common ancestral gene by duplication are said to be homologous genes and, more specifically, paralogues. The human thyroglobulin gene and its counterpart in another species are also homologous genes, yet they were multiplied by the process of speciation; such homologous genes are said to be orthologues.

In addition to protein-encoding genes, the human genome contains non-coding genes that generate RNA molecules as final products. These include the ribosomal RNA (rRNA) genes, the transfer RNA (tRNA) genes, and the small nuclear and nucleolar RNA genes, which have been known for some time to be essential for RNA maturation and translation. It has recently become apparent, however, that the human genome encodes a flurry of other non-coding RNA genes. Among the best understood are (i) the microRNA (miRNA) genes, encoding small RNA molecules that are involved in fine-tuning the expression level of nearly all messenger RNAs (mRNAs), and (ii) the “long intergenic non-coding” RNA (lincRNA) genes, which fulfill diverse yet still poorly defined functions. There are on the order of 1,000 miRNA genes and several thousand lincRNA genes in the human genome. The best understood lincRNA is probably XIST, which plays an essential role in the inactivation of one of the X chromosomes in women.

In addition to these functional genes, the genome is littered with large numbers of what appear to be non-functional gene copies, referred to as pseudogenes. Pseudogenes come in two types: processed and non-processed.
Processed pseudogenes have no promoter and no introns, but often a poly-A tail. They are in fact retro-transcribed cDNA copies of mRNA that have been integrated in the genome. The reverse transcriptases responsible for the retrotranscription derive either from retroviruses or, more likely, from endogenous retrotransposons (see hereafter). Non-processed pseudogenes have the structure of a usual gene. We suspect that they are not functional, as they are seldom evolutionarily conserved and often contain multiple highly disruptive mutations, including nonsense (stop codon) and frameshift mutations.

Repetitive sequences. A large proportion of the genome corresponds to sequences that are present in more than one copy per genome. These repetitive sequences include tandem repeats and interspersed repeats.

Tandem repeats include satellites, minisatellites and microsatellites. Satellites are often composed of thousands of tandem repetitions of motifs that can vary in length from ten to hundreds of base pairs. They are found at telomeres, at centromeres and in constitutive heterochromatin. Minisatellites are typically composed of tens to hundreds of tandem repetitions of motifs of 30-50 base pairs. There are thousands of them, and they are dispersed across the genome with an enrichment in subtelomeric regions. Microsatellites are composed of tens of tandem repetitions of short motifs ranging from one to <10 base pairs. There are tens of thousands of them, and they are dispersed throughout the entire genome. One of the common features of all types of satellite sequences is their very high degree of genetic polymorphism (see hereafter), explaining why mini- and microsatellites have been used extensively for DNA fingerprinting. Expansions of trinucleotide repeats, which may or may not overlap with coding sequences, are the cause of a number of genetic disorders including fragile X mental retardation, Huntington disease and myotonic dystrophy.
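To make the notion of a microsatellite concrete, the following minimal sketch scans a sequence for short motifs repeated in tandem. The function name, the default thresholds (motifs of up to 6 bp, at least 5 copies) and the toy sequences are assumptions chosen for illustration. Only perfect repeats are reported; dedicated tandem-repeat finders also tolerate imperfect copies.

```python
import re


def find_microsatellites(seq, max_motif=6, min_repeats=5):
    """Return (start, motif, copies) for each perfect tandem repeat.

    The regular expression captures a motif of 1..max_motif bases
    (shortest motif tried first) and requires it to be followed by at
    least min_repeats - 1 further identical copies.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


if __name__ == "__main__":
    toy = "TT" + "CAG" * 5 + "TT"  # a short CAG trinucleotide repeat
    print(find_microsatellites(toy))
```

Because the number of copies is what varies between individuals at such loci, the `copies` field is the quantity a fingerprinting assay would compare.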
Interspersed repeats are primarily composed of transposable elements, which are typically grouped in four categories: (i) SINEs, (ii) LINEs, (iii) LTR-elements and (iv) DNA transposons.

SINEs (Short Interspersed Elements) are processed pseudogenes from pol III-dependent genes. The most abundant SINEs in the human genome are the Alu sequences, which are derived from the 7SL RNA gene. Other SINEs are derived from tRNA genes. The expansion of the SINEs may be related to (i) the fact that pol III promoters reside within the gene, so that processed pseudogenes remain transcriptionally competent, and (ii) the fact that they are recognized as preferred substrates by reverse transcriptase, possibly as a result of an early integration event within a retrotransposon. The typical SINE is ∼100 to 300 bp long. The ∼1.5 million SINE elements account for ∼13% of the human genome.

LINEs (Long Interspersed Elements) are autonomous retrotransposons, meaning that they transpose via an RNA intermediate that is retrotranscribed to cDNA prior to reintegration in the genome. Full-length LINE copies are approximately 6-8 Kb long and contain open reading frames encoding the enzymes required for their transposition, including one with reverse transcriptase activity. The majority of LINE elements in the genome are truncated. The ∼850,000 LINE elements in the genome account for ∼21% of our genome.

LTR-elements are another type of autonomous retrotransposon. LTR stands for Long Terminal Repeats, and indeed LTR-elements share these structures with regular retroviruses, as well as their gene content, which typically comprises gag, pol and env genes. LTR-elements can be viewed as retroviruses that have forgone the extracellular phase of their life cycle. There are approximately 450,000 LTR-elements in our genome, which account for ∼8% of its space.

The final category of transposable elements are DNA transposons, which do not transpose via RNA intermediates as the other mobile elements do, but rather by means of a cut-and-paste mechanism. There are ∼300,000 such elements, amounting to ∼3% of our genome.
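As a quick check on these figures, the genome fractions quoted above for the four classes of transposable elements can be summed. This is a minimal sketch: the dictionary simply restates the approximate percentages from the text, the ∼3×10^9 bp haploid genome size is the value given earlier, and the function name is hypothetical.

```python
GENOME_BP = 3e9  # approximate haploid genome size used in these notes

# Approximate genome fractions quoted in the text for each class.
TE_FRACTIONS = {
    "SINE": 0.13,
    "LINE": 0.21,
    "LTR-element": 0.08,
    "DNA transposon": 0.03,
}


def summarize_repeats(fractions, genome_bp=GENOME_BP):
    """Return (bp occupied per class, total fraction of the genome)."""
    occupied = {name: frac * genome_bp for name, frac in fractions.items()}
    return occupied, sum(fractions.values())


if __name__ == "__main__":
    occupied, total = summarize_repeats(TE_FRACTIONS)
    for name, bp in occupied.items():
        print(f"{name:15s} ~{bp / 1e6:6.0f} Mb")
    print(f"all four classes: ~{total:.0%} of the genome")
```

Under these rounded inputs the four classes together occupy a little under half of the genome.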
Whether interspersed repeats are purely selfish remains much debated. The fact that organisms such as the pufferfish among animals, or A. thaliana among plants, function very well with very few if any interspersed repeats supports their predominantly parasitic nature. The observation that distinct families of interspersed repeats have been active at different stages during evolution, to be finally silenced by their host, also supports a mainly parasitic nature. However, there is a growing list of examples of “exaptation” or “domestication” of transposable elements. There are several examples of essential genes in our genome that clearly derive from ancestral transposable elements, as well as hundreds of thousands of examples of evolutionarily constrained elements deriving from mobile elements.

In addition, our genome encompasses large segments (∼100 Kb-1 Mb) that are present in the genome in more than one, yet a small number of, copies. These are referred to as low copy repeats or segmental duplications. The different copies can either reside close to each other (for instance as tandem copies) or on different chromosomes. Segmental duplications often coincide with Copy Number Variants or CNVs (see hereafter).

Epigenetics. Epigenetics is a very popular, all-encompassing term that deserves some clarification. It initially referred to everything that was needed “on top” of the DNA to allow proper execution of the genetic program. It is in this context that Conrad Waddington introduced the notion of the “epigenetic landscape”.

More recently, the term epigenetics has mainly been used to refer to vehicles of transmissible phenotypic variation other than variation in the DNA sequence. Why do daughter cells of a hepatocyte mother cell behave as hepatocytes, while daughter cells of a thyrocyte mother cell behave as thyrocytes, given that, in a given individual, both cell types harbor exactly the same genome? The reason that is presently most commonly invoked is that the genome is not inherited “naked” by daughter cells, but that it comes with a set of tissue-specific “epigenetic marks”.
The most studied epigenetic marks in animals are DNA methylation and post-translational modifications of the amino-terminal tails of the nucleosomal histones (H2A, H2B, H3 and H4).

Cytosines in the 5’-CpG-3’ dinucleotide can be methylated at position 5 by DNA methyltransferases. The most abundant DNA methyltransferase (DNMT1) is said to be a “maintenance methylase”, as it will methylate the C of a CpG dinucleotide only if the complementary CpG (on the other strand) is already methylated. DNMT1’s main role thus appears to be the faithful maintenance of the DNA methylation status after DNA replication. The mode of action of DNMT1 explains how the DNA methylation status of a given region in the genome can be faithfully transmitted “epigenetically” from mother to daughter cells. As DNA methylation is often correlated with gene silencing, the same genes are active in mother and daughter cells. As DNA methylation patterns (i.e. which parts of the genome are methylated or not) are tissue-specific, DNA methylation contributes to the epigenetic maintenance of tissue-specific gene expression patterns. But how do hepatocytes and thyrocytes (and, for that matter, all other tissue types) acquire tissue-specific DNA methylation marks in the first place? In addition to DNMT1, our genome encodes “de novo DNA methylases”, including DNMT3a and DNMT3b. These can methylate a C in a CpG dinucleotide even if the complementary strand is not yet methylated. Such de novo DNA methylases are assumed to be involved in the establishment of the tissue-specific methylation patterns, which are then propagated through cellular generations by DNMT1.

Another very important category of “epigenetic marks” is the myriad of post-translational modifications of the nucleosomal histone tails. These include methylations, acetylations, phosphorylations, ubiquitylations and even proline isomerisations. Combinations of specific histone modifications correlate with the functionality of the corresponding chromosome region.
Accordingly, this “histone code” can be used to distinguish promoters (active, weak, poised), enhancers (strong, poised), insulators, transcribed regions, polycomb-repressed regions, heterochromatin, etc. The enzymatic complexes that impose these histone marks include polycomb repressive complexes 1 (PRC1) and 2 (PRC2). In contrast to DNA methylation, there is no clear understanding of the molecular mechanisms that ensure the faithful inheritance of the tissue-specific histone marks from mother to daughter cells. One possibility is a direct physical interaction between the histone-modifying and DNA methylation machineries.

An often overlooked mechanism for epigenetic inheritance, despite its extensively documented contribution to, for instance, the differentiation between the lytic and lysogenic states of bacteriophages, is the “positive feedback loop”. An inheritable differentiation switch may be determined by the presence or absence of a transcription factor that not only determines tissue-specific transcription programs, but also enforces its own presence by a direct or indirect positive feedback loop. Once the transcription factor is expressed in a mother cell, and provided that it remains present at the required concentrations in the daughter cells after mitosis and cytokinesis, it will remain expressed in the daughter cells and contribute to the maintenance of the cell-type-specific transcription pattern. If, on the contrary, it is not expressed in the mother cell, it will remain “off” in the daughter cells, with a concomitant effect on their respective transcriptomes. It is not known to what extent this mechanism contributes to the differentiation between animal cell types or, for that matter, to cancer progression.

Thus far we have considered epigenetics as a mechanism contributing to cellular differentiation in multicellular organisms. A distinct question is whether epigenetic mechanisms contribute to heritable differences between individuals. This somewhat controversial phenomenon is referred to as “transgenerational epigenetic effects” (TEE).
One possible molecular explanation to account for TEE is “transgenerational epigenetic inheritance” (TEI) or “gametic epigenetic inheritance”. TEI implies that different individuals may inherit alleles that have identical nucleotide sequences yet differ epigenetically, and that these “epialleles” function differently, thereby causing distinct phenotypes. Epialleles have been reported in plants and include (i) the peloric variant of Linaria, resulting from the methylation and silencing of the Lcyc gene, and (ii) the epigenetic silencing of the paramutable b1 locus in maize. A number of examples have also been reported in mice, including the agouti viable yellow (Avy) allele. This allele results from the insertion of an intracisternal A particle (IAP) retrotransposon upstream of the agouti gene. The IAP LTR drives ectopic expression of the agouti gene, causing the yellow phenotype. The degree of ectopic expression correlates with the methylation status of the LTR, which is, at least in part, heritable (hence the name metastable epiallele). The methylation status of the IAP LTR also appears to be sensitive to diet. There is some evidence that rare instances of hereditary non-polyposis colorectal cancer (HNPCC) involve the inheritance of inactivated epialleles of the mismatch repair/tumor suppressor gene MLH1. The segregation of epialleles implies their resistance to the reprogramming that is thought to reset nearly all epigenetic marks in the germline. (Note that parental imprinting involves the imposition of parent-of-origin-specific epigenetic marks in the corresponding germline, but that these are reset according to the parental sex at each generation.)

TEI is not the only possible explanation of TEE.
A nice example of a TEE that does not depend on TEI (yet involves DNA methylation in the soma) is the transgenerational inheritance of mothering style and stress in the rat. Another widely publicized case of possible TEE is the effect of malnutrition of pregnant mothers on the future health of their offspring and possibly grand-offspring, as documented for women who were pregnant during the Second World War in the Netherlands and Russia. The precise mechanisms underlying these TEE remain largely unknown.

B. OBTAINING A REFERENCE SEQUENCE OF THE HUMAN GENOME

From Sanger to Next Generation Sequencing. For approximately three decades, determining the nucleotide sequence of a DNA molecule was nearly exclusively done by means of Fred Sanger’s “dideoxy” method. In the initial version of this method, one strand of the DNA molecule to be sequenced was copied using a DNA-dependent DNA polymerase, in the presence of a mixture of the four regular deoxynucleotide triphosphates, one radiolabelled deoxynucleotide triphosphate, and one of the four dideoxynucleotide triphosphates. At each residue that is complementary to the dideoxynucleotide used in the reaction, the DNA polymerase has the option to incorporate either the regular deoxynucleotide or the dideoxy “terminator”. If the DNA polymerase incorporates the dideoxynucleotide, further extension of that molecule by the DNA polymerase is blocked by the absence of the needed hydroxyl group at the 3’ position of the sugar of the incorporated dideoxynucleotide. If, on the contrary, the DNA polymerase incorporates the regular deoxynucleotide at that position, further extension of the molecule is allowed until it faces the same “choice” at the next position that is complementary to the dideoxynucleotide used. To prime the polymerization, an oligonucleotide that hybridizes to the 3’ extremity of the template to be copied is added to the reaction.
The reaction is conducted “in parallel” on a very large number of copies of the template, which are obtained either by conventional cloning using plasmid- or phage-based vectors, or by “in vitro cloning” using the Polymerase Chain Reaction (PCR). When using conventional cloning, the primer typically targets known vector sequences flanking the insert to be sequenced. When completed, Sanger’s sequencing reaction will have generated a mixture of DNA molecules of different sizes, all ending with the same nucleotide (by incorporation of the corresponding dideoxynucleotide). This reaction is repeated four times, once with each of the four possible dideoxynucleotide triphosphates. Running the products of the four reactions on an acrylamide gel (and visualizing the products by autoradiography) revealed a ladder of fragments of different sizes that allowed immediate reading of the DNA sequence.

In the mid-1990s, the use of radiolabelled nucleotides was replaced by the use of dideoxynucleotides labelled with four distinct fluorophores. This allowed the four reactions to be conducted simultaneously and their products to be run in the same lane. Flat gels were replaced by capillaries, and manual reading was replaced by automatic data capture using charge-coupled devices (CCD) that read the color of the light emitted at the extremity of the capillaries while electrophoresis was proceeding. This would typically allow the generation of sequence reads of ∼600-800 base pairs. The first reference genomes of E. coli, S. cerevisiae, D. melanogaster, C. elegans, H. sapiens and M. m. domesticus were generated using this first generation of “automatic capillary sequencers”.

The next technological breakthrough was the development of methods of “sequencing by synthesis”.
In these, the heterogeneous mix of reaction products is not examined at the very end of the sequencing reaction; instead, the sequencing reaction is conducted one nucleotide at a time and the incorporated nucleotides are determined after each reaction “cycle”. Three different chemistries have dominated approaches for “sequencing by synthesis”.

The first is pyrosequencing. As in Sanger sequencing, pyrosequencing is based on the generation of a copy of a primed template-to-be-sequenced by a DNA-dependent DNA polymerase. In this reaction, the four nucleotides are added serially, one at a time, to the polymerization reaction (the succession is for instance G, then A, then C, then T). If the next base to be copied is complementary to the nucleotide that is added to the reaction, the latter will be incorporated by the DNA polymerase. This reaction will stoichiometrically release a pyrophosphate and a hydrogen ion. The pyrophosphate can be quantitatively detected by means of a luciferase-catalyzed reaction that releases light. This is the approach that was used by the 454 technology that was acquired by Roche. The hydrogen ion can be quantitatively detected by virtue of the change in pH that it causes. This approach, referred to as “ion semiconductor sequencing”, was developed by Ion Torrent and acquired by Life Technologies (it does not require scanning with a CCD camera, which shortens the cycle time). If the next two or more bases to be copied are complementary to the nucleotide that is added to the reaction, the amount of light emitted or the degree of pH change will be commensurate, making it possible to accurately count the complementary residues up to some number (which explains why these methods face some difficulties with mononucleotide repeats). The results of pyrosequencing are often represented as a “pyrogram”. Pyrosequencing allows read lengths >500 base pairs.
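The cyclic nucleotide flows and the proportional signal for mononucleotide runs can be illustrated with an idealized simulation. This is a sketch under simplifying assumptions: the input is the strand being synthesized (so each flow is compared directly with the next bases to be incorporated), the signal is exactly equal to the run length with no noise, and the function names are hypothetical.

```python
def simulate_flowgram(read, flow_order="GACT"):
    """Return a list of (nucleotide, signal) pairs for an ideal run.

    For each nucleotide flow, the signal is the number of consecutive
    bases of that type incorporated at the current position (0 if the
    next base to incorporate is a different one).
    """
    read = read.upper()
    if any(base not in flow_order for base in read):
        raise ValueError("read contains a base absent from the flow order")
    signals = []
    pos = 0
    while pos < len(read):
        for nuc in flow_order:
            run = 0
            while pos < len(read) and read[pos] == nuc:
                run += 1
                pos += 1
            signals.append((nuc, run))
            if pos == len(read):
                break
    return signals


def call_bases(signals):
    """Rebuild the read from an ideal flowgram."""
    return "".join(nuc * count for nuc, count in signals)


if __name__ == "__main__":
    flows = simulate_flowgram("GGATC")
    print(flows)
    print(call_bases(flows))
```

In a real instrument the signal for a run of n identical bases is only approximately n times the single-base signal, so telling n from n + 1 becomes harder as n grows; this is the difficulty with mononucleotide repeats mentioned above.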
The second “sequencing by synthesis” technology is “sequencing by ligation”. It takes advantage of the fact that DNA ligases can attach an incoming oligonucleotide to a primed template-to-be-sequenced, provided that the oligonucleotide is perfectly complementary to the template sequence immediately adjacent to the primer. The ligation reaction is carried out in the presence of a mixture of degenerate oligonucleotides that are specified by either one or a pair of residues (either in the middle or at the 5’ end) and labelled with one of four specific fluorophores. At a given cycle in the sequencing reaction, only one of these oligonucleotides can be added to the growing chain by the DNA ligase. Its identity can be determined by virtue of the color of the emitted light. The fluorophore is then released by cleaving off the part of the oligonucleotide carrying it, and a new cycle is initiated. After a number of cycles, the chain of linked oligonucleotides is released and the entire operation is repeated with a primer that is offset by one nucleotide relative to the previous one. This process is repeated until all bases in the template have been interrogated and its entire sequence determined. This approach underlies the SOLiD platform, which was acquired by Applied Biosystems and then Life Technologies.

The third (and presently dominant) sequencing-by-synthesis method uses “reversible terminators”. The 3’-OH group of these nucleotide analogues is typically replaced by various chemical groups, causing them to be terminators in the Sanger sense. Moreover, they carry one of four specific fluorophores attached to the base. Both the 3’ blocking group and the fluorophore can be chemically removed, restoring a 3’-OH group (hence the name reversible terminator) and a nearly natural base.
As in the previous sequencing-by-synthesis methods, determining the nucleotide sequence of a template molecule is achieved by its sequential replication by a DNA-dependent DNA polymerase extending a complementary primer. Depending on the next base in the template, the DNA polymerase will incorporate one of the four terminators, whose identity can be determined by reading the color of the light emitted by its fluorophore. The blocking group and fluorophore are then eliminated, and this “wash and scan” cycle is repeated as many times as possible. Sequencing with reversible terminators now allows the generation of reads over 250 base pairs in length. The corresponding sequencing technology was developed by Solexa, which was later acquired by Illumina, and which presently holds a >85% share of the sequencer market.

The biochemistries of the “sequencing-by-synthesis” (SBS) approaches do not, on their own, explain why these would be more efficient than original Sanger sequencing. As a matter of fact, the read length achieved by SBS is still inferior to what can be achieved by conventional Sanger sequencing. The superiority of SBS results from its combination with specific template preparation and processing methods that allow the analysis of millions of sequencing reactions in parallel. The limited sensitivity of all SBS methods still requires the generation of clonally amplified templates. In Sanger sequencing this was initially achieved by in vivo cloning and then increasingly by conventional PCR. In SBS approaches this is achieved by applying modified PCR-based methods, yielding what are sometimes called “polonies”.

The first of these is “emulsion PCR”. In this method, “adapters” are ligated to the double-stranded DNA to be sequenced.
The double-stranded adapters are “Y-shaped”, ensuring that the 5’ and 3’ ends of both the Watson and Crick strands are distinct and non-complementary. The resulting ligation products are then denatured and mixed with (i) beads that are coated with primers complementary to the 3’ end of the ligation products, (ii) a solution containing PCR primers (one complementary to the 3’ end of the ligation products and the other corresponding to the 5’ end of the ligation products), deoxynucleotide triphosphates and Taq polymerase, and (iii) oil. This generates an emulsion of aqueous droplets floating in oil. The emulsion is generated so as to maximize the proportion of droplets containing one bead and one single-stranded template molecule. The emulsion is then subjected to temperature cycling, allowing a polymerase chain reaction to occur in each droplet. After the PCR, the emulsion is broken, and the beads are denatured (so as to keep only the DNA strand that is covalently attached to the bead), washed and recovered. For the droplets that contained one bead and one template molecule, this process will generate beads covered with a large number of identical, single-stranded templates. In the Roche 454/FLX instrument, the corresponding beads are then loaded individually in one of ∼1 million PicoTitrePlate (PTP) wells, in which the pyrosequencing reaction will take place. As a consequence, it will be possible to monitor ∼one million sequencing reactions in parallel, each yielding up to 1,000 bp reads, for a typical total of ∼700 Mb of sequence per run. (On the SOLiD platform, x million such beads are spread and chemically cross-linked to an amino-coated glass surface. This allows for ∼ million sequencing-by-ligation reactions yielding up to ... bp reads to be conducted in parallel, for a total of ... base pairs of sequence per run.)

The second approach towards generating massive amounts of polonies in parallel is so-called “bridge PCR”, as implemented on the Illumina instruments. In this approach, the DNA fragments to be sequenced are ligated to “Y-shaped” adapters as for emulsion PCR.
The corresponding ligation products are then denatured and directly hybridized onto a glass plate (referred to as the “flow cell”) that is covered with a “lawn” of oligonucleotides comprising (i) primers that are complementary to the 3’ end of the ligation products, and (ii) primers corresponding to the 5’ end of the ligation products. Hence, the single-stranded templates hybridize by their 3’ end to complementary oligonucleotides attached to the flow cell. A DNA polymerase then extends the attached primer using the annealed single-stranded template as a model. This generates a strand that is bounded at its 3’ end by a sequence that is complementary to the other attached primer (the one corresponding to the 5’ end of the ligation products). The flow cell is then subjected to temperature cycles allowing for solid-phase amplification, or bridge PCR. The 3’ end of the freshly synthesized strand first hybridizes to a neighboring complementary primer fixed on the glass support. A DNA polymerase then extends the primer to generate the complementary strand, resulting in two closely positioned complementary strands that are both covalently attached to the solid support. The succession of several such temperature cycles generates a tight cluster of clonally amplified molecules. It is at present possible to generate >1 billion such clusters per lane, or >8 billion clusters per Illumina flow cell (which contains eight such lanes). The actual sequencing-by-synthesis reaction using reversible terminators is then conducted on the corresponding clusters using either one of the two primers. It is beyond the scope of this course to describe the procedures in full detail but, in effect, both strands composing each cluster can be sequenced one after the other.

The hundreds of successive “wash-and-scan” cycles that are required to complete a sequencing-by-synthesis reaction are time consuming.
A typical sequencing reaction requires ∼2 days on a Roche FLX and >10 days on an Illumina HiSeq 2000 instrument. One of the advantages of the Ion Torrent is that measuring the pH in individual wells is much faster than collecting images using a CCD camera. A typical run on an Ion Torrent machine requires only a few hours.

(Table 1 summarizes the throughput that is typically achieved with the four main sequencing-by-synthesis instruments that are presently available.)

While sequencing-by-synthesis has only become available in the last ten years, the next generation of sequencing technologies already looms on the horizon. These so-called third-generation approaches target “single-molecule sequencing”, hence obviating the need for the preparation of clonally amplified templates. Helicos commercialized the first instrument performing single-molecule sequencing. It used reversible terminators and shared several features with the Illumina technology (except for the lack of a bridge PCR step). A radically different approach, referred to as “Single-Molecule Real-Time Sequencing” (SMRT), is being commercialized by Pacific Biosciences. In this approach a single primed template molecule is captured by a DNA polymerase that is fixed to the bottom of a tiny well using a biotin/streptavidin interaction.
The DNA polymerase is fed deoxynucleotides that carry a fluorescent reporter attached to their tri-phosphate or equivalent moiety. When the DNA polymerase incorporates a given nucleotide, the corresponding fluorophore remains at the catalytic site for a period of milliseconds, generating a detectable signal when combined with zero-mode waveguide (ZMW) technology (which restricts the exciting laser beam to the bottom of the well). Incorporation of the nucleotide into the growing chain releases the fluorophore while leaving a perfectly natural base that can be further extended by the DNA polymerase. SMRT sequencing is fast, allows read lengths of thousands to tens of thousands of base pairs, which is a major advantage for the assembly of complex genomes (see hereafter), and may allow for the identification of modified nucleotides, including methylated ones (as methylation affects the kinetics of incorporation in a measurable way). However, the error rate remains very high and the number of ZMWs that can be monitored in parallel is presently limited to ∼150,000, hence limiting the throughput. Other routes that are being explored for single-molecule sequencing include DNA sequencing with nanopores, and direct imaging of DNA sequences using tunneling and transmission-electron-microscopy-based approaches.

Shotgun sequencing and mate-pair libraries. Except for some viruses, genome length is obviously much larger than achievable read length. To obtain the complete sequence of a genome of interest one therefore typically applies “shotgun sequencing”. In this approach, many copies of the genome of interest are randomly fragmented to yield pieces of a size compatible with read length. As multiple copies of the genome are subjected to fragmentation, the resulting fragments may overlap. This library of fragments is then “clonally amplified”. In the time of Sanger sequencing, this was done by ligating the fragments in a plasmid or phage vector, transforming or infecting competent bacterial cells, and generating colonies each amplifying a distinct fragment.
When using sequencing-by-synthesis methods this is done using either emulsion or bridge PCR (see above). In the end, one generates sequence reads from a large number of random (hence the name “shotgun sequencing”) fragments derived from the genome of interest. The complete genome sequence is then reconstructed “in silico” by searching for overlapping sequences (corresponding to overlapping fragments). Aligning all overlapping reads generates sequence contigs including a “tiling path” that, in an ideal world, should span and hence correspond to the entire genome (or, to be more precise, there should be as many tiling paths as there are chromosomes). For the tiling paths to span the entire genome, i.e. not to be interrupted by “gaps”, a large fraction if not the entire genome has to have been sequenced at least twice. To achieve this, and accounting for random as well as systematic (e.g. related to base pair composition) variations in sequence depth, the genome has to be sequenced at a depth of ∼100. Assuming a genome size of 3 x 10^9 bp and a read length of 100 bp, this implies that one would have to generate 100 x 3 x 10^9 / 100 = 3 x 10^9 reads (100-fold depth) rather than 3 x 10^9 / 100 = 3 x 10^7 reads (1-fold depth).

Experience shows that even when sequencing at ∼100-fold depth, it is impossible to reconstruct the entire genome using this approach only. In the draft sequence of the human genome published in 2001, 50% of the bases resided in contigs with a minimum size of 826,000 bp (“N50”) (total number of contigs = 4,884). In the so-called finished sequence of the human genome, published in 2004, N50 was 38,509,590 bp, and the average contig size was ∼41 Mb, hence corresponding to approximately 350 contigs. Several factors account for the difficulty (if not impossibility) of generating the complete genomic sequence of, for instance, a mammal by means of shotgun sequencing only. The main factor is the high content (∼50%; see above) of interspersed repetitive sequences. Finding the same interspersed repeat in two reads obviously does not mean that they derive from DNA fragments that overlap in the genomic sequence.
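The read-count arithmetic and the N50 statistic used above can be made concrete with a short sketch. The function names and the toy contig lengths are illustrative assumptions, not part of any published assembly pipeline; N50 is computed here as the largest length L such that contigs of length at least L jointly contain half of all assembled bases.

```python
def reads_needed(genome_size: int, read_length: int, depth: float) -> int:
    """Number of reads required so that depth * genome_size bases are sequenced."""
    return round(depth * genome_size / read_length)


def n50(contig_lengths: list[int]) -> int:
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Figures from the text: 3 x 10^9 bp genome, 100 bp reads.
print(reads_needed(3 * 10**9, 100, 100))  # 100-fold depth
print(reads_needed(3 * 10**9, 100, 1))    # 1-fold depth

# Hypothetical toy assembly (lengths in arbitrary units).
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

A single long contig can dominate the mean contig size, which is why N50 is usually reported alongside (or instead of) the average.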
Thus interspersed repetitive sequences need to be ignored, or “masked”, before initiating the in silico reassembly process. This makes it very difficult if not impossible to reassemble regions that are rich in repetitive sequences, particularly when the sequence reads are short, as is the case with the “sequencing by synthesis” methods. Thus, many gaps in the sequence correspond to regions rich in interspersed repeats. Segmental duplications are another nightmare for genome assemblers. The assembly process will often collapse segmental duplications into one, characterized only by an unusually high (∼doubled for a true duplication) sequence depth when compared to the rest of the genome. Satellite sequences corresponding to centromeres and telomeres are so difficult to assemble that they are usually just ignored when reconstructing genome sequences.

A second factor complicating reassembly is the pervasive genetic polymorphism. The DNA to sequence is typically extracted from cells of at least one, sometimes several individuals. These individuals are diploid: they actually contain two genomes, one inherited from the father and one from the mother. Two genomes drawn “at random” from the population differ on average about once every 1,000 base pairs, at so-called “polymorphic sites” (see hereafter). Regions of overlap between fragments may thus in effect differ if they originate from different homologues (the paternal and the maternal). When two reads are characterized by nearly identical sequences, the question thus becomes whether they derive from different parts of the genome or from overlapping but “allelic” fragments. To mitigate this issue, scientists select, whenever possible, individuals that are as inbred as possible as a source of DNA to generate a genomic sequence. Indeed the two alleles of inbred individuals have a higher chance to be “identical-by-descent” than those of outbred individuals.
In inbred strains of mice for instance, there are virtually no polymorphisms (with the exception of de novo mutations; see hereafter) that differentiate the paternal and maternal genomes.

Shotgun sequencing, even at high depth, hence typically generates many unconnected sequence contigs, which is not very satisfying. To help order and orient contigs, scientists have devised approaches based on “mate pairs” or “paired ends”. The first versions of these approaches were implemented with Sanger sequencing. Genomic libraries were constructed in plasmid (or phagemid) vectors from large DNA fragments, for instance 5, 10 or even 50 Kb. DNA was subjected to milder fragmentation treatments, and fragments in the desired size range were selected by gel electrophoresis and ligated into cloning vectors. Clones from the corresponding libraries were then sequenced with vector-specific primers flanking both ends of the insert, yielding “mate pair” reads, i.e. two reads known to correspond to the two ends of the same fragment. Large numbers of such mate pair reads were then used to reassemble the genome based on overlapping reads as described above, yielding a number of disconnected sequence contigs. The mate pair information was used at this stage to identify mate pairs for which one read was part of one contig, and the other read part of another contig. Such events unambiguously indicate that the corresponding contigs are neighbors, determine their relative orientation, and size the gap separating them. This establishes ordered and oriented sets of (previously unconnected) contigs, referred to as “sequence scaffolds”. The draft sequence of the human genome published in 2001 comprised 2,191 such sequence scaffolds with an N50 of 2,279 Kb. Moreover, knowing the mate pairs connecting neighboring sequence contigs, the corresponding clones could be used to complete the intervening sequence, thereby closing the gaps to progressively lead to a truly “finished” sequence.
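The scaffolding logic described above can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: each mate pair is given only as the pair of contig names its two reads fall in (all names are hypothetical), a link is trusted once it is supported by a minimum number of independent mate pairs, and the trusted links are assumed to form a simple path. Orientation and gap sizing, which real scaffolders also derive from mate pairs, are left out.

```python
from collections import Counter


def contig_links(mate_pairs, min_support=2):
    """Count mate pairs whose two reads fall in different contigs and keep
    the contig pairs supported by at least min_support mate pairs."""
    counts = Counter()
    for contig_a, contig_b in mate_pairs:
        if contig_a != contig_b:  # pairs inside one contig carry no linking information
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def order_scaffold(links):
    """Chain linked contigs into an ordered list, assuming the links form a simple path."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    ends = sorted(c for c, nbrs in neighbours.items() if len(nbrs) == 1)
    if not ends:
        return []
    path, seen = [ends[0]], {ends[0]}
    while True:
        candidates = [c for c in neighbours[path[-1]] if c not in seen]
        if not candidates:
            break
        path.append(candidates[0])
        seen.add(candidates[0])
    return path


# Hypothetical mate pairs: (contig holding read 1, contig holding read 2).
pairs = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c2"),
         ("c1", "c1"), ("c3", "c4")]
links = contig_links(pairs)
print(links)
print(order_scaffold(links))
```

Requiring more than one supporting mate pair is a design choice that guards against chimeric clones and reads misplaced because of repeats; the single `c3`/`c4` pair above is discarded for that reason.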
The “mate pair” approach was later adapted to “sequencing by synthesis” approaches. In one such method, the DNA to be sequenced is gently fragmented and pieces of the desired size (e.g. 5 Kb) are isolated by gel electrophoresis. Their ends are then enzymatically repaired using a DNA polymerase and biotinylated dNTPs. The resulting blunt-ended molecules are then circularized using ligases and more aggressively refragmented to obtain pieces of ∼250 bp. The fragments encompassing the ligation junctions are then affinity purified with streptavidin, appended with Y-shaped adapters, and both ends of the ensuing fragments are sequenced on next generation sequencers as described above.

Linkage maps, radiation hybrid maps and BAC contigs. It is presently possible to generate a very decent reference sequence of the genome of any mammal “just” by applying shotgun sequencing in combination with “mate pair” strategies on “sequencing by synthesis” sequencers. However, additional information was utilized to obtain the first reference sequence of the human genome, in a “hierarchical sequencing approach”.

The “human genome project”, initiated in ∼1990 and aiming to produce the first reference sequence of the human genome, started with the construction of a series of “maps” as a preamble to the actual sequencing phase. The first maps to be generated were “linkage maps”. These were composed of tens of thousands of microsatellite markers that were positioned relative to each other by applying the principles of linkage analysis established by Morgan in Drosophila. To that end, three-generation families with as many children as possible (mainly sampled in the Mormon population from Utah and known as the “CEPH families”) were genotyped for all known microsatellites. As microsatellite markers are often highly polymorphic, paternal and maternal alleles differ in many individuals, which are therefore said to be heterozygous. When tracking two or more such microsatellites jointly in such extended families, it is possible to identify recombination events as instances where an offspring inherits, for instance,
an allele tracing back to its paternal grandfather for one microsatellite and an allele tracing back to its paternal grandmother for another microsatellite. For closely “linked” microsatellites such recombination events (involving cross-overs) are expected to be rare, while for microsatellites that are located on distinct chromosomes or far from each other on the same chromosome, such recombination events will be observed in half the children. As established by Morgan, the observed recombination rate between two markers is thus a measure of their distance, which is typically measured in “centimorgans” (one centimorgan is the distance separating two markers for which the recombination rate is 1%). This information can be used to identify groups of syntenic markers (i.e. located on the same chromosome) and to order them on the basis of observed cross-over events. Linkage maps of the human genome including tens of thousands of microsatellite markers were hence constructed in the nineties. In addition to forming a backbone for the human genome project, the motivation to develop linkage maps lay in the opportunities that they created to locate genes underlying inherited diseases.

While tens of thousands of microsatellite landmarks spanning the genome may seem a lot, the average distance separating such markers is still of the order of 100 Kb, which remains large. To increase the density of landmarks, geneticists turned their attention to “radiation hybrids” (RH). RH are obtained by first irradiating a euploid human cell line with X-rays. This generates chromosome breaks at a rate that can be controlled by varying the applied dose of X-rays. The number of breaks is typically so high that, if left on their own, all cells would rapidly die. However, irradiated cells can be rescued by growing them in the presence of a rodent cell line while favoring cell fusion by means of a virus (e.g. Sendai virus) or chemical (e.g. polyethylene glycol). If one uses a rodent cell line harboring a conditional defect (e.g.
deficiency in thymidine kinase (TK)), and if one grows the cells in a selective medium, only interspecific hybrids maintaining the human TK gene will survive. By doing so, one can select clones of interspecific hybrids or heterokaryons. Such clones are typically characterized by a full complement of rodent chromosomes having integrated fragments of the human genome at multiple locations, amounting to ∼15% of the human genome. Given the selection procedure, all of the surviving hybrid clones will have integrated a fragment of the human genome containing the TK gene. But all other fragments can be viewed as a random sample of the human genome. Each hybrid clone will hence contain a different portion of the human genome, and this feature makes it possible to use a panel of RH as a mapping resource. Such a panel typically contains ∼100 independent clones. These are grown and DNA is extracted from all of them. This DNA panel is then used to perform PCR reactions for as many “amplicons” as possible. Such an amplicon can correspond to any short unique sequence in the genome and is sometimes referred to as a sequence tagged site or STS. Contrary to the requirements for genetic markers in linkage maps, STS do not need to be polymorphic (although they can be). Any sequence will do, as long as it can easily be amplified by PCR from genomic DNA. When typing an RH panel for a given STS, approximately 15% of the clones will be “positive” (i.e. the PCR amplification will yield the expected product) and 85% will be negative. When typing an RH panel for two STS that map, for instance, to different chromosomes, one expects ∼15% x 15% = 2.25% of the RH clones to be positive for the two STS. To be more precise, the expected proportion of double-positive clones will be r1*r2, where r1 and r2 are the observed retention rates for STS1 and STS2 respectively (r1 and r2 will be close to, but not exactly, 15%). Likewise, the expected proportion of double-negative clones is (1-r1)*(1-r2), while the expected proportion of clones that are positive for only one of the two STS is r1*(1-r2) + (1-r1)*r2.
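The expected proportions for two unlinked STS can be compared with what is actually observed in a panel. The sketch below is only an illustration of that comparison: the function names and the 20-clone typing vectors are hypothetical, and each vector simply records the PCR result (1 = positive, 0 = negative) of one STS across the same clones.

```python
def expected_rh_proportions(r1: float, r2: float) -> dict:
    """Expected clone proportions if the two STS are retained independently."""
    return {
        "double_positive": r1 * r2,
        "double_negative": (1 - r1) * (1 - r2),
        "single_positive": r1 * (1 - r2) + (1 - r1) * r2,
    }


def observed_rh_proportions(typing1, typing2) -> dict:
    """Observed clone proportions from two equal-length 0/1 typing vectors."""
    if len(typing1) != len(typing2) or not typing1:
        raise ValueError("typing vectors must be non-empty and of equal length")
    n = len(typing1)
    both = sum(1 for a, b in zip(typing1, typing2) if a and b)
    neither = sum(1 for a, b in zip(typing1, typing2) if not a and not b)
    return {
        "double_positive": both / n,
        "double_negative": neither / n,
        "single_positive": (n - both - neither) / n,
    }


# Hypothetical typing results for two STS on a 20-clone panel.
sts1 = [1, 1, 1] + [0] * 17
sts2 = [1, 1, 0] + [0] * 17
r1, r2 = sum(sts1) / len(sts1), sum(sts2) / len(sts2)
print(expected_rh_proportions(r1, r2))
print(observed_rh_proportions(sts1, sts2))
```

In this toy panel the observed double-positive proportion is well above r1*r2, which is the pattern expected for two STS that are retained together more often than chance would predict.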
If the two STS are located very close to each other in the genome, the proportion of double-positive and double-negative clones will be higher than expected, while the proportion of single-positive clones will be lower than expected. Indeed, double-positives will occur at a rate of ∼r1 ≈ r2 > r1*r2, and double-negatives at a rate of ∼(1-r1) ≈ (1-r2) > (1-r1)*(1-r2). A single-positive clone can only be observed if a chromosome break occurred between STS1 and STS2, and if one (and only one) of the fragments was retained in the RH clone. This event becomes more unlikely the closer the two STS are located. Thus the degree of excess of double-positives and double-negatives and the depletion of single-positives is a measure of the distance between the two STS. Several RH panels have been generated for the human genome and “genotyped” in factory style for hundreds of thousands of STS, including the microsatellites forming the basis of the linkage maps. By exploiting the principles outlined above, “radiation hybrid maps” comprising as many STS were constructed, hence augmenting the density of landmarks on the human genome. RH panels differ by the dose of X-rays used (e.g. 5,000 vs 15,000 rads), providing mapping power and resolution at different scales.

Finally, before initiating the actual sequencing stage, scientists constructed a physical map of the human genome consisting of a series of “BAC contigs” anchored on the linkage and/or the RH maps. BACs, or bacterial artificial chromosomes, are plasmid-like bacterial cloning vectors. Their main feature is that BACs can be used to maintain and propagate inserts as large as 250 Kb. This is primarily due to the fact that their origin of replication maintains a low number of copies per cell, thereby reducing the possibility of inter-plasmid recombination (particularly when using recombination-deficient bacteria) that would cause deletions and other rearrangements. Earlier efforts attempted to use YACs or yeast artificial chromosomes, which have an even higher cloning capacity.
However, the high rate of rearrangements in YAC vectors has in essence precluded their large-scale use. Thus BAC libraries, with a number of independent clones corresponding to a very large number of genome equivalents, were constructed for the human genome. In an approach that shares many features with “shotgun sequencing” (yet does not depend at this stage on actual sequencing), scientists devised strategies to very effectively identify partially overlapping clones. The main strategies were “STS content mapping” and “BAC fingerprinting”, which were both applied in factory mode in specialized genome centers. In the “STS content mapping” approach, DNA extracted from hundreds of thousands of independent BAC clones is tested for the presence/absence of hundreds of thousands of STS. Different BAC clones that are positive for the same STS must, by definition, overlap. By studying the STS sharing between all pairs of BACs it is possible to generate what is called a BAC contig, i.e. a collection of partially overlapping BAC clones that together cover a large segment of the genome. To reduce the number of PCRs to conduct, clever “pooling” strategies were used. Rather than testing BACs individually, PCRs were conducted using pools containing DNA from many BACs. As every BAC is assigned to several pools, it is possible, a posteriori, to determine which BACs are positive in a given positive pool. In the “BAC fingerprinting” approach, DNA was extracted from individual BACs, digested with a cocktail of restriction enzymes, and the resulting fragments separated by polyacrylamide gel electrophoresis. A subset of the fragments was then visualized using a variety of strategies. Obviously, if one generates such a BAC fingerprint twice for the same BAC, it will be identical.
If one compares the BAC fingerprints of two non-overlapping BACs, their restriction patterns will be completely different; the chance to observe two or more fragments of the same size is low. If two BACs are partially overlapping, the degree of band sharing will be intermediate, hence making it possible to recognize overlapping BACs. This approach was also used in factory mode, and the ensuing information used to generate BAC contigs. Whether generated using STS content mapping or BAC fingerprinting, the resulting “physical maps” of the human genome comprised a large number of contigs spanning on average ... Kb, with N50 of ... . BAC contigs containing microsatellites or STS positioned on linkage and/or RH maps could be anchored (that is, positioned) and often ordered (if containing more than one mapped STS) with respect to each other on the human genome.

To monitor how well the corresponding physical map covered the different chromosomes, a subset of individual BACs was fluorescently labeled and directly hybridized to metaphase chromosomes in experiments called fluorescent in situ hybridization or FISH, to determine the exact position of the corresponding BAC contig on the human genome.

Subsequently, scientists selected a “minimum tiling path” of BACs, i.e. the smallest possible collection of BACs that would jointly cover as much as possible of the human genome with minimum overlap, and these were then subjected to shotgun sequencing as described above. This implied sub-cloning of the BAC DNA in a regular plasmid vector, Sanger sequencing of a large number of randomly selected plasmid clones for each BAC, and reassembly of the BAC’s sequence on the basis of the observed overlap between sequence reads.
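Selecting a minimum tiling path can be illustrated with the classic greedy interval-cover procedure. The sketch below is not the method used by the genome centers; it assumes, for illustration only, that every BAC in a contig has already been assigned approximate start and end coordinates (hypothetical values in Kb), and it minimizes the number of clones rather than the overlap itself.

```python
def minimum_tiling_path(clones):
    """Greedy interval cover.

    clones: list of (name, start, end) tuples describing one BAC contig.
    Returns the names of a smallest set of clones covering the contig
    from its leftmost start to its rightmost end.
    """
    if not clones:
        return []
    ordered = sorted(clones, key=lambda c: c[1])
    contig_end = max(end for _, _, end in clones)
    covered_to = ordered[0][1]
    path, i = [], 0
    while covered_to < contig_end:
        best = None
        # Among clones starting inside the covered region, take the one reaching furthest.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError("gap in clone coverage")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical BAC contig, coordinates in Kb.
bacs = [("A", 0, 150), ("B", 50, 200), ("C", 100, 260),
        ("D", 140, 300), ("E", 250, 420), ("F", 290, 450)]
print(minimum_tiling_path(bacs))
```

Clones B, C and E are skipped because, at each step, another clone that still overlaps the region covered so far extends further to the right.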
The sequences of neighboring BACs were then concatenated to progressively generate larger “chunks” of the sequence of the human genome. The first reference sequence of the human genome was largely generated by a publicly funded effort dominated by American and British laboratories that used this “hierarchical sequencing approach”. A private company (“Celera”) attempted to overtake the public effort using a direct approach based only on shotgun sequencing combined with the systematic use of mate pair libraries. The race was somewhat flawed, however, as Celera had access to the data generated by the public effort, which were made readily available in the public domain. Both versions of the human reference sequence, generated by the publicly funded consortium and by Celera, were published at the same time in Nature and Science, respectively. The generation of reference genomes for other organisms, including 29 mammals today, is now nearly entirely based on direct sequencing approaches.

C. ANNOTATING THE REFERENCE SEQUENCE OF THE HUMAN GENOME

The reference sequence of the human genome is in essence 24 long strings (the 22 autosomes and the two gonosomes, X and Y) of As, Cs, Gs and Ts amounting to ∼3 x 10^9 bases. Obviously, that is not very informative per se, unless this sequence is “annotated”. This means that one would like to know the position of the genes, including the limits between exons and introns; where the cis-acting regulatory elements are that control the expression of the genes; the location and type of the interspersed repetitive elements; where centromeres and telomeres start; and, why not, that one would identify novel essential features of our genome from the clever examination of its sequence.

Masking repetitive sequences. One of the first steps in annotating a genome is to identify repetitive sequences. By definition, the basis for the recognition of repetitive sequences is that these match multiple similar sequences in the genome when used as query sequence.
From such matches, consensus sequences representative of distinct families and sub-families can be generated in an iterative process; these can then serve as queries for the identification of as many “homologous” repeats as possible. The RepeatMasker software (http://www.repeatmasker.org) with its associated Repbase database of repetitive elements is the preferred tool for this task. It will identify interspersed repeats belonging to the four major classes as well as simple sequence repeats, provide summary statistics about the number of elements found in each class and what fraction of the genome they represent, and return a sequence in which the repeats have been masked if so requested.

Establishing a gene catalogue. Identifying our genes is obviously one of the most important goals of the annotation process, as these are assumed to embody the “raison d’être” of our genome.

The definition of what constitutes a gene is evolving, yet an undisputed common feature of all our genes is that they are transcribed. Indeed it is either the resulting RNA (non-coding genes) or the product of its translation (protein coding genes) that fulfills their function. Thus, identifying all the regions of our genome that are transcribed (i.e. our transcriptome) is probably the approach that has been most informative in establishing our catalogue of genes. Note that while all genes are transcribed, not all that is transcribed constitutes per se a gene. In the early days, the mRNA sequence of many genes was determined before their genomic sequence, particularly if abundantly expressed in a specific tissue. cDNAs were used as probes to isolate and then sequence cognate genomic fragments. This led to the recognition of the mosaic (i.e. comprising exons and introns) structure of eukaryotic genes. In the nineties, large-scale “brute force” sequencing of cDNAs was recognized as an effective strategy to identify new genes.
cDNA libraries were generated (usually from polyadenylated RNA) from as many tissues (both healthy and diseased) and developmental stages as possible. Clones were randomly picked from such cDNA libraries and subjected to large-scale sequencing, initially using Sanger sequencing. The resulting reads, corresponding typically to partial mRNA sequences, were referred to as Expressed Sequence Tags (ESTs). Partially overlapping ESTs were assembled into longer (if not full-length) mRNA sequences, often revealing isoforms resulting from alternative splicing and polyadenylation. “Blasting” the ensuing mRNA sequences against the reference sequence of the human genome located the corresponding gene, revealed its exon-intron structure, and allowed prediction of its most likely open reading frame (ORF). This approach was applied systematically by several genome centers, both public and private, including companies such as Celera. EST sequencing was complemented by approaches that extracted short mRNA-specific cDNA tags, which were concatenated prior to Sanger sequencing. “Serial Analysis of Gene Expression” or SAGE extracted tags in the vicinity of poly-A tails, while “Cap Analysis of Gene Expression” or CAGE extracted tags in the vicinity of the 5’ cap. Counting the number of times a given tag was sequenced provided the first “digital” measures of gene expression. Sanger sequencing-based approaches were subsequently complemented with array-based approaches to characterize the transcriptome. Tissue-specific RNAs were fluorescently labeled and hybridized to arrays comprising millions of tiled oligonucleotides jointly spanning the entire human reference genome. Transcribed segments of the genome were recognized by virtue of the fluorescence of the RNA binding to the corresponding probes. The intensity of the fluorescent signal provided information about the relative abundance of distinct transcripts.
These experiments were amongst the first to reveal the pervasive transcriptional potential of our genome. Indeed, in addition to the dominating signal of protein-coding genes and a growing catalogue of long and short non-coding genes, lower level hybridization (and hence transcription) appeared to be widespread, encompassing as much as 75% of our genome. The biological relevance of this pervasive transcription remains an essential, as yet unanswered question. More recently, “RNA-Seq” has become the method of choice for qualitative and quantitative characterization of “transcriptomes”. Typically, RNAs are reverse transcribed into cDNAs, which are then sequenced by synthesis. Strand-specific methods have been developed to maintain information about the strand that is actually transcribed. The structure of transcripts is reconstructed on the basis of overlapping reads, using paired-end information when available, hence informing about exon-intron gene organization and the occurrence of isoforms. Sequence depth provides digital information about expression level. Sequencing by synthesis still suffers from some drawbacks linked to the shortness of the sequence reads, the necessity of reverse transcription, and the need for target amplification, which may introduce unwanted representational biases. In the future, single molecule sequencing may overcome some of these limitations.

The picture that emerges from the application of these increasingly powerful methods for the characterization of the human transcriptome includes (i) the lower than expected number of protein-coding genes, (ii) the abundance of non-coding RNA genes, including previously unknown small and long non-coding RNA genes, many of them with still uncharacterized function, and (iii) the unsuspected complexity of the “transcriptional units” corresponding to known protein coding genes, including pervasive alternative splicing, antisense transcripts, sense and antisense transcripts associated with the 5’ end of genes, and antisense transcripts associated with the 3’ end of genes.
While aligning RNA sequences with the sequence of the reference genome has provided the bulk of the information used to assemble our gene catalogue, other methods have provided essential complementary and/or confirmatory information. Ab initio gene prediction programs based on machine learning methods (particularly Hidden Markov Models) have been developed and perform very well for the detection of protein coding genes. Coding exons are subject to strong purifying selection (they evolve more slowly than the rest of the genome, as most mutations in them are deleterious and hence eliminated by natural selection; see hereafter). As a consequence, the sequences that can still be recognized as homologous when comparing the human genome with that of birds or fish are very strongly enriched in coding exons. Combined with the fact that the pufferfish genome is largely devoid of repetitive sequences (which greatly reduces the magnitude of the effort), this observation was the primary motivation for the determination of the complete genomic sequence of the pufferfish. Finally, the transcriptional process imposes specific epigenetic marks on the cognate chromatin (e.g. H3K36me3), and detecting these marks has become an effective strategy to monitor transcription (see “ChIP-Seq” hereafter). As an example, many long non-coding RNA genes have been identified using this approach.

Identifying cis-acting regulatory elements. Gene annotations typically include the 5’ UTR, coding exons and intervening introns, and the 3’ UTR. They do not include the regulatory elements that control the expression of the gene in cis (whether at the transcriptional level or at a later stage).
Yet, it is essential to identify these “switches”, as mutations in them might perturb gene function and hence underlie phenotypic variation including disease. Several approaches have been devised in the last decades that can be adapted for the large-scale identification of cis-acting regulatory elements. These include the identification of evolutionary constraints, chromatin immunoprecipitation combined with next generation sequencing, and the identification of DNaseI hypersensitive sites by means of next generation sequencing.

All mammals derive from a common ancestor species that roamed the earth ∼225 million years ago. As new species emerged and evolved independently, their genomes progressively diverged from that of the ancestor species and hence from each other. This progressive divergence is largely driven by the processes of mutation (which generates new variants) and random drift (which leads to the fixation in the population of a small fraction of the neo-mutations; see hereafter). In the absence of other forces (corresponding to “neutral” evolution), divergence accumulates at a nearly constant rate of ∼10^-9 substitutions per base and per year of evolution. However, this “background” rate applies only to regions of the genome that are not functional, such as large proportions of the interspersed repetitive elements. Genomic segments that fulfill an essential function usually evolve more slowly: they are evolutionarily constrained. This is due to the fact that mutations in them will usually be deleterious, hence decreasing the reproductive capacity of the individuals that inherit them, hence decreasing their chance of fixation in the population. This process of “purifying selection” is easily recognizable from the comparison of the conservation of 1st, 2nd and 3rd positions in codons. Mutations at 3rd positions are more often synonymous than those at 1st and 2nd positions, and 3rd positions are concomitantly less conserved than 1st and 2nd positions.
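The statement about codon positions can be checked directly against the standard genetic code. The sketch below enumerates every single-base change in each of the 61 sense codons and counts, per codon position, how many leave the encoded amino acid unchanged. It assumes the standard (nuclear) genetic code, weights all codons and all base changes equally, and counts changes to a stop codon as non-synonymous; real substitution patterns are not uniform, so the figures are only indicative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_counts_by_position():
    """Return ([synonymous changes at positions 1, 2, 3], changes per position)."""
    synonymous = [0, 0, 0]
    sense_codons = [c for c, aa in GENETIC_CODE.items() if aa != "*"]
    for codon in sense_codons:
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                if GENETIC_CODE[mutant] == GENETIC_CODE[codon]:
                    synonymous[position] += 1
    return synonymous, 3 * len(sense_codons)


counts, per_position = synonymous_counts_by_position()
for position, n in enumerate(counts, start=1):
    print(f"position {position}: {n}/{per_position} synonymous ({n / per_position:.1%})")
```

Under these assumptions the large majority of third-position changes are synonymous, a few first-position changes are (within the leucine and arginine codon families), and no second-position change is, which matches the conservation pattern described above.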
It was rapidly recognized that cis-acting regulatory elements shared amongst species might be effectively identified by virtue of their evolutionary constraints. As soon as the human genome was completed, the main genome centers applied their sequencing capacity to a growing number of species spanning the tree of life. If a sufficient number of species were sequenced, it would be possible to identify individual bases that, within the group of sequenced species, evolve more slowly than expected in the absence of selection. Recently, the genome sequence of the first 29 eutherian mammals was reported. These allowed for the reliable identification of evolutionary constraints, not at single-base but at 12-base pair resolution. Millions of constrained elements were hence identified, of which >60% lay outside of gene boundaries defined as above. They amounted to ∼5% of the human reference genome. They included known codons, newly discovered codons, RNAs with conserved secondary structure, cis-acting regulatory elements operating at the transcriptional and post-transcriptional level (of which hundreds of thousands were derived from mobile elements), etc. Mining the sequences of evolutionarily constrained elements revealed hundreds of target motifs for trans-acting factors including transcription factors and miRNAs.

This study identified elements whose function is conserved amongst eutherians or placental mammals. Identifying regulatory elements that are unique to (and hence define), for instance, primates (or other branches of the mammalian tree) will require the sequencing of more primates to achieve appropriate statistical power. Initiatives to determine the sequence of 10,000 vertebrate species are underway with that very objective in mind (http://genome10k.soe.ucsc.edu): to identify regulatory elements that are specific for the different clades in the vertebrate tree. By sequencing a large enough number of humans, one could likewise identify human-specific regulatory elements by virtue of the observed paucity or depletion of polymorphisms at specific residues.
As an alternative approach for the identification of cis-acting regulatory elements, one can study the epigenetic marks that characterize and differentiate such elements, i.e. the histone code. To approach this in a systematic way, one increasingly uses chromatin immunoprecipitation combined with next generation sequencing (“ChIP-Seq”). In this method, cells representing a tissue type of interest (or, if possible, as many cell types as can be obtained) are first soaked in formaldehyde. This freezes the structure of the chromatin close to its native state by cross-linking DNA and histones covalently, albeit reversibly. The frozen chromatin is then extracted, fragmented by sonication, and incubated with an antibody that recognizes a specific histone modification. Specific antibodies are presently available for tens of such histone modifications, including methylation, acetylation and phosphorylation of specific residues in the amino-terminal tails of the four nucleosomal histones (H2A, H2B, H3 and H4). The antibodies are then precipitated, and with them the chromatin fragments to which they bind. The formaldehyde-induced cross-links are reversed by heat treatment, and the released DNA fragments are recovered and sequenced on next generation sequencers. By mapping the reads back to the reference genome, one obtains a direct, quantitative reading of which segments of the genome carried the corresponding modification. The experiment is repeated with the different antibodies, yielding genome maps for the different histone modifications.
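The “quantitative reading” obtained by mapping reads back to the reference genome can be pictured as a simple count of reads per genomic window. The sketch below is a minimal illustration under strong assumptions: reads are reduced to a (chromosome, position) pair, the window size and the read threshold are arbitrary, and no control sample or statistical model is used, whereas real peak callers rely on both.

```python
from collections import Counter


def bin_read_counts(read_positions, bin_size=200):
    """Count mapped reads per (chromosome, bin index) window of bin_size bp."""
    counts = Counter()
    for chromosome, position in read_positions:
        counts[(chromosome, position // bin_size)] += 1
    return counts


def enriched_bins(counts, min_reads):
    """Windows holding at least min_reads reads, in sorted order."""
    return sorted(window for window, n in counts.items() if n >= min_reads)


# Hypothetical mapped reads from one antibody: (chromosome, 0-based start position).
reads = [("chr1", 105), ("chr1", 150), ("chr1", 199), ("chr1", 950), ("chr2", 40)]
counts = bin_read_counts(reads)
print(counts)
print(enriched_bins(counts, min_reads=3))
```

Repeating the same count for each antibody gives one such track per histone modification, which is the kind of input from which chromatin states are then inferred.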
The same ChIP-Seq experiments can be conducted with antibodies recognizing specific trans-acting transcription factors, RNA polymerases, factors binding to insulators (e.g. CTCF), components of the polycomb groups, proteins mediating the interaction between trans-acting factors and the basal transcription machinery (e.g. p300), etc. The analysis of the resulting ChIP-Seq maps (by means of Hidden Markov Models) reveals a limited number of chromatin states, each with its own histone code, corresponding (amongst others) to promoters (active, weak or poised), enhancers (strong, poised), insulators, transcribed regions, polycomb-repressed regions, heterochromatin, etc.

Cis-acting regulatory elements, including promoters, enhancers/silencers and insulators, have been known for decades to share hypersensitivity to DNases as a common, distinctive feature. Thus, if one treats nuclei with limiting amounts of DNase, the genome will preferentially be cut at such cis-acting regulatory elements. This property has long been exploited to locate such elements, classically by means of Southern blotting targeting genomic regions of interest. More recently, DNase hypersensitivity protocols have been adapted to next generation sequencing, to allow for the systematic identification of regulatory elements in the entire genome. Indeed, isolating small DNA fragments released by DNase treatment (and hence enriched in DNA located in cis-acting elements), subjecting these to next generation sequencing, and mapping the resulting reads back to the reference genome identifies, at high resolution, the cis-acting regulatory elements that were active in the corresponding cell type. While this approach does not differentiate the different types of cis-acting regulatory elements, it recognizes a large number of them in a single experiment.
It is increasingly recognized that cis-acting regulatory elements can act over long distances (e.g. > 250 Kb) and control the expression of multiple target genes. The present working model is that a loop of DNA is formed, allowing the trans-acting factors bound to the cis-acting element to physically interact (directly or via mediator proteins such as p300) with the basal transcription machinery bound to the proximal promoters, and thereby to stimulate it (in the case of enhancers) or inhibit it (in the case of silencers). As a matter of fact, a gene may encompass (within its introns) cis-acting regulatory elements controlling neighboring or even more distant genes. It is therefore not straightforward to determine which genes are controlled by the gene switches identified by CHIP-Seq or DNase hypersensitivity. An indirect strategy to connect cis-acting regulatory elements with their target genes is to look for a correlation, across many cell types, between the activity of a gene (as measured by the analysis of the transcriptome) and the activity of the regulatory element (as measured by CHIP-Seq or DNase hypersensitivity). A more direct, recently developed method is chromatin conformation capture (3C, with extensions referred to as 4C, 5C and HiC). In this method the chromatin is frozen by cross-linking with formaldehyde and then fragmented (by sonication, as for CHIP-Seq, or, in the standard protocols, by restriction enzyme digestion). The cross-linking is expected to freeze the loops that are formed by the transcription factor-mediated interaction between enhancers/silencers and their target promoters. The resulting DNA ends are then repaired using biotinylated nucleotides prior to ligation under dilute conditions (to favor ligation between DNA ends brought together by cross-linking). This should generate ligation products between DNA fragments in the vicinity of the enhancers/silencers and DNA fragments in the vicinity of the promoters.
After ligation, the cross-linking is reversed, the trapped DNA is released and refragmented by sonication, and the fragments encompassing ligation points are enriched with streptavidin and then subjected to paired-end next generation sequencing. Loops that are stabilized by specific mediator proteins can be enriched (prior to reversing the cross-linking) by CHIP using cognate antibodies. Interactions between cis-acting regulatory elements and their target promoters are then recognized by paired-end reads of which one maps to a proximal promoter and the other to a cis-acting regulatory element on the same chromosome.

The large-scale implementation of the methods described above, with the aim of identifying all the functionally important elements in the human genome, has been coordinated at the international level as part of the ENCODE project (Encyclopedia of DNA Elements; https://genome.ucsc.edu/ENCODE/). Transcriptome, CHIP-Seq and DNase hypersensitivity data were generated for > 70 cell types, identifying millions of putative functional elements encompassing as much as 80% of the reference sequence of the human genome. The corresponding figures largely surpass the estimates based on evolutionary constraints (∼5%). The meaning of this discrepancy remains to be resolved.

Other -omes.

Genome browsers.

2. Individual human genomes

A. GENETIC VARIANTS: SNPS, INDELS, SSRS, CNVS AND OTHER VARIANTS

Types of genetic variants. The human reference genome is often depicted as being “our” genome. As a matter of fact, if we were to sequence our own, individual genomes (something which will increasingly become reality in the future), we would observe many differences between our own genome and the human reference genome. It is now well established that if one sequences two human genomes drawn at random from a population (for instance the genome that you inherited from your mother and the one you inherited from your father), these will differ from each other at millions of sites.
The first and numerically predominant such variable sites are Single Nucleotide Polymorphisms or SNPs. As their name implies, these are differences between two genomes involving a single base pair. SNPs comprise transitions (A=T versus G≡C base pair), transversions (A=T vs T=A; A=T vs C≡G; G≡C vs C≡G), and the insertion or deletion (“INDEL”) of a single base pair. Two human genomes typically differ at approximately 3 million SNPs, or one SNP every 1,000 base pairs (see hereafter).

SNPs are not the only kind of genetic polymorphism. Other well-known genetic variants are Simple Sequence Repeats or SSRs, corresponding to micro- and minisatellites. These are tandem repetitions of sequence motifs, ranging from one to five base pairs for microsatellites and up to fifty or more for minisatellites (these limits are perfectly arbitrary). The main feature of SSRs is that the number of tandem repetitions often differs between individual genomes. While SNPs are typically biallelic (i.e. only two alleles are usually observed at appreciable frequencies in the population), SSRs are very often multi-allelic, i.e. characterized by more than two (sometimes many) alleles. As a consequence, as soon as one considers ten or more highly polymorphic SSRs, two individuals virtually never have the same “composite” genotype (with the exception of monozygotic twins). These “composite” genotypes have therefore been called “DNA fingerprints” and have found widespread application in forensics.

Another important class of genetic variants are Copy Number Variants or CNVs. CNVs correspond to large segments of the genome (typically > 100 Kb), which are observed in individual genomes in different copy numbers. CNVs often coincide with so-called “segmental duplications” (although not all segmental duplications are polymorphic and hence CNVs).
The multiple copies of a CNV can either localize to the same chromosomal region (often in tandem or head-to-tail) or be dispersed in the genome. It is estimated that more than 10% of our genome is subject to copy number variation, including segments of the genome that contain genes. As a consequence, individuals differ in the number of copies they have for some genes. CNVs are therefore thought to have an important impact on the individual’s phenotype, including disease (see hereafter).

In addition to these main classes of genetic variants, individual genomes may differ at chromosomal variants including large deletions, insertions, inversions and translocations.

Minor allele frequency, nucleotide diversity and number of polymorphic sites. As mentioned above, sequencing and comparing two genomes (e.g. randomly sampled in Northern Europeans) will typically reveal on the order of 3 million SNPs. From this one can compute a classical measure of genetic polymorphism, referred to as nucleotide diversity (π), corresponding to the average heterozygosity per nucleotide site, which is approximately 0.001 in Northern Europeans. If one were to sequence and compare another pair of randomly sampled genomes, one would obtain a very similar value of π. However, if one now compares the four genomes jointly, the number of polymorphic sites (i.e. the number of sites for which at least one of the four genomes differs from the other ones) will be larger than 3 million. As a matter of fact, every time one sequences a new individual, one will uncover new polymorphic sites (yet the average value of π between all pairs of individual genomes will remain at 0.001). If one were to sequence the whole of humankind, it is possible that nearly every one of the 3 billion sites of our genome would be found to differ in at least one individual. Thus, if one looks hard enough, every site in our genome is potentially polymorphic. All these polymorphic sites can be sorted according to their Minor Allele Frequency or MAF.
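The contrast drawn above, between π (which stays stable) and the number of polymorphic sites (which keeps growing as genomes are added), can be made concrete with a toy calculation. The sketch below assumes that the genomes are given as aligned strings of equal length; the function names and the 20-base example sequences are invented for illustration.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Number of aligned sites at which two sequences differ."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def nucleotide_diversity(seqs):
    """Average number of differences per site over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * len(seqs[0]))


def polymorphic_sites(seqs):
    """Number of sites at which at least one sequence differs from the others."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def with_private_variant(position, length=20):
    """Toy genome: all 'A' except for one private 'G' at the given position."""
    return "A" * position + "G" + "A" * (length - position - 1)


if __name__ == "__main__":
    genomes = [with_private_variant(i) for i in range(4)]
    for n in (2, 4):
        sample = genomes[:n]
        print(n, "genomes:",
              "pi =", nucleotide_diversity(sample),
              "polymorphic sites =", polymorphic_sites(sample))
```

In this constructed example every pair of genomes differs at the same number of sites, so π is identical for two and for four genomes, whereas the count of polymorphic sites doubles.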
Assume a classical SNP characterized by two alleles, for instance A=T and G≡C. If one knew the genotype of 1,000 genomes for this SNP, one could compute what proportion of these carry the A=T allele and what proportion carry the G≡C allele. One of these is likely to be less frequent than the other one. It is the minor allele, and the corresponding proportion or frequency is the MAF of that SNP. By definition, the MAF ranges from 0 to 0.5. There are approximately 7 million SNPs with MAF > 0.05 in Northern European populations. These are said to be “common” SNPs. There are probably an equal number of SNPs with 0.05 > MAF > 0.005. These are referred to as “low frequency” SNPs. And finally there is an undetermined but larger number of SNPs with MAF < 0.005 (again, these limits are arbitrarily defined). These are referred to as “rare” SNPs. The frequency distribution of MAFs is typically exponential, with many more rare than common variants.

Germline mutations, drift and natural selection. What is the origin of the genetic polymorphism, and what explains the exponential MAF distribution? Genetic variants are generated by the process of de novo mutation in the germline. Every gamete carries on the order of 30 new variants that were generated as a result of the inherent imperfections of the DNA replication and repair machinery during the many cell divisions undergone by cells of the germline between fertilization and gametogenesis. The average number of cell divisions between fertilization and the production of an oocyte is 24, and this number is not affected by the age of the mother (oocytes have reached meiosis I before the birth of the female fetus).
The average number of cell divisions between fertilization and the production of a sperm cell is ∼30 + 23n + 5, where n is the age of the father minus 15. The number of cell divisions needed to produce a sperm cell is therefore larger than for an oocyte, which largely explains why sperm cells on average carry more de novo mutations than oocytes, and why this number increases with the age of the father. The process of de novo mutation generates a new allele that is called the “derived allele”, while the original sequence is called the “ancestral allele”.

What happens to these tens of de novo mutations inherited by every conceptus? The main factor that determines the fate of the majority of these de novo mutations is “luck”, referred to in genetics as “random drift”. Imagine an “isolated” population with 100 randomly breeding individuals. Consider one locus in the genome and imagine that you can unambiguously distinguish and track the 200 alleles of the 100 individuals. Let the individuals breed for one generation and let us examine what happened to the 200 alleles observed in generation “0”. It is very likely that, just by chance, some of the 200 alleles will already be missing in generation “+1”, while some others will be present more than once. Thus, just as a result of the stochastic process by which chromosomes are sampled in generation “0” to produce generation “+1”, some alleles are lost each generation while others see their frequency increase. This process repeats itself each generation, leading to the inescapable outcome that at some point in time only one allele of the original 200 will still be present in the population. At that time all the alleles in the population are said to be identical-by-descent, as they all trace back to the same common ancestor allele. This process is called the “coalescent”. It can be shown that the expected number of generations separating all the alleles present in the population at one point in time from their Most Recent Common Ancestor (MRCA) is ∼4N generations, where N is the size of the population.
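The thought experiment above can be simulated directly. The following is a minimal sketch under simplifying assumptions that go beyond the text: generations do not overlap, the population size is constant, and each allele of generation “+1” is drawn at random, with replacement, from the alleles of the previous generation (the classical Wright-Fisher model). The function name and the seed parameter are hypothetical.

```python
import random


def generations_to_single_ancestor(n_individuals=100, seed=0):
    """Track 2N labelled alleles until all copies descend from one of them.

    Returns the number of generations elapsed and the label of the
    generation-0 allele from which the whole population then descends.
    """
    rng = random.Random(seed)
    n_alleles = 2 * n_individuals
    population = list(range(n_alleles))  # one distinct label per allele
    generations = 0
    while len(set(population)) > 1:
        # Each new allele is a copy of a randomly chosen parental allele.
        population = [rng.choice(population) for _ in range(n_alleles)]
        generations += 1
    return generations, population[0]


if __name__ == "__main__":
    n = 100
    times = [generations_to_single_ancestor(n, seed=s)[0] for s in range(20)]
    print("mean over 20 replicates:", sum(times) / len(times),
          "generations; theory predicts about", 4 * n)
```

Individual replicates vary widely around the expectation, which is itself a reminder of how stochastic drift is. Because every one of the 2N initial alleles is equally likely to be the survivor, the same simulation also illustrates why a single new allele has a probability of 1/(2N) of eventually taking over the population.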
A de novo mutation inherited by an individual living in a population of size N has a probability of 1/(2N) of becoming fixed in the entire population (and hence completely replacing the ancestral allele) ∼4N generations later. Thus we have on the one hand the process of de novo mutation, which injects new variants into the population, and on the other hand the process of random drift, which purges old and new variants out of the population (letting, however, some lucky new ones survive and sometimes even replace the old ones, hence explaining why the genomes of different species progressively diverge from each other). The result is a steady-state equilibrium that is characterized by a predictable nucleotide diversity $\pi = \frac{4N\mu}{4N\mu + 1}$ (where µ corresponds to the mutation rate per generation), as well as by a predictable exponential frequency distribution of MAF. The larger the population size and the higher the mutation rate, the higher the nucleotide diversity. The frequency of derived alleles is typically correlated with their age.

The description of the combined effects of de novo mutation and random drift on the genetic polymorphism of populations is known as the neutral theory of molecular evolution. Although these two factors combined account to a considerable degree for the observed polymorphism, one obviously has to include selective forces for a full description and understanding of genetic variation. One typically distinguishes three categories of selective forces: negative or purifying selection, positive selection, and balancing selection. Purifying selection acts on genetic variants that have a deleterious effect on the genetic fitness of the individuals that carry them. This increases the probability that the variant will be purged from the population. As it is much more likely that a de novo mutation in a functionally important element of the genome will compromise its function rather than improve it, functionally important elements undergo the effects of purifying selection and therefore evolve more slowly.
This property has been exploited to identify functionally important elements in the human genome by comparing it with the genomes of other species (see above). Another clear manifestation of purifying selection is the observation that first and second codon positions are more often conserved amongst species than third codon positions. This is due to the fact that mutations of the third codon position are more likely to be synonymous (therefore not altering protein function) than mutations of the first and second positions of codons, which are nearly always non-synonymous. Sites undergoing purifying selection are typically characterized by lower π values, and by a shift of MAF towards lower values as well. The latter reflects the fact that negative selection makes it harder for (mildly) deleterious variants to increase in frequency in the population.

Exceptionally, de novo mutations generate variants that confer a selective advantage to their carriers. Such variants will be subject to positive selection. The advantage they confer increases their probability of fixation and accelerates their rate of fixation. The resulting “selective sweep” may leave a detectable signature of reduced local genetic variation, as the haplotype in which the favorable mutation was embedded is fixed with the mutation. A very convincing example of such a sweep in humans is the selection for regulatory mutations near the lactase (LCT) gene, which extend its intestinal expression, and hence lactose tolerance, past weaning. Such mutations were independently selected in populations that relied heavily on the consumption of dairy products in their diet, including in Northern Europe and in some Nilotic populations of Africa.

The third form of selection is referred to as balancing selection. Here, a genetic variant undergoes positive selection under some circumstances and negative selection under other circumstances, potentially leading to a situation in which the variant is maintained in the population at intermediate frequency over long periods of time.
One cause of balancing selection is heterozygote advantage or overdominance, in which the heterozygotes have a higher fitness than either homozygote. Overdominance underlies the unusually high frequency of sickle cell anemia, thalassemia and G6PD deficiency in regions where malaria is endemic, as the corresponding mutations in the α- and β-globin and G6PD genes confer resistance to the parasite in heterozygotes.

B. GENOMIC RECONSTRUCTION OF OUR EVOLUTIONARY HISTORY

Amongst the most exciting outcomes of the accumulation of increasing amounts of sequence information for humans and other organisms are the opportunities it provides to explore our evolutionary history. Comparing the genomic sequences of distant organisms and uncovering the remarkable commonalities in terms of gene organization and content demonstrates, beyond any reasonable doubt, that all organisms living on planet Earth descend from a common ancestor. It has clarified the relationship between humans and their closest primate relatives, the orangutan, gorilla and chimpanzees. It has demonstrated that the cradle of humanity lies in Africa. More recently, it has shown that part of our genome descends from other hominins with whom our ancestors cohabited. It allows for the study of the relationship between populations, including the determination of one’s geographical origin with remarkable precision.

The African Eve. For some time now, it has been possible to determine the partial if not complete sequence of the mitochondrial genome from individuals that represent as broad a panel of ethnic groups as possible. Using a variety of methods (including neighbor joining, parsimony and maximum likelihood methods), it is possible to generate a “gene tree” that attempts to reconstruct the most likely evolutionary history of the corresponding sequences, and hence indirectly of the corresponding ethnic groups. Assuming that substitutions accumulate at a constant rate (“molecular clock”), the time that has elapsed since the divergence of the different “leaves” in the tree can be inferred from the lengths of the branches that separate them.
The molecular clock is calibrated either indirectly, using indications from the fossil record about the time of specific splits in the tree, or directly, using the rate of de novo mutation estimated from sequencing pedigrees.

When applied to DNA samples from “anatomically modern humans” (AMH) representing a broad panel of ethnic origins, the approach generated a tree for which the majority of “thick” branches carried leaves corresponding to African samples only. The deepest “split” separated the click-speaking San hunter-gatherers from the other AMH. Its timing was recently re-evaluated at ∼250,000-300,000 years ago. All non-African samples in essence clustered on one “thick” branch only. When compared to samples originating from different parts of Africa, the mitochondrial genomes of Europeans, Asians and Amerindians are very similar to each other. The MRCA of all non-African samples was recently re-estimated to have lived ∼100,000 years ago. These findings provided strong support for the Recent African Origin (RAO) model for the origin of AMH. According to this model, AMH first appeared on the African continent at least 300,000 years ago. Europeans, Asians and Amerindians derive from an Eastern-African subpopulation that diverged from Africans ∼100,000 years ago, traversed the Horn of Africa and/or the Levantine corridor, and progressively migrated into the rest of the world. This “exodus” was accompanied by a strong reduction in genetic variation. The migrant population appears to have gone through a “genetic bottleneck” of ∼10,000 individuals or fewer. The split between the branches leading respectively to Europeans and Asians is now dated at ∼60,000 years ago. The alternative (yet less well supported) model for the origin of AMH is the model of multiregional evolution (MRE). According to MRE, modern features emerged independently in different continents from distinct populations of archaic humans that lived there (and for which there is fossil and archeological evidence), and were then combined by interbreeding to form AMH.
Admixture with archaic hominins: Neanderthal and Denisovan uncles. It has become possible to extract DNA from fossils that are ∼100,000 years old and to subject this DNA to NGS analysis. By doing so, the complete genomic sequence of several Neanderthal individuals has recently been generated. Applying the same approach to a small finger bone found in a cave in Denisova (Altai Mountains, Russia) indicated that it belonged to a distinct species of archaic hominins, hence called Denisovans. Comparing the Neanderthal and Denisovan DNA with that of AMH indicated that Neanderthals and Denisovans split ∼200,000 years ago, while AMH split from Neanderthals and Denisovans ∼500,000 years ago. Fossil and archeological records clearly indicate that Neanderthals, and probably Denisovans, still roamed the Eurasian continent ∼40,000 years ago. When spreading across Eurasia, the AMH diaspora may have encountered these and maybe other archaic hominins. Hence, the question that has intrigued paleontologists is whether these species interbred. To address this question, geneticists have so far mainly used the test called the “D statistic”. The D statistic compares the genomes of two AMH (e.g. H1 = African and H2 = European individual), of an archaic hominin (X; can be Neanderthal or Denisovan), and of an outgroup (e.g. C = chimpanzee). It counts the number of “BABA” ($n_{BABA}$) and “ABBA” ($n_{ABBA}$) occurrences. Sites are selected where the archaic genome (X) differs from the chimpanzee (C) genome and from one of the AMH genomes. The residue observed in C and in that AMH genome is considered ancestral, the other derived. One then compares the number of such occurrences where it is H1 versus H2 that shares the derived allele with X, using $D = \frac{n_{BABA} - n_{ABBA}}{n_{BABA} + n_{ABBA}}$. A priori, if H1 and H2 are equally unrelated to X, D should not deviate significantly from 0. However, what emerged from these analyses is that Europeans and Asians are more closely related to Neanderthals than are Africans, and that Melanesians are more closely related to Denisovans than all other AMH.
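The counting behind the D statistic can be sketched in a few lines. The example assumes that each site is given as a tuple of alleles in the order (H1, H2, X, C), so that “ABBA” means H2 shares the derived allele with X and “BABA” means H1 does. This reading order, the function name and the toy data are assumptions made for illustration; the sign convention follows the formula given above.

```python
def d_statistic(sites):
    """Compute D = (nBABA - nABBA) / (nBABA + nABBA).

    Each site is a tuple (h1, h2, x, c) of alleles observed in two modern
    humans, an archaic hominin and the chimpanzee outgroup. The chimpanzee
    allele is taken as ancestral.
    """
    n_abba = 0
    n_baba = 0
    for h1, h2, x, c in sites:
        if x == c:
            continue  # archaic genome carries the ancestral allele: uninformative
        if h1 == c and h2 == x:
            n_abba += 1  # H2 shares the derived allele with X
        elif h1 == x and h2 == c:
            n_baba += 1  # H1 shares the derived allele with X
        # sites where H1 and H2 carry the same allele are uninformative
    if n_abba + n_baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (n_baba - n_abba) / (n_baba + n_abba)


if __name__ == "__main__":
    toy_sites = (
        [("A", "G", "G", "A")] * 3    # ABBA: derived allele shared by H2 and X
        + [("G", "A", "G", "A")]      # BABA: derived allele shared by H1 and X
        + [("A", "A", "G", "A")]      # derived allele private to X
        + [("G", "G", "G", "A")]      # derived allele shared by everyone but C
    )
    print("D =", d_statistic(toy_sites))
```

With H1 African and H2 European, an excess of ABBA sites gives a negative D under this convention, which is the direction expected if Europeans carry more Neanderthal-derived alleles. In practice the significance of the deviation from 0 is assessed by resampling blocks of the genome, a step omitted here.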
From this, it was inferred that 1-4% of the genome of Europeans and Asians is derived, by interbreeding, from Neanderthals, and that 3.5% of the genome of Melanesians is derived, by interbreeding, from Denisovans! Specific Neanderthal- and Denisovan-derived genome segments (or haplotypes; see hereafter) present in our genome have now been identified and confirm this admixture hypothesis. What phenotypic properties, if any, are determined by the DNA that was inherited from archaic hominins remains unknown, but is a topic of great interest. A scenario that now emerges is that cohabitation with other hominins might have been a common condition for our AMH ancestors, both in and out of Africa, and that interbreeding may have been a part of life.

C. THE HAPMAP AND 1,000 GENOMES PROJECTS

Linkage disequilibrium. When simultaneously considering two genetic variants, say A and B, characterized respectively by alleles A, a and B, b, one can distinguish four allelic combinations or “haplotypes”: AB, Ab, aB and ab. Each individual inherits two such haplotypes, one from the father and the other from the mother, to generate a so-called “diplotype”. A double heterozygous individual (genotypes Aa and Bb) can thus have two distinct diplotypes (AB/ab or Ab/aB, where the “/” separates the two composite haplotypes) depending on the haplotypes inherited from his parents. If the genotypes and diplotypes of a sample of n individuals are known, one can measure allelic ($p_A$, $p_a$, $p_B$, $p_b$) as well as haplotype ($p_{AB}$, $p_{Ab}$, $p_{aB}$, $p_{ab}$) frequencies in the sample. If the haplotype frequencies do not differ significantly from the product of the corresponding allelic frequencies (e.g. $p_{AB} \approx p_A p_B$), the two variants are said to be in linkage equilibrium. If, on the contrary, the haplotype frequencies differ significantly from the product of the corresponding allelic frequencies, the two variants are said to be in linkage disequilibrium (LD) by an amount $D = p_{AB} - p_A p_B$. D fully characterizes the LD between two bi-allelic variants, i.e.
$D = p_{AB} - p_A p_B = -(p_{Ab} - p_A p_b) = -(p_{aB} - p_a p_B) = p_{ab} - p_a p_b$ (Table 2). LD is more often quantified using “normalized” measures of linkage disequilibrium that are derived from D, but have the useful feature that they range between 0 and 1, facilitating comparison between pairs of genetic variants. $r^2$ corresponds to $\frac{D^2}{p_A p_a p_B p_b}$. It has a value of 1 when the frequency of two haplotypes is 0; a situation called perfect LD (e.g. A is only associated with B, and a only with b). D’ corresponds to D divided by the most extreme value (of the same sign) it could have given the corresponding allelic frequencies. It reaches 1 when at least one haplotype frequency is 0; a situation called complete LD (e.g. A is only associated with B, although a can also be associated with B).

When a mutation generating a new allele (say A derived from a) occurs, the new A allele will only exist in association with the alleles at neighboring variants characterizing the chromosome upon which the mutation occurred (say allele B, for a variant with a B and a b allele). Thus, initially A only occurs on chromosomes carrying B, although aB haplotypes obviously also exist in the population. Initially, there is therefore complete LD between the A and B variants. However, as the new AB haplotype segregates in the population it may sometimes generate new Ab haplotypes, if paired with an ab haplotype at meiosis, and provided that a recombination occurred between the two variants. One can show that the LD, measured by D, will decay at every generation by an amount $D \theta$, where θ is the recombination rate between the two variants. Thus, recombination tends to break down the LD that resulted from the process of mutation. For distant variants (large θ), the LD will ultimately disappear. For closely “linked” variants, however, “random drift” will counteract the effect of recombination, resulting in a steady-state degree of LD with expected equilibrium value of $r^2 = \frac{1}{4N_e\theta + 1}$. As a consequence, variants that are very closely located in the genome tend to be in LD with each other.
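The LD measures defined above are easy to compute once the four haplotype frequencies are known. The sketch below follows the formulas in the text; the function names are hypothetical, and the decay function simply iterates the statement that D loses a fraction θ of its value each generation, ignoring drift.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, r2, D') from the four haplotype frequencies."""
    if abs(p_AB + p_Ab + p_aB + p_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab
    p_a = 1.0 - p_A
    p_B = p_AB + p_aB
    p_b = 1.0 - p_B
    if min(p_A, p_a, p_B, p_b) <= 0.0:
        raise ValueError("both variants must be polymorphic")
    D = p_AB - p_A * p_B
    r2 = D ** 2 / (p_A * p_a * p_B * p_b)
    # Most extreme value of D (same sign) allowed by the allele frequencies.
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(D) / d_max
    return D, r2, d_prime


def ld_after(d0, theta, generations):
    """D remaining after recombination has acted for some generations."""
    return d0 * (1.0 - theta) ** generations


def expected_r2(n_e, theta):
    """Expected equilibrium r2 between drift and recombination."""
    return 1.0 / (4.0 * n_e * theta + 1.0)


if __name__ == "__main__":
    print("perfect LD: ", ld_measures(0.5, 0.0, 0.0, 0.5))
    print("complete LD:", ld_measures(0.2, 0.0, 0.3, 0.5))
    print("D after 100 generations, theta = 0.01:", ld_after(0.25, 0.01, 100))
    print("expected r2, Ne = 10,000, theta = 0.0001:", expected_r2(10_000, 0.0001))
```

The two toy inputs reproduce the distinction made in the text: when only AB and ab exist, both $r^2$ and D’ equal 1 (perfect LD), whereas when A is only found with B but a occurs with both B and b, D’ is still 1 while $r^2$ drops well below 1 (complete but not perfect LD).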
The distance over which significant LD is detectable depends on the effective population size (i.e. $N_e$). For species with large effective population size (such as Drosophila), LD only extends over hundreds of base pairs. For species with small effective population size (such as humans), LD extends over tens of thousands of base pairs.

Haplotype blocks and recombination hotspots. LD has been measured systematically for large numbers of variants (particularly SNPs; including as part of the HapMap project, see hereafter) in different human populations. One of the striking features that emerged from these studies is that LD doesn’t decay monotonically with distance as expected from the theory described above, but rather in a “step-wise” fashion. The genome appears to be organized in blocks. Variants within a block are in high LD with each other, but not (or much less) with variants belonging to neighboring blocks. The typical size of such blocks in humans is of the order of 25 Kb (thus quite similar in size to the average gene, although there is no obvious coincidence between block and gene limits). A block may span hundreds of variants. Although these could in theory form a very large number of haplotypes ($2^n$, where n is the number of SNPs in the block), five to ten haplotypes per block typically account for > 95% of the chromosomes in most human populations. The corresponding blocks are typically referred to as “haplotype blocks”.

The block-like structure of the human genome (as of the genome of other species) results from the fact that recombination events tend to cluster within “recombination hotspots”, which mark the boundaries between adjacent haplotype blocks. Recombination hotspots harbor short sequence motifs that are recognized by the zinc finger domain of PRDM9, a master regulator of meiotic recombination in many mammals. Recombination hotspots evolve rapidly; as an example, the human and chimpanzee recombination landscapes have completely diverged.

The HapMap project.
In 2002, an international collaboration was launched to identify all the common variants, as well as the haplotypes they form, in three human populations: Africans, Asians and Europeans. ∼300 individuals were genotyped for more than 5 million common variants, and linkage disequilibrium patterns were analyzed across the genome. One of the main drivers of the HapMap project was to provide the information needed to develop SNP arrays that would effectively “tag” a large proportion of common variants. The private sector, particularly Affymetrix and Illumina, exploited this information and rapidly offered arrays allowing for cost-effective genotyping of hundreds of thousands to millions of common SNPs. These have been used very extensively to perform genome-wide association studies or GWAS, with the aim of detecting genetic risk factors for nearly all common complex diseases (see hereafter). A lot of these efforts rested on the “Common Disease Common Variant” (CDCV) hypothesis. According to this hypothesis, inherited predisposition to common complex diseases is due to specific combinations of common risk alleles with individually small effects at many risk loci. The alternative hypothesis, referred to as the “Common Disease Rare Variant” (CDRV) hypothesis, postulates that inherited predisposition to common complex diseases involves rare alleles with larger effects at fewer loci (for a given individual).

The 1,000 Genomes project.

3. The morbid human genome

A. NEUTRAL VERSUS FUNCTIONAL VARIANTS – GERMLINE VERSUS SOMATIC MUTATIONS

As mentioned in the previous chapter, ∼10 million common genetic variants and many more low frequency and rare variants segregate in the human population. Most of these probably have no, or virtually no (see hereafter), effect on phenotype. However, a minority of them do. These include (i) “coding variants” that affect the sequence of the protein that is encoded by the affected genes, as well as (ii) “regulatory” variants that affect gene switches and thereby the expression profile of the corresponding gene(s).
Coding variants primarily include missense, nonsense (“stop gains”), splice site and frame-shift variants, and large insertion-deletions. Synonymous variants affect the coding parts of genes without changing the amino-acid sequence. They will usually be harmless but may sometimes affect splicing if they fall into “splicing enhancers”. Variants that affect gene function and hence phenotype may exceptionally confer a selective advantage to their carriers. Usually, however, they will have a negative impact on phenotype and cause disease (they are therefore referred to as “causative variants”). It is worth noting that some variants may be advantageous in some circumstances and deleterious in others. As an example, variants that affect salt retention may have been advantageous when salt was scarce but presently cause hypertension in industrialized societies where salt is consumed in excess (cf. the thrifty gene hypothesis for diabetes).

Genetic variants that are said to segregate in the population are inherited by offspring from their parents via either the sperm cell or the oocyte (or both). The transmitting parent(s) will typically have inherited the variant themselves from their parents. Exceptionally, a variant may be transmitted by a parent to its offspring via the sperm cell or the oocyte, while the transmitting parent did not inherit the corresponding variant from either of its parents. This means that the new variant appeared by the process of “de novo mutation” in the germ-line of the transmitting parent. As mentioned before, sperm cells typically carry of the order of 60 such de novo mutations, while oocytes carry of the order of 30. Depending on when the de novo mutation occurred, the transmitting parent may be characterized by variable levels of “mosaicism”.
If the mutation occurred very late during spermatogenesis, it may only be present in very few sperm cells. If one genotypes bulk sperm DNA, one will not detect the mutation, as it represents too small a fraction of the total sperm DNA. If the mutation occurred earlier in the development of the germ-line, a substantial proportion of gametes may carry the mutation, leading to detectable levels of germ-line mosaicism. If the mutation occurred very early in the development of the parent (i.e. prior to the formation of the primordial germ cells), the de novo mutation may not only be detectable in the germ-line, but also in somatic cells. The individual will be both germ-line and somatic mosaic for the corresponding mutation.

Errors in DNA replication and repair that occur during mitosis will also generate de novo mutations in somatic cells that do not contribute to the germ-line. Such somatic mutations will never be transmitted to the next generation. There are so many cells and cell divisions that virtually all possible mutations must have occurred somewhere in our body. If they occurred early enough during development, or if they occur in an actively dividing tissue, they may generate detectable levels of somatic mosaicism. Somatic mutations may affect the health of cells and contribute to disease. The best-known example of the pathogenic effects of somatic mutations is cancer. Cancers are primarily due to the accumulation of somatic mutations that perturb “cancer drivers”, whether recessive loss-of-function mutations in anti-oncogenes or dominant gain-of-function mutations in oncogenes.

Inherited diseases can be subdivided into monogenic diseases, where variants at one single gene explain all or nearly all phenotypic differences, and polygenic (also called complex or multifactorial) diseases, which depend on a large number of genetic risk variants and environmental risk factors.
An additional class of oligogenic diseases is sometimes considered, involving a small number (> 1) of genes. This terminology is sometimes used when studying modifier genes, which may affect the expressivity (e.g. severity) or penetrance of the disease.

B. MONOGENIC DISEASES

Monogenic diseases are typically severe disorders that are individually rare but cumulatively affect ∼1% of the population, and are therefore an important health concern. They include autosomal recessive, autosomal dominant, X-linked recessive and X-linked dominant diseases. There is an overrepresentation of consanguineous marriages amongst families suffering from autosomal recessive conditions. Autosomal dominant conditions often involve de novo mutations for which one of the parents (usually the father) may be germ-line mosaic. Men are more often affected by X-linked recessive conditions, while women are more often affected by X-linked dominant conditions. Monogenic diseases may be characterized by incomplete penetrance (i.e. not all individuals with the “affected” genotype will suffer from the disease) and variable expressivity (i.e. not all individuals with the “affected” genotype will suffer an equally severe form of the disease). The penetrance may be a function of age. As an example, the penetrance of Huntington’s disease increases with age until reaching ∼100% at 60 years of age. The notion of phenocopy implies that very similar conditions may involve different genes or even non-genetic causes.

Recessive diseases are typically caused by recessive “loss-of-function” (LoF) variants in essential genes. As their name implies, LoF variants destroy the gene that they affect. LoF variants are strongly enriched in stop-gain, splice site and frame-shift variants and in large deletions. Not all LoF variants will cause disease in homozygotes. As a matter of fact, we all carry of the order of 120 LoF variants, of which only ∼5 will cause disease.
To cause a severe disease in homozygotes, LoF variants need to damage essential genes. Approximately 30% of our genes are thought to be essential, i.e. we need at least one intact copy of such genes to survive. As the majority of LoF variants are very rare in the population, homozygosity (or compound heterozygosity) for LoF variants at any of our ∼7,500 or so essential genes is relatively rare (of the order of 1%), unless the parents are closely related (i.e. consanguineous marriages). A majority (∼4.5/5) of LoF variants in essential genes are thought to cause embryonic lethality, while the remainder (∼0.5/5) will cause a “detectable” recessive defect in at least some homozygous individuals. Recessive defects include the “inborn errors of metabolism”, a term coined by Garrod. In humans, monogenic diseases are typically characterized by allelic heterogeneity, with a few relatively common and many rare alleles. Most monogenic diseases are very rare conditions, with few exceptions. One of these is cystic fibrosis, for which 5% of Europeans are carriers. It has been postulated that this may be due to “heterozygous advantage”: carriers may be more resistant to diseases involving loss of body fluids, typically diarrhea. Other well-known examples of “heterozygous advantage” are the increased resistance to malaria of carriers of causative variants for sickle-cell anemia and thalassemia.

Dominant diseases typically result from “gain-of-function” mutations (including constitutively activated receptors) or large deletions. De novo germ-line mutations (often with mosaicism) underlie a substantial proportion of cases with unaffected parents. Dominant conditions (including childhood retinoblastoma and multiple sporadic venous malformations) may involve “first hit” germ-line LoF variants that are transmitted by parents to affected offspring, combined with “second hit” somatic mutations that lead to the loss of the remaining allele in the affected tissue.
The second hit may be a de novo somatic LoF mutation in the remaining allele, or “loss-of-heterozygosity” that can result from various chromosomal aberrations (mitotic recombination, chromosome loss, ...). Examples of such disorders include childhood retinoblastoma and specific forms of venous malformations.

Between ∼1985 and ∼2005, genes (and causative variants therein) underlying monogenic diseases were identified by the tedious process of “positional cloning”. The culprit genes were first mapped by linkage analysis using genome-wide panels of microsatellite or SNP markers in affected families. Whenever possible, the initial mapping was followed by fine-mapping using linkage disequilibrium information (akin to association mapping; see hereafter). Cloning and sequencing of the “critical interval” then led in some instances to the identification of the causative gene and mutations. Despite the tedious nature of this approach, the causative gene and mutations had been identified for ∼2,000 monogenic conditions by 2003.

The advent of Next Generation Sequencing, and the ensuing possibility to sequence the whole genome at ∼30-fold depth for ∼1,000 Euros, has dramatically improved this situation. Sequencing the whole genome of a few affected individuals and some relatives will, in ∼50% of cases, lead to the identification of the causative gene and mutations. The key source of information is the finding of rare, severe LoF variants in the same gene in enough independent cases that the probability of such a finding occurring by chance alone becomes very low, and the finding hence statistically significant. The corresponding statistics have to account for the unexpected finding that all of us carry LoF variants in approximately 120 genes (with both alleles affected for 20 of these), hence an order of magnitude more than the expected number of “lethal equivalents” (∼5) carried by each individual (vide supra). By now, causative mutations in ∼3,000 genes underlying > 4,000 monogenic conditions have been discovered, and these numbers are increasing rapidly.
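The chance argument above can be made concrete with a small calculation. The sketch below is only illustrative. It assumes, crudely, that the ∼120 LoF-carrying genes per person are spread uniformly over ∼25,000 genes (7,500 essential genes being ∼30% of the total), that the cases are unrelated, and that a Bonferroni correction over all genes is adequate. The function names are hypothetical.

```python
from math import comb

N_GENES = 25_000             # assumption: ~7,500 essential genes are ~30% of all genes
LOF_GENES_PER_PERSON = 120   # from the text: genes carrying a LoF variant per individual


def chance_recurrence(n_cases: int, k_carriers: int, q: float) -> float:
    """P(at least k of n unrelated cases carry a LoF variant in one given gene),
    if each case carries one with probability q (binomial upper tail)."""
    return sum(
        comb(n_cases, k) * q**k * (1.0 - q) ** (n_cases - k)
        for k in range(k_carriers, n_cases + 1)
    )


def genome_wide(p_gene: float, n_genes: int = N_GENES) -> float:
    """Bonferroni-style correction for having looked at every gene."""
    return min(1.0, p_gene * n_genes)


q = LOF_GENES_PER_PERSON / N_GENES  # crude per-gene carrier probability
for k in (2, 3, 4):
    p = chance_recurrence(5, k, q)
    print(f"{k} of 5 cases share a LoF gene: per-gene p = {p:.2e}, "
          f"genome-wide p = {genome_wide(p):.2e}")
```

Under these assumptions, two carriers out of five cases are unremarkable once all genes are considered, whereas four out of five are very unlikely to arise by chance. A real analysis would use gene-specific LoF frequencies rather than a uniform value.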
Discovering the causative genes and mutations underlying monogenic diseases informs about gene function, sometimes allows for the development of targeted therapies, and always offers the somewhat ethically controversial but widely applied solution of prenatal diagnosis.

C. POLYGENIC OR COMMON COMPLEX DISEASES

Most heritable phenotypes in humans are not monogenic. These include traits like height, weight and IQ, but also predisposition to common complex diseases such as hypertension, diabetes (type I and II), cancer, inflammatory bowel disease, schizophrenia, autism, etc. How do we know that these phenotypes are heritable? First indications come from the “unusual” resemblance between parents and offspring. For quantitative traits with continuous Gaussian distribution (including height, weight, IQ and hypertension), the “regression” between the mean phenotype of the parents and that of the offspring is a measure of the heritability (h²), i.e. the proportion of the trait variance that is due to genetic differences between individuals. For binary traits, such as common complex diseases (involving cases and controls), the relative risk for relatives of affected individuals (e.g. sibs) when compared to the incidence in the general population is used as a measure of the importance of genetics. A problem in human genetics is that relatives (e.g. parents and their offspring) not only share genes but also a common environment. It is therefore difficult to evaluate to what extent the observed resemblance between relatives results from the sharing of genes rather than environment. To overcome this issue, human geneticists study adoptees or twins, or rely on novel molecular approaches. By studying the resemblance between parents and offspring that have been raised in different environments, one can to some extent overcome the problem of the confounding of genes and environment.
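The parent-offspring regression mentioned above can be illustrated with simulated data. This is a minimal sketch, not a method taken from the text: it assumes a purely additive trait with no shared environment, in which case the slope of offspring phenotype on the mean of the two parents estimates h². The population mean (165 cm), standard deviation (10 cm), the value h² = 0.8 and all function names are illustrative assumptions.

```python
import random


def regression_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_families(n, h2, mean=165.0, sd=10.0, seed=0):
    """Toy data: offspring regress toward the mean by a factor h2 (assumed model)."""
    rng = random.Random(seed)
    midparent_sd = sd / 2 ** 0.5  # mean of two unrelated parents
    # residual spread chosen so that offspring keep the population variance
    resid_sd = (sd**2 - (h2 * midparent_sd) ** 2) ** 0.5
    midparents, offspring = [], []
    for _ in range(n):
        mp = rng.gauss(mean, midparent_sd)
        midparents.append(mp)
        offspring.append(rng.gauss(mean + h2 * (mp - mean), resid_sd))
    return midparents, offspring


midparents, offspring = simulate_families(n=5000, h2=0.8)
print(f"estimated h2 = {regression_slope(midparents, offspring):.2f}")
```

If parents and offspring also share an environment, the same slope overestimates h², which is the confounding problem described above.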
An estimate of heritability can also be obtained from the increase in the rate of concordance between monozygotic twins when compared to dizygotic twins. One can also estimate the heritability from the higher phenotypic correlation observed for sibs that have a higher degree of genome sharing (full-sibs share on average 50% of their alleles, but there is variation around this mean). To estimate their heritability, binary traits are often modeled as reflecting the distribution of an unobserved, underlying “liability” with Gaussian distribution, and a threshold value such that individuals with a liability value above the threshold are affected and others not. It appears from these studies that traits such as height, weight and IQ, as well as the liability for many complex diseases, are unexpectedly “heritable”. The heritability of height, for instance, is estimated to be > 85%, while the heritability of Crohn’s disease is > 50%.

Given the fact that these traits, including predisposition to common complex diseases, are at least in part heritable, it becomes possible to use “positional cloning” strategies to identify at least some of the underlying causative genes and variants. There are at least two reasonable motivations to do this. The first is to gain a better understanding of the molecular mechanisms underlying the corresponding diseases. This may lead to the identification of new targets for drug development or to the “repositioning” of existing drugs, at a time when the pharmaceutical industry faces major difficulties in identifying new effective drugs. The second is the perspective of “predictive medicine”, i.e. the identification, prior to disease onset, of individuals that are at higher risk of developing the disease, in order to encourage them to adopt preventive measures more actively to reduce the risk.

Initial attempts to find genes underlying common complex diseases based on specific implementations of family-based approaches (such as the affected sib pair method) met with very little if any success.
An alternative approach, dubbed Genome Wide Association Study (or GWAS), was proposed and predicted to have increased detection power. It was based on performing association studies in case-control cohorts using large numbers of genetic variants covering the entire genome. An association study in a case-control cohort is extremely simple in principle, and consists in comparing the allelic frequency of a given variant between a group of cases and a group of properly matched (especially for ethnicity) controls. Finding a statistically significant difference suggests either that the variant is a causative variant for the studied disease, i.e. that it perturbs a gene, thereby increasing disease risk (direct association), or, more likely, that the interrogated variant is “associated with”, or in linkage disequilibrium (LD) with, one or more causative variants in its vicinity (indirect association). In humans, LD typically extends over distances of the order of tens of Kb (cf. Chapter 2). Ideally, one would want to interrogate all the existing genetic variants in the genome to perform GWAS, i.e. sequence the genome of all cases and all controls. This will remain impractical for some time. In the meantime, geneticists involved in GWAS typically interrogate 300,000 to 1 million common SNPs that are scattered across the genome and jointly “tag” the majority of common haplotypes. Their hope is that this set of SNPs will allow them to detect common causative variants, mostly by indirect association. Interrogating 1 million SNPs at once in a cost-effective manner has become possible thanks to the development of “SNP microarrays”, mainly by Affymetrix first, and then by Illumina. Information about state-of-the-art technologies underlying SNP genotyping with microarrays can be found at http://www.illumina.com/technology/beadarray-technology/infinium-hdassay.html.

A first wave of GWAS was published starting in 2006. These studies would typically involve of the order of 1,000 cases and 1,000 controls.
The results would be summarized as “Manhattan plots”, showing, for the entire genome at once, the strength of the association (expressed as log(1/p), where p is the probability of obtaining the observed association by chance alone) as a function of the genomic position of the interrogated SNPs. To avoid false positive reports, the field imposed very stringent significance thresholds on itself. To account for multiple testing, the threshold to declare statistical significance was set at p = 10^-8, i.e. log(1/p) = 8 (base-10 logarithm). Moreover, publication in high-profile journals required replication of the initial finding (in the discovery cohort) in an independent replication cohort of at least equal size. Initial GWAS typically reported 1 to 2 confirmed, genome-wide significant risk loci, which was considered a major breakthrough. However, the odds ratios (used as a surrogate for relative risk [1]) of the most strongly associated variants were typically of the order of 1.2-1.5 only, which is very low. As a matter of fact, the power to detect such small effects was shown to be very low given the size of the cohorts used, indicating that many more effects were likely to exist (if you detect one signal for which you have a detection power of 10%, it suggests that ∼10x more such effects went undetected).

This first wave of individual GWAS was therefore logically followed by a second wave of “meta-analyses”, in which individual case-control cohorts for a given disease were merged into larger cohorts to increase statistical power. GWAS with tens to hundreds of thousands of cases and an equivalent number of matched controls are now common. These led to the discovery of tens of risk loci for nearly all studied diseases. As an example, more than 200 risk loci have now been reported for Inflammatory Bowel Disease. An updated catalogue reporting the results of all published GWAS can be found at: http://www.ebi.ac.uk/gwas/.
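The allelic case-control comparison and the significance threshold described above can be sketched as follows. The allele counts are hypothetical, and the 1-degree-of-freedom chi-square test on a 2x2 table of allele counts is one common choice, not necessarily the test used in the studies referred to. The function name is an assumption made for illustration.

```python
from math import erfc, log10, sqrt


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (odds ratio, chi-square, p-value) for a 2x2 table of allele counts."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    cases, ctrls = case_alt + case_ref, ctrl_alt + ctrl_ref
    alts, refs = case_alt + ctrl_alt, case_ref + ctrl_ref
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (cases * ctrls * alts * refs)
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 degree of freedom
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return odds_ratio, chi2, p_value


THRESHOLD = 8.0  # genome-wide threshold quoted in the text, as log10(1/p)

# Hypothetical SNP: 1,000 cases and 1,000 controls (2,000 alleles each),
# risk-allele frequency 0.358 in cases versus 0.300 in controls.
odds_ratio, chi2, p = allelic_association(716, 1284, 600, 1400)
score = log10(1.0 / p)
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.1f}, log10(1/p) = {score:.1f}")
print("genome-wide significant" if score >= THRESHOLD else "below the genome-wide threshold")
```

With these made-up counts the odds ratio is about 1.3, yet the signal stays well below the threshold, which illustrates why cohorts of ∼1,000 cases and controls had little power for effects of this size.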
It is important to realize that, with few exceptions, the causative genes and variants remain unknown for the vast majority of risk loci (across all diseases). The only causative genes that are known are those with strongly associated LoF variants, which are mostly exceptions. It is indeed believed that the majority of causative variants for common complex diseases are regulatory variants, which are more difficult to identify, as are the genes whose expression they perturb. GWAS typically map risk loci to chromosome regions of ∼250 Kb encompassing ∼5 genes on average (range: 0 to >50) and thousands of genetic variants. Much effort is presently devoted to fine-mapping the identified risk loci (to identify the actual causative variants) and to identifying the causative genes. Fine-mapping relies on the use of sophisticated “multivariate” statistical approaches. In these, the effect of each variant is evaluated conditional on the effects of the other variants. Identifying the causative genes relies on the integration of multiple sources of genomic information. Examples include the search for “networks” of connected genes amongst those that map to GWAS-identified risk loci, the use of eQTL information, and the use of epigenetic information including HiC data. The first approach tries to find sets of genes that are connected by co-citation, co-expression or interactome data amongst the genes within risk loci. Various statistical approaches can be used to measure the significance of the identified networks (i.e. what is the probability of finding such a large and highly interconnected network by chance alone given the number of tested genes). If such a network is found, it increases the probability that the genes composing it are the actual causative genes. The second approach uses eQTL information, where eQTL stands for “expression QTL”. Transcriptome data are generated for a large number of individuals in multiple cell types.
The same individuals are genotyped for SNPs spanning the entire genome (same SNP arrays as those used for disease GWAS). One can then, for each gene, test whether the genotype (say AA, AG and GG) at a given SNP affects its expression level. In other words, does the mean expression level of the tested gene differ between individuals sorted by SNP genotype? One typically first tests the effect of SNPs on genes that are located in their vicinity, say no further than 500 Kb apart. The underlying hypothesis is that the interrogated SNP, or more likely a SNP in LD with it, affects a genetic switch controlling that gene. If such an effect is found, it is dubbed a “cis-eQTL”. One can also test whether SNPs affect the expression of genes that are located further away or even on other chromosomes. The idea underlying this test is that the SNP, or a variant in LD with it, affects a trans-regulator of the target gene such as a transcription factor or a miRNA. If such an effect is found, one talks about a “trans-eQTL”. To find causative genes in GWAS-identified risk loci, one searches for cis-eQTL that are determined by causative variants for the disease. If the same variant is independently associated with both the disease and a cis-eQTL effect, it is reasonable to speculate that the effect on disease predisposition is mediated by the observed effect on gene expression.

[1] The Relative Risk corresponds to the ratio of the probability (risk) of being sick when having genotype AB to the probability (risk) of being sick when having genotype AA, where A is the reference allele. The Odds Ratio corresponds to the ratio between two odds. The first “odds” is the probability of being sick when having genotype AB divided by the probability of being healthy when having genotype AB (which is 1 minus the probability of being sick). The second “odds” is the probability of being sick when having reference genotype AA divided by the probability of being healthy when having reference genotype AA.

The third approach relies on HiC data.
HiC data in essence identify a number of “gene switches” together with the genes that they regulate in a specific cell type. If a causative disease variant identified by fine-mapping maps to such a switch, the corresponding gene stands out as a strong candidate causative gene. Final proof of variant and gene causality is difficult to obtain. Introducing the candidate variants in cells using for instance CRISPR/Cas9 technology and reproducing the effects on gene expression is strong evidence that the tested variant is indeed “functional”. Proof of gene causality is typically sought by performing a “burden test”. This is typically done by sequencing the exons of the corresponding gene in large case-control cohorts and searching for a differential burden of rare disruptive mutations in cases and controls. As an example, when sequencing the NOD2 gene in ∼300 cases of Crohn’s disease and ∼300 healthy controls, Hugot et al. [2] observed that approximately 15% of the cases carried rare non-synonymous variants in the NOD2 gene, while this was only observed in 5% of healthy controls. This was very strong confirmatory evidence that the NOD2 gene was indeed a causative gene for Crohn’s disease. The information that is sought is independent of the primary GWAS signal, which typically relies on the association with disease of common variants. The burden test extracts information from low-frequency and rare variants. This implies that there is allelic heterogeneity at each one of the causative genes, which is a reasonable, yet unproven assumption.

As mentioned above, GWAS have allowed for the identification of tens to hundreds of risk loci for nearly all studied diseases, and this is a major achievement. However, the relative risk conferred by individual risk alleles is typically of the order of 1.1-1.2. This is very low.
As a matter of fact, some people have questioned the value of the GWAS findings because of these small effects. It is important to remember, however, that small effects of the risk variants that segregate in the population do not imply that pharmacological intervention on the same gene or pathway cannot have a large effect. Indeed, the common variants that underpin the initial association have had to withstand the effects of natural selection. Variants with large effects may have been wiped out of the population by selection. Accordingly, rare variants in the same gene often have much larger effects or relative risks. As an example, variants disrupting IL23R have been found by GWAS to protect against IBD, but the corresponding relative risks are small, of the order of 1.5. Yet targeting the IL23R pathway pharmacologically appears to have a large therapeutic effect.

[2] Hugot JP, Chamaillard M, Zouali H, Lesage S, Cézard JP, Belaiche J, Almer S, Tysk C, O'Morain CA, Gassull M, Binder V, Finkel Y, Cortot A, Modigliani R, Laurent-Puig P, Gower-Rousseau C, Macry J, Colombel JF, Sahbatou M, Thomas G. Association of NOD2 leucine-rich repeat variants with susceptibility to Crohn's disease. Nature. 2001 May 31;411(6837):599-603.

If one considers all identified risk variants, their corresponding relative risks and frequencies in the population, one can calculate which fraction of the heritability they account for. When performing this calculation for all studied diseases, what comes out systematically is that the identified risk variants only explain a relatively small proportion of the heritability or inherited risk, typically of the order of 25%. This raises the question of the molecular nature of the “missing heritability”. Several hypotheses have been proposed to account for the missing heritability.
Today, the one that is best supported by data is the “quasi-infinitesimal” architecture of common complex diseases. This model posits that the genetic part of the underlying liability for common complex diseases is determined by a very large number of variants with individually very small effects. The risk variants identified by GWAS would only be the “tip of the iceberg”. To detect the remaining ones, one would need to study even larger case-control cohorts. As a matter of fact, increasing the sample size always results in the identification of more risk loci, consistent with the model. Also, “polygenic models”, which estimate a genetic distance between pairs of individuals based on SNP information across the entire genome and correlate this distance with phenotypic distance, support the validity of the quasi-infinitesimal model.

Another noteworthy finding from GWAS is the overlap between risk loci for different diseases that were previously thought to be independent. There thus appears to be a lot of pleiotropy amongst risk loci for common complex diseases. The effects are sometimes concordant (i.e. the allele that increases risk for disease A also increases risk for disease B), but equally commonly discordant.

If we assume that we knew all the causative risk variants for a given common complex disease, what would the diagnostic value of such a set of variants be? To gain some feeling for this, we performed simulations, first using height as an example. Height is viewed as the paradigmatic example of a polygenic trait in humans.
Height has a heritability of ∼85%, which is determined by thousands of genes. We assumed a population with a mean height of 165 cm and a standard deviation of 10 cm. Thus, if you have no information about the height of a specific individual from that population, your best guess is the mean (i.e. 165 cm), but you will on average be off by 10 cm. Imagine that you knew all the genes that control height and the genotype of the individual for these genes: you would be able to make a better guess. For height, with its heritability of 85%, the standard error of the guess becomes ∼2.5 cm: not bad! However, that assumes that you know all the genes involved. What if you only knew half of the genes involved? Your standard error would then rise from ∼2.5 to 8 cm: not good! Thus, if your grasp of the genetics of the trait drops, your predictive ability drops even more, and rapidly becomes poor. For most complex diseases, the heritability is lower and the understanding of the genetics still poor (≤25% of the heritability). The predictive ability is therefore very poor.

Imagine that one wanted to rank individuals based on their genetically determined liability using SNP information. One could then imagine restricting reimbursement of expensive diagnostic procedures (or imposing these tests?) to the 10% of individuals that are predicted, on the basis of such genetic tests, to be the most at risk. The utility of such tests can be evaluated from their sensitivity as well as their positive predictive value (or precision). The sensitivity measures the proportion of the individuals who will develop the disease that are found positive for the test. The precision measures the proportion of individuals found positive for the test that will ultimately develop the disease. One can easily show that, with the present understanding of the genetic architecture of common complex diseases, sensitivity and precision are abysmal: many individuals that will suffer from the disease will be missed by the test, and many of the individuals found positive would never have developed the disease. Does that mean that such tests have no value and therefore no future? Maybe not.
Even if the predictive ability appears very poor when considered from the point of view of the individual patient, the cost savings that may be achieved at the population level by using this information may under certain circumstances be considerable. Weighing the benefits to the individual against those to the community is likely to be critical when considering the use of genomic testing in the clinic.

D. SOMATIC VERSUS GERMLINE MUTATIONS: CANCER

The prevailing view of cancer is that it is a “genetic” disease, in the sense that what fundamentally differentiates cancerous from normal cells are accumulations of mutations in the genome of the former.

In some rare familial cancers, such mutations may be inherited from one of the parents. Well-known examples of such inherited cancers are familial retinoblastoma involving mutations in the RB1 gene, hereditary breast and ovarian cancer involving mutations in the BRCA1 and BRCA2 genes, familial adenomatous polyposis involving mutations in the APC gene, and hereditary non-polyposis colon cancer involving mutations in mismatch repair genes. Familial cancers are typically transmitted in an autosomal dominant manner. Even in these rare familial instances, cancer progression requires the accumulation of additional somatic mutations, including (but not only) in the non-mutated allele of the corresponding genes (Knudson’s two-hit hypothesis).

Most cancer cases are “sporadic”: the relative risk of relatives is not increased. Such sporadic cancers are thought to result from the accumulation, in specific cell lineages, of somatic mutations in “cancer driver genes”. Each new cancer-driving mutation is thought to confer a selective advantage to the corresponding clone, in the sense that it causes the clone to either proliferate more aggressively or disseminate in the body (metastasis). Cancers typically accumulate mutations in “mutator” genes, which cause the somatic mutation rate to increase.
In addition to the driving mutations in oncogenes, cancers therefore typically accumulate a large number of passenger mutations. Thus, not all mutations detected in tumors contribute to the cancer phenotype. The genomes of thousands of tumors have now been sequenced as part of large collaborative projects. The comparison of the healthy genome of the patient (obtained by sequencing DNA extracted from healthy tissue) with that of the tumor allows for the identification of somatic mutations that have accrued in the tumor. By comparing the lists of somatic mutations across large numbers of tumors of the same tissue, one can identify genes that are more often affected by such mutations than expected if the mutations occurred at random in the genome. Thus one attempts to identify genes that are preferentially affected by somatic mutations in tumors. This enrichment is assumed to reflect that mutations in these genes contribute to cancer progression and that the corresponding genes are “cancer drivers”. Hundreds of cancer drivers have been identified using this approach. These studies have revealed the remarkable heterogeneity of cancers: there are plenty of distinct ways for a healthy cell to become cancerous; each tumor is nearly unique. It is possible that in the near future, the genome of tumors will be systematically and fully sequenced in the hope of identifying the driver genes that are “activated” by mutations in the corresponding tumor. This will in some instances allow for personalized therapies. Examples of such targeted therapies already exist. They include (i) trastuzumab (Herceptin), targeting the human epidermal growth factor receptor 2 protein (HER-2), which is expressed at high levels as a result of mutations in some breast and stomach cancers, (ii) vemurafenib (Zelboraf), targeting a mutated form of BRAF (V600E) which is found in some metastatic melanomas, and (iii) imatinib mesylate (Gleevec), a tyrosine-kinase inhibitor targeting the BCR-ABL fusion protein found in some leukemias.
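Returning to the enrichment test used above to nominate cancer drivers, the idea of comparing observed with randomly expected mutation counts can be sketched as below. This is a deliberately simplified illustration: it assumes a single uniform background mutation rate per base pair, treats the number of mutated tumors per gene as a Poisson count, and applies a Bonferroni correction. The gene names, lengths, counts, rate and function names are all hypothetical, and real analyses model gene-specific and sequence-context-specific rates.

```python
from math import exp, lgamma, log


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summing the tail terms directly."""
    if k <= 0:
        return 1.0
    term = exp(-lam + k * log(lam) - lgamma(k + 1))  # P(X = k)
    total, i = 0.0, k
    while term > 0.0 and (total == 0.0 or term > 1e-17 * total):
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
    return min(1.0, total)


def driver_candidates(mutated_tumors, gene_length_bp, n_tumors, rate_per_bp, alpha=0.05):
    """Flag genes hit in more tumors than expected under a uniform background rate."""
    n_genes = len(mutated_tumors)
    hits = {}
    for gene, observed in mutated_tumors.items():
        expected = rate_per_bp * gene_length_bp[gene] * n_tumors
        p = poisson_upper_tail(observed, expected)
        if p * n_genes < alpha:  # Bonferroni over the genes tested
            hits[gene] = p
    return hits


# Hypothetical cohort of 500 tumors and three made-up genes.
observed = {"GENE_A": 40, "GENE_B": 105, "GENE_C": 6}
lengths = {"GENE_A": 2_000, "GENE_B": 100_000, "GENE_C": 5_000}
candidates = driver_candidates(observed, lengths, n_tumors=500, rate_per_bp=2e-6)
print(sorted(candidates))
```

Note that the long gene with the most mutated tumors is not flagged, because its count is close to what its length alone predicts.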
Somatic mutations are not the only conceivable mechanism that may lead to heritable perturbation of oncogenes. There is considerable evidence that oncogenes may also be affected in tumors by “epimutations”. In such cases, their DNA sequence is not altered but their methylation profile or local histone code may be modified. This may affect the expression of the corresponding gene, which may contribute to cancer progression. Another possibility, which is generally neglected today, is the activation of positive feedback loops. What differentiates lytic from lysogenic lambda phages is not their genome (which is identical), but the expression or not of the lambda repressor protein. Once the lambda repressor is expressed, it will block the expression of the genes needed for lysis while activating its own expression. It is conceivable that tumor cells also differ from normal cells by the inherited activation or downregulation of equivalent positive feedback loops.

Perturbations of the stroma, i.e. changes of the cellular and extracellular microenvironment in which the tumor develops, are increasingly considered as playing an important role in tumor progression. The most likely hypothesis is that tumor cells accumulate mutations that allow them to send signals to their surroundings, inducing stromal changes that in turn promote cancer growth. A subclone of the tumor may harbor such mutations, which may ultimately benefit other subclones without these mutations “in trans”. One can also imagine that the signaling induces heritable epigenetic changes in the stromal cells, such that the tumor-promoting effects of the stromal cells are maintained even in the absence of the original inducing subclone.