Essential human genomics
Michel Georges
Table of Contents
1. The reference human genome
a. Basic anatomy of the human genome
i. Genome size and composition
ii. Chromosomes: number, autosomes, sex chromosomes, bands
iii. Coding and non-coding genes: numbers, types, structure
iv. Repetitive sequences
v. Epigenetics
b. Obtaining a reference sequence of the human genome
i. From Sanger to Next Generation Sequencing
ii. Shotgun sequencing of mate-pair libraries
iii. Linkage maps, radiation hybrid maps and BAC contigs
c. Annotating the reference sequence
i. Masking repetitive sequences
ii. Establishing a gene catalogue
1. Identifying cis-acting regulatory elements: evolutionary constraints, ChIP-Seq, DNase I hypersensitivity.
iii. 3-D structure
iv. Interactome
v. Genome browsers
2. Individual human genomes
a. Genetic variants: SNPs, indels, CNVs and other structural variants
i. Types of genetic variants
ii. Minor allele frequency, nucleotide diversity and number of polymorphic sites
iii. Germ-line mutations, drift and natural selection
b. Genomic reconstruction of our evolutionary history
i. The African Eve
ii. Admixture with archaic hominins - Neanderthals and Denisovans
c. The HapMap and 1,000 Genomes projects
i. Linkage disequilibrium
ii. Haplotype blocks and recombination hotspots
iii. The HapMap project
iv. The 1,000 Genomes project
3. The morbid human genome
a. Neutral versus functional variants – germline versus somatic mutations
b. Monogenic diseases
c. Polygenic or common complex diseases
d. Somatic versus germline mutations: cancer
1. The Reference Human Genome
A. BASIC ANATOMY OF THE HUMAN GENOME
Genome size and composition. The haploid human genome comprises ∼3 × 10^9
base pairs. This amounts to approximately 2 m of DNA in every one of the ∼10^13
diploid cells that constitute our body. The human genome is comparable in size
to that of all other mammals. It is approximately ten times larger than the
genomes of the multicellular worm C. elegans and the fly D. melanogaster, 100
times larger than that of the unicellular eukaryotic yeast S. cerevisiae, and 1,000
times larger than that of the unicellular prokaryote E. coli.
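
The "2 m of DNA per cell" figure can be checked with a short calculation. The sketch below assumes a rise of 0.34 nm per base pair (the usual value for B-form DNA, not stated in the text); the function names are illustrative only.

```python
# Back-of-the-envelope check of the "about 2 m of DNA per diploid cell" figure.
HAPLOID_GENOME_BP = 3e9    # from the text: ~3 x 10^9 base pairs
CELLS_IN_BODY = 1e13       # from the text: ~10^13 diploid cells
RISE_PER_BP_M = 0.34e-9    # assumption: B-form DNA, 0.34 nm per base pair


def dna_length_per_diploid_cell_m(haploid_bp=HAPLOID_GENOME_BP,
                                  rise_m=RISE_PER_BP_M):
    """Length in metres of the nuclear DNA of one diploid cell."""
    return 2 * haploid_bp * rise_m


def total_dna_length_m(cells=CELLS_IN_BODY):
    """Length in metres of all nuclear DNA in the body, laid end to end."""
    return cells * dna_length_per_diploid_cell_m()


per_cell = dna_length_per_diploid_cell_m()
print(f"{per_cell:.2f} m of DNA per diploid cell")
print(f"{total_dna_length_m():.1e} m of DNA in the whole body")
```

Under these assumptions the per-cell value comes out at roughly 2 m, in line with the figure quoted above.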
The human genome comprises approximately 40% G-C base pairs and 60% A-T
base pairs, a steady-state equilibrium between the preferential loss of G-C pairs by
mutation and the preferential gain of G-C pairs by meiotic recombination. Local
fluctuations around this mean depart very significantly from random expectations at
different scales.
Base-pair composition is characterized by a strong di- and even tri-nucleotide
dependence. The 5’-CpG-3’ dinucleotide, in particular, is strongly
underrepresented across the genome, except within so-called CpG islands, which
mark the 5’ end of a high proportion of genes.
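
As an illustration of how G-C content and CpG depletion can be quantified, here is a minimal sketch. The observed/expected ratio used (CpG count × length / (C count × G count)) is a common convention for scoring CpG islands, not something specified in these notes, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G content.

    Assumed convention: expected = (#C * #G) / length. Values well below 1
    indicate CpG depletion; CpG islands score much closer to 1.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = c_count * g_count / len(seq)
    return observed / expected


print(gc_fraction("ATGCATTA"))
print(cpg_obs_exp("ACGTTTCATG"))
```

Applied in sliding windows along a chromosome, the same two functions would expose both the local G-C fluctuations and the CpG islands mentioned above.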
Chromosomes. The haploid human genome is subdivided over 22 linear
autosomes, the X and Y linear sex chromosomes, and the small (∼16.5 kb)
circular mitochondrial genome. As most cells contain many mitochondria, the
DNA from this organelle may account for ∼1% of cellular DNA despite its small
size. The X and Y chromosomes comprise a large X- and Y-specific portion,
respectively, as well as two small common segments, one at either end, that align and
recombine during male meiosis, known as the pseudo-autosomal regions 1 and 2.
The fluctuations in G-C content generate chromosomal bands upon staining
that were used by cytogeneticists for chromosome identification.
Coding and non-coding genes. The human genome comprises of the order of
21,000 protein-encoding genes. This is considerably lower than anticipated and
possibly less than the gene content of the nematode C. elegans and the plant A.
thaliana. The vast majority of human genes are split, meaning that they are
composed of several usually small exons (∼150 bp) separated by usually large
(∼3,000 bp) introns. This allows for the generation of multiple protein isoforms per
gene by alternative splicing, which appears to be used by nearly all human genes.
Thus, despite the relatively small number of protein-coding genes in the human
genome, they have the capacity to encode a very complex proteome. The gene
catalogue has expanded by gene duplication and specialization. Genes which
descend from a common ancestor gene by duplication are said to be homologous
genes and paralogues. The human thyroglobulin gene and its counterpart in another
species are also homologous genes, yet multiplied by the process of speciation
rather than by duplication; such homologous genes are said to be orthologues.
In addition to protein-encoding genes, the human genome contains non-coding
genes generating RNA molecules as final products. These include the ribosomal
RNA (rRNA) genes, the transfer RNA (tRNA) genes, and the small nuclear and nucleolar
RNA genes, which have been known for some time to be essential for RNA maturation
and translation. It has recently become apparent, however, that the
human genome encodes a flurry of other non-coding RNA genes. Amongst the
best understood figure (i) the microRNA (miRNA) genes, encoding small RNA
molecules that are involved in fine-tuning the expression level of nearly all
messenger RNAs (mRNAs), and (ii) the “long intergenic non-coding” RNA
(lincRNA) genes, which fulfill diverse yet still poorly defined functions. There are
of the order of 1,000 miRNA genes, and several thousand lincRNA genes, in
the human genome. The best understood lincRNA is probably XIST, which plays
an essential role in the inactivation of one of the X chromosomes in women.
In addition to these functional genes, the genome is littered with large numbers
of what appear to be non-functional gene copies, referred to as pseudogenes.
Pseudogenes come in two types: processed and non-processed. Processed
pseudogenes have no promoter and no introns, but often a poly-A tail. They are in
fact retro-transcribed cDNA copies of mRNA that have been integrated in the
genome. The reverse transcriptases responsible for the retrotranscription derive
either from retroviruses or – more likely – from endogenous retrotransposons (see
hereafter). Non-processed pseudogenes have the structure of a usual gene. We
suspect that they are not functional as they are seldom evolutionarily conserved
and often contain multiple highly disruptive mutations, including nonsense (stop
codons) and frameshift mutations.
Repetitive sequences. A large proportion of the genome corresponds to
sequences that are present in more than one copy per genome. These repetitive
sequences include tandem repeats and interspersed repeats.
Tandem repeats include satellites, minisatellites and microsatellites. Satellites
are composed of often thousands of tandem repetitions of motifs that can vary in
length from ten to hundreds of base pairs. They are found at telomeres,
centromeres and in constitutive heterochromatin. Minisatellites are typically
composed of tens to hundreds of tandem repetitions of motifs of 30-50 base
pairs. There are thousands of them and they are dispersed across the genome
with an enrichment in subtelomeric regions. Microsatellites are composed of
tens of tandem repetitions of short motifs ranging from one to <10 base pairs.
There are tens of thousands of them and they are dispersed throughout the
entire genome. One of the common features of all types of satellite sequences is
their very high degree of genetic polymorphism (see hereafter), explaining why
mini- and microsatellites have been used extensively for DNA fingerprinting.
Expansions of trinucleotide repeats that do or do not overlap with coding
sequences are the cause of a number of genetic disorders, including fragile X
syndrome, Huntington disease and myotonic dystrophy.
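
To make the notion of a microsatellite concrete, the following sketch scans a sequence for perfect tandem repeats of short motifs with a regular expression. The thresholds (motifs of at most 6 bp, at least 5 repetitions) and the function name are arbitrary choices for illustration; real repeat finders also tolerate imperfect repeats.

```python
import re


def find_microsatellites(seq, max_motif=6, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    The lazy quantifier makes the regex prefer the shortest motif, so
    "CACACACACACA" is reported as 6 x "CA" rather than 3 x "CACA".
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# A (CA)n dinucleotide repeat flanked by unique sequence.
print(find_microsatellites("TTG" + "CA" * 6 + "GGT"))
# A trinucleotide repeat of the kind that expands in Huntington disease.
print(find_microsatellites("CAG" * 10))
```

Comparing the repeat counts reported at the same locus in different individuals is, in essence, what microsatellite-based DNA fingerprinting measures.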
Interspersed repeats are primarily composed of transposable elements, which
are typically grouped in four categories: (i) SINEs, (ii) LINEs, (iii) LTR-elements
and (iv) DNA transposons. SINEs (Short Interspersed Elements) are processed
pseudogenes from pol III-dependent genes. The most abundant SINEs in the
human genome are the Alu sequences, which are derived from the 7SL RNA gene.
Other SINEs are derived from tRNA genes. The expansion of the SINEs may be
related to (i) the fact that pol III promoters reside within the gene so that
processed pseudogenes remain transcriptionally competent, and (ii) the fact that they
are recognized as preferred substrates by reverse transcriptase, possibly as a
result of an early integration event within a retrotransposon. The typical SINE is
∼100 to 300 bp long. The ∼1.5 million SINE elements account for ∼13% of the
human genome. LINEs (Long Interspersed Elements) are autonomous
retrotransposons, meaning that they transpose via an RNA intermediate that is
retrotranscribed to cDNA prior to reintegration in the genome. Full-length LINE
copies are approximately 6-8 kb long and contain open reading frames encoding
the enzymes required for their transposition, including one with reverse transcriptase
activity. The majority of LINE elements in the genome are truncated. The ∼850,000
LINE elements account for ∼21% of our genome. LTR-elements are
another type of autonomous retrotransposon. LTR stands for Long
Terminal Repeats, and indeed LTR-elements share these structures, as well as
their gene content, which typically comprises gag, pol and env genes, with regular
retroviruses. LTR-elements can be viewed as retroviruses that have forgone the
extracellular phase of their life cycle. There are approximately 450,000
LTR-elements in our genome, which account for ∼8% of its space. The final category of
transposable elements are DNA transposons, which do not transpose via RNA
intermediates as the other mobile elements do, but rather by means of a
cut-and-paste mechanism. There are ∼300,000 such elements, amounting to ∼3% of our
genome. There is much debate about whether interspersed repeats are purely
selfish. The fact that organisms such as the pufferfish in animals,
or A. thaliana in plants, function very well with very few if any interspersed
repeats supports their predominantly parasitic nature. The observation that
distinct families of interspersed repeats have been active at different stages
during evolution, to be finally silenced by their host, also supports a mainly
parasitic nature. However, there is a growing list of examples of “exaptation” or
“domestication” of transposable elements. There are several examples of
essential genes in our genome that clearly derive from ancestral transposable
elements, as well as hundreds of thousands of examples of evolutionarily
constrained elements deriving from mobile elements.
In addition, our genome encompasses large segments (∼100 kb-1 Mb) that are
present in the genome in more than one, yet a small number of, copies. These are
referred to as low copy repeats or segmental duplications. The different copies
can either reside close to each other (for instance as tandem copies), or on different
chromosomes. Segmental duplications often coincide with Copy Number
Variants or CNVs (see hereafter).
Epigenetics. Epigenetics is a very popular, all-encompassing term that deserves
some clarification. It initially referred to everything that was needed “on top” of
the DNA to allow proper execution of the genetic program. It is in this context
that Conrad Waddington introduced the notion of “epigenetic landscape”.
More recently, the term epigenetics has mainly been used to refer to vehicles of
transmissible phenotypic variation other than variation in the DNA sequence.
Why do daughter cells of a hepatocyte mother cell behave as hepatocytes, while
daughter cells of a thyrocyte mother cell behave as thyrocytes, given the fact that
– in a given individual – both cell types harbor exactly the same genome? The
reason that is presently most commonly invoked is that the genome is not
inherited “naked” by daughter cells, but that it comes with a set of tissue-specific
“epigenetic marks”. The most studied epigenetic marks in animals are DNA
methylation, and post-translational modifications of the amino-terminal tails of
the nucleosomal histones (H2A, H2B, H3 and H4).
Cytosines in the 5’-CpG-3’ dinucleotide can be methylated by DNA
methyltransferases at position 5. The most abundant DNA methyltransferase
(DNMT1) is said to be a “maintenance methylase”, as it will methylate the C of a
CpG dinucleotide only if the complementary CpG (on the other strand) is already
methylated. DNMT1’s main role thus appears to be the faithful maintenance of
the DNA methylation status after DNA replication. The mode of action of DNMT1
explains how the DNA methylation status of a given region in the genome can be
faithfully transmitted “epigenetically” from mother to daughter cells. As DNA
methylation is often correlated with gene silencing, the same genes are active in
mother and daughter cells. As DNA methylation patterns (i.e. which parts of the
genome are methylated or not) are tissue-specific, DNA methylation contributes
to the epigenetic maintenance of tissue-specific gene expression patterns. But
how do hepatocytes and thyrocytes (and – for that matter – all other tissue
types) acquire tissue-specific DNA methylation marks in the first place? In
addition to DNMT1, our genome encodes “de novo DNA methylases”, including
DNMT3a and DNMT3b. These can methylate a C in a CpG dinucleotide even if the
complementary strand is not yet methylated. Such de novo DNA methylases are
assumed to be involved in the establishment of the tissue-specific methylation
patterns, which are then propagated through cellular generations by DNMT1.
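
The division of labour between de novo and maintenance methylases can be illustrated with a toy model. In this sketch a locus is reduced to a list of CpG sites, each strand carrying True (methylated) or False per site; the representation and the function names are assumptions made for illustration, and the model ignores enzyme inefficiency and active demethylation.

```python
def de_novo_methylate(duplex, sites):
    """DNMT3a/b-like step: methylate the given CpG sites on both strands."""
    top, bottom = list(duplex[0]), list(duplex[1])
    for i in sites:
        top[i] = True
        bottom[i] = True
    return (top, bottom)


def replicate(duplex, maintenance=True):
    """Semi-conservative replication of a duplex into two daughter duplexes.

    Each parental strand is paired with a newly made, unmethylated strand.
    With maintenance=True, a DNMT1-like activity methylates the new strand
    only where the parental strand is methylated (hemimethylated sites).
    """
    daughters = []
    for parental in duplex:
        if maintenance:
            nascent = list(parental)            # copy marks from the old strand
        else:
            nascent = [False] * len(parental)   # marks are not restored
        daughters.append((list(parental), nascent))
    return daughters


unmethylated = ([False] * 4, [False] * 4)
mother = de_novo_methylate(unmethylated, sites=[0, 2])

# With maintenance, both daughters carry the mother's pattern.
print(replicate(mother))
# Without it, the pattern is diluted: after two divisions some cells have lost it.
hemi = replicate(mother, maintenance=False)[0]
print(replicate(hemi, maintenance=False))
```

The point of the model is that the pattern needs to be written only once; the maintenance rule then suffices to propagate it through any number of divisions.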
Another very important category of “epigenetic marks” is the myriad of
post-translational modifications of the nucleosomal histone tails. These include
methylations, acetylations, phosphorylations, ubiquitylations and even proline
isomerisations. Combinations of specific histone modifications correlate with
the functionality of the corresponding chromosome region. Accordingly, this
“histone code” can be used to distinguish promoters (active, weak, poised),
enhancers (strong, poised), insulators, transcribed regions, polycomb-repressed
regions, heterochromatin, etc. The enzymatic complexes that impose these histone
marks include polycomb repressive complexes 1 (PRC1) and 2 (PRC2). Contrary
to DNA methylation, there is no clear understanding of the molecular
mechanisms that ensure the faithful inheritance from mother to daughter cells of
the tissue-specific histone marks. A possibility is the direct physical interaction
between the histone-modifying and DNA methylation machineries.
An often overlooked mechanism for epigenetic inheritance, despite its
extensively documented contribution to – for instance – the differentiation
between the lytic and lysogenic states of bacteriophages, is the “positive feedback
loop”. An inheritable differentiation switch may be determined by the
presence/absence of a transcription factor that not only determines
tissue-specific transcription programs, but also enforces its own presence by a direct or
indirect positive feedback loop. Once the transcription factor is expressed in a
mother cell, and provided that it remains present at the required concentrations in
the daughter cells after mitosis and cytokinesis, it will remain expressed in the
daughter cells and contribute to the maintenance of the cell-type specific
transcription pattern. If – on the contrary – it is not expressed in the mother cell,
it will remain “off” in the daughter cells, with concomitant effect on their
respective transcriptomes. It is not known how far this mechanism contributes
to the differentiation between animal cell types, or – for that matter – to cancer
progression.
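
The memory property of a positive feedback loop can be sketched with a one-variable model in which a transcription factor activates its own production cooperatively and is degraded at a constant rate. The Hill-type production term and all parameter values below are assumptions chosen for illustration, not measured quantities.

```python
def simulate_factor(x0, vmax=1.0, k=0.5, hill=4, decay=1.0,
                    dt=0.01, steps=5000):
    """Euler integration of dx/dt = vmax * x^n / (k^n + x^n) - decay * x.

    x is the concentration of a self-activating transcription factor.
    With the default (assumed) parameters the system is bistable: it
    settles either at 0 ("off") or at a high level ("on"), depending
    only on the starting concentration inherited from the mother cell.
    """
    x = x0
    for _ in range(steps):
        production = vmax * x**hill / (k**hill + x**hill)
        x += dt * (production - decay * x)
    return x


# Daughter of an "on" mother cell: the factor stays expressed.
print(round(simulate_factor(1.0), 3))
# Daughter of an "off" mother cell: the factor stays off.
print(simulate_factor(0.0))
# Inherited level below the switching threshold: expression is lost.
print(simulate_factor(0.3))
```

Two cells with identical DNA sequence and identical parameters thus end up in different stable states, which is the sense in which such a loop carries epigenetic information.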
Thus far we have considered epigenetics as a mechanism contributing to cellular
differentiation in multicellular organisms. A distinct question is whether
epigenetic mechanisms contribute to heritable differences between individuals.
This – somewhat controversial – phenomenon is referred to as
“transgenerational epigenetic effects” (TEE). One possible molecular
explanation to account for TEE is “transgenerational epigenetic inheritance”
(TEI) or “gametic epigenetic inheritance”. TEI implies that different individuals
may inherit alleles that have identical nucleotide sequence yet differ
epigenetically, and that these “epialleles” function differentially, thereby causing
distinct phenotypes. Epialleles have been reported in plants and include (i) the
peloric variant of Linaria resulting from the methylation and silencing of the Lcyc
gene, as well as (ii) the epigenetic silencing of the paramutable b1 locus in maize.
A number of examples have also been reported in mice, including the agouti
viable yellow (Avy) allele. This allele results from the insertion of an intracisternal
A particle (IAP) retrotransposon upstream of the agouti gene. The IAP LTR
drives ectopic expression of the agouti gene, causing the yellow phenotype. The
degree of ectopic expression correlates with the methylation status of the LTR,
which is – at least in part – heritable (hence the name metastable epiallele). The
methylation status of the IAP LTR also appears to be sensitive to diet. There is
some evidence that rare instances of hereditary non-polyposis colorectal cancer
(HNPCC) involve the inheritance of inactivated epialleles of the mismatch
repair/tumor suppressor MLH1 gene. The segregation of epialleles implies their
resistance to the reprogramming that is thought to reset nearly all epigenetic
marks in the germline. (Note that parental imprinting involves the imposition of
parent-of-origin specific epigenetic marks in the corresponding germline, but
that these are reset according to the parental sex each generation.)
TEI is not the only possible explanation of TEE. A nice example of a TEE that
does not depend on TEI (yet involves DNA methylation in the soma) is the
transgenerational inheritance of mothering style and stress in the rat. Another
widely publicized case of possible TEE is the effect of malnutrition of pregnant
mothers on the future health of their offspring and possibly grand-offspring, as
documented for women who were pregnant during the Second World War in the
Netherlands and Russia. The precise mechanisms underlying these TEE remain
largely unknown.
B. OBTAINING A REFERENCE SEQUENCE OF THE HUMAN GENOME
From Sanger to Next Generation Sequencing. For approximately three decades,
determining the nucleotide sequence of a DNA molecule was nearly exclusively
done by means of Fred Sanger’s “dideoxy” method. In the initial version of this,
one strand of the DNA molecule to be sequenced was copied using a
DNA-dependent DNA polymerase, in the presence of a mixture of the four regular
deoxynucleotide triphosphates, one radiolabelled deoxynucleotide triphosphate,
and one of the four dideoxynucleotide triphosphates. At each residue that is
complementary to the dideoxynucleotide used in the reaction, the DNA
polymerase has the option to incorporate either the regular deoxynucleotide or
the dideoxy “terminator”. If the DNA polymerase incorporates the
dideoxynucleotide, further extension of that molecule by the DNA polymerase is
blocked by the absence of the needed hydroxyl group in the 3’ position of the
sugar of the incorporated dideoxynucleotide. If, on the contrary, the DNA
polymerase incorporates the regular deoxynucleotide at that position, further
extension of the molecule is allowed until it faces the same “choice” at the next
position that is complementary to the dideoxynucleotide used. To prime the
polymerization, an oligonucleotide that hybridizes to the 3’ extremity of the
template to be copied is added to the reaction. The reaction is conducted “in
parallel” on a very large number of copies of the template, which are either
obtained by conventional cloning using plasmid or phage-based vectors, or by “in
vitro cloning” using the Polymerase Chain Reaction (PCR). When using
conventional cloning, the primer typically targets known vector sequences
flanking the insert to be sequenced. When completed, Sanger’s sequencing
reaction will have generated a mixture of DNA molecules of different sizes, but
systematically ending with the same nucleotide (by incorporation of the
corresponding dideoxynucleotide). This reaction is repeated four times, once with each of
the four possible dideoxynucleotide triphosphates. Running the products of the
four reactions on an acrylamide gel (and visualizing the products by
autoradiography) revealed a ladder of fragments of different sizes that allowed
immediate reading of the DNA sequence.
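
The logic of reading a four-lane Sanger gel can be sketched as follows. The model is idealized: it assumes that every possible terminated fragment is produced, counts fragment lengths from the end of the primer, and takes the template written 3’ to 5’ so that the new strand is simply its base-by-base complement; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template_3to5):
    """New strand (5'->3') made by the polymerase on a template given 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def sanger_lanes(template_3to5):
    """Fragment lengths expected in each of the four dideoxy reactions.

    In the ddA reaction, chains can only terminate where the new strand
    carries an A, so the lane contains one fragment per A position; the
    same holds for the C, G and T lanes.
    """
    new = synthesized_strand(template_3to5)
    return {
        base: [i + 1 for i, b in enumerate(new) if b == base]
        for base in "ACGT"
    }


def read_gel(lanes):
    """Read the ladder from the shortest to the longest fragment."""
    ladder = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in ladder)


lanes = sanger_lanes("TACGGT")
print(lanes)
print(read_gel(lanes))
```

Reading the gel amounts to sorting fragments by size and noting in which lane each one appears, which is exactly what `read_gel` does.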
In the middle of the nineties, the use of radiolabeled nucleotides was replaced by
the use of dideoxynucleotides that were labeled with four distinct fluorophores.
This allowed the four reactions to be conducted simultaneously and their
products to be run in the same lane. Flat gels were replaced by capillaries, and
manual reading was replaced by automatic data capture using charge-coupled
devices (CCD) that would read the color of the light emitted at the extremity of
the capillaries while electrophoresis was proceeding. This would typically allow
the generation of sequence reads of ∼600-800 base pairs. The first reference
genomes of E. coli, S. cerevisiae, D. melanogaster, C. elegans, H. sapiens and M. m.
domesticus were generated using this first generation of “automatic capillary
sequencers”.
The next technological breakthrough was the development of methods of
“sequencing by synthesis”. In these, the heterogeneous mix of reaction
products is not examined at the very end of the sequencing reaction; instead, the
sequencing reaction is conducted one nucleotide at a time and the
incorporated nucleotides are determined after each reaction “cycle”. Three different
chemistries have dominated approaches for “sequencing by synthesis”. The first
is pyrosequencing. As for Sanger sequencing, pyrosequencing is based on the
generation of a copy of a primed template-to-be-sequenced by a DNA-dependent
DNA polymerase. In this reaction, the four nucleotides are added serially, one at
a time, to the polymerization reaction (the succession is for instance G, then A,
then C, then T). If the next base to be copied is complementary to the nucleotide
that is added to the reaction, the latter will be incorporated by the DNA
polymerase. This reaction will stoichiometrically release a pyrophosphate and a
hydrogen ion. The pyrophosphate can be quantitatively detected by means of a
luciferase-catalyzed reaction that releases light. This is the approach that was
used by the 454 technology that was acquired by Roche. The hydrogen ion can
be quantitatively detected by virtue of the change in pH that it causes. This
approach, referred to as “ion semiconductor sequencing”, was developed by Ion
Torrent and acquired by Life Technologies; it does not require scanning with a
CCD camera, which shortens the cycle time. If the next two or more bases to be
copied are complementary to the nucleotide that is added to the reaction, the
amount of light emitted or the degree of pH change will be commensurate,
allowing the complementary residues to be counted accurately up to some number
(which explains why these methods face some difficulties with mononucleotide
repeats). The results of pyrosequencing are often represented as a “pyrogram”.
Pyrosequencing allows read lengths >500 base pairs.
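
The relation between the serial nucleotide flows and the resulting pyrogram can be sketched as below. The model is noise-free: the signal of a flow is taken to be exactly the number of bases incorporated, whereas real signals become hard to resolve for long mononucleotide runs. The input is the strand being synthesized (the complement of the template), and the function names are illustrative.

```python
def flowgram(new_strand, flow_order="GACT", max_flows=400):
    """Idealized signal per nucleotide flow, as (base, bases_incorporated).

    Nucleotides are flowed one at a time in a fixed cyclic order. A flow
    gives signal 0 if the next base to be made differs from the flowed
    nucleotide, and signal n if the next n bases are all of that kind.
    """
    signals = []
    position = 0
    flow = 0
    while position < len(new_strand) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        while position < len(new_strand) and new_strand[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


def call_bases(signals):
    """Convert flow signals back into a sequence."""
    return "".join(base * count for base, count in signals)


signals = flowgram("GGATC")
print(signals)
print(call_bases(signals))
```

In the example the first G flow yields a signal of 2; distinguishing, say, 7 from 8 in a long homopolymer is where the real chemistry loses accuracy.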
The second “sequencing by synthesis” technology is “sequencing by ligation”. It
takes advantage of the fact that DNA ligases can attach an incoming
oligonucleotide to a primed template-to-be-sequenced provided that the
oligonucleotide is perfectly complementary to the template sequence
immediately adjacent to the primer. The ligation reaction is carried out in the
presence of a mixture of degenerate oligonucleotides that are specified by
either one, or a pair of residues (either in the middle or at the 5’ end), and
labeled with one of four specific fluorophores. At a given cycle in the sequencing
reaction, only one oligonucleotide can be added to the growing chain by the DNA ligase.
Its identity can be determined by virtue of the color of the emitted light. The
fluorophore is then released by cleaving off the part of the oligonucleotide carrying it,
and a new cycle is initiated. After a number of cycles, the chain of linked
primer-oligonucleotides is released and the entire operation is repeated with a primer
with a one-nucleotide offset relative to the previous one. This process is repeated
until all bases in the template have been interrogated and its entire sequence
determined. This approach was developed as the SOLiD technology, which was acquired by
Applied Biosystems and then Life Technologies.
The third (and presently dominant) sequencing-by-synthesis method uses
“reversible terminators”. The 3’-OH group of these nucleotide analogues is
typically replaced by various chemical groups, causing them to be terminators in
the Sanger sense. Moreover, they carry one of four specific fluorophores
attached to the base. Both the 3’ blocking group and the fluorophore can be
chemically removed, restoring a 3’-OH group (hence the name reversible
terminator) and a nearly natural base. As in the previous
sequencing-by-synthesis methods, determining the nucleotide sequence of a template molecule
is achieved by its sequential replication by a DNA-dependent DNA polymerase
extending a complementary primer. Depending on the next base in the template,
the DNA polymerase will incorporate one of the four terminators, whose identity
can be determined by reading the color of the light emitted by its fluorophore.
Blocking group and fluorophore are then eliminated, and this “wash and scan”
cycle is repeated as many times as possible. Sequencing with reversible
terminators now allows the generation of reads over 250 base pairs in length.
The corresponding sequencing technology was developed by Solexa, which was
later acquired by Illumina, which presently holds a >85% share of the
sequencer market.
The biochemistries of the “sequencing-by-synthesis” (SBS) approaches do not – on their
own – explain why these would be more efficient than original Sanger
sequencing. As a matter of fact, the read length achieved by SBS is still inferior to
what can be achieved by conventional Sanger sequencing. The superiority of SBS
results from its combination with specific template preparation and processing
methods allowing the analysis of millions of sequencing reactions in parallel.
The limited sensitivity of all SBS methods still requires the generation of clonally
amplified templates. In Sanger sequencing this was initially achieved by in vivo
cloning and then increasingly by conventional PCR. In SBS approaches this is
achieved by applying modified PCR-based methods, yielding what are sometimes
called “polonies”.
The first of these is “emulsion PCR”. In this, “adapters” are ligated to the
double-stranded DNA to be sequenced. The double-stranded adapters are “Y-shaped”,
ensuring that the 5’ and 3’ ends of both the Watson and Crick strands are distinct
and non-complementary. The resulting ligation products are then denatured and
mixed with (i) beads that are coated with primers that are complementary to the
3’ end of the ligation products, (ii) a solution containing PCR primers (one
complementary to the 3’ end of the ligation products and the other corresponding
to the 5’ end of the ligation products), deoxynucleotide triphosphates, and Taq
polymerase, and (iii) oil. This generates an emulsion of aqueous droplets
floating in oil. The emulsion is generated so as to maximize the proportion of
droplets containing one bead and one single-stranded template molecule. The
emulsion is then subjected to temperature cycling, allowing a polymerase chain
reaction to occur in each droplet. After the PCR, the emulsion is broken, and the
beads are denatured (so as to keep only the DNA strand that is covalently attached
to the bead), washed and recovered. For the droplets that contained one bead
and one template molecule, this process will generate beads covered with a large
number of identical, single-stranded templates. In the Roche 454/FLX
instrument, the corresponding beads are then loaded individually in one of ∼1
million PicoTiterPlate (PTP) wells, in which the pyrosequencing reaction will
take place. As a consequence, it will be possible to monitor ∼one million
sequencing reactions in parallel, each yielding up to 1,000 bp reads, for a typical
total of ∼700 Mb of sequence per run. (On the SOLiD platform, x million such
beads are spread and chemically cross-linked to an amino-coated glass surface.
This allows for ∼ million sequencing-by-ligation reactions yielding up to ... bp
reads to be conducted in parallel, for a total of ... base pairs of sequence per run.)
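
Why the emulsion has to be tuned, and why many droplets are nevertheless wasted, can be illustrated with a simple occupancy model. The sketch assumes that beads and template molecules are distributed over droplets independently and at random (Poisson statistics), which is an idealization not stated in the text; the function names are illustrative.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k items when the average is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def fraction_useful_droplets(mean_beads, mean_templates):
    """Fraction of droplets holding exactly one bead and one template.

    Assumes beads and templates are loaded independently, each following
    a Poisson distribution with the given average number per droplet.
    """
    return poisson_pmf(1, mean_beads) * poisson_pmf(1, mean_templates)


for mean in (0.5, 1.0, 2.0):
    useful = fraction_useful_droplets(mean, mean)
    print(f"average {mean} per droplet -> {useful:.3f} useful droplets")
```

Under this model the useful fraction peaks when both averages equal one per droplet, and even then most droplets are empty, lack a bead or a template, or contain more than one template and would yield mixed signals.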
The second approach towards generating massive amounts of polonies in
parallel is so-called “bridge PCR”, as implemented on the Illumina instruments.
In this approach, the DNA fragments to be sequenced are ligated to “Y-shaped”
adapters as for emulsion PCR. The corresponding ligation products are then
denatured and directly hybridized onto a glass plate (referred to as the “flow cell”)
that is covered with a “lawn” of oligonucleotides comprising (i) primers that are
complementary to the 3’ end of the ligation products, and (ii) primers
corresponding to the 5’ end of the ligation products. Hence, the single-stranded
templates hybridize by their 3’ end to complementary oligonucleotides attached
to the flow cell. A DNA polymerase then extends the attached primer using the
annealed single-stranded template as a model. This generates a strand that is
bounded at its 3’ end by a sequence that is complementary to the other attached
primer (5’ end of the ligation products). The flow cell is then subjected to
temperature cycles allowing for solid-phase amplification or bridge PCR. The 3’
end of the freshly synthesized strand first hybridizes to a neighboring
complementary primer fixed on the glass support. A DNA polymerase then
extends the primer to generate the complementary strand, resulting in two
closely positioned complementary strands that are both covalently attached to
the solid support. The succession of several such temperature cycles generates
a tight cluster of clonally amplified molecules. It is at present possible to
generate >1 billion such clusters per lane, or >8 billion clusters per Illumina
flow cell (which contains eight such lanes). The actual sequencing-by-synthesis
reaction using reversible terminators is then conducted on the corresponding
clusters using either one of the two primers. It is beyond the scope of this course
to describe the procedures in full detail, but – in effect – both strands composing
each cluster can be sequenced one after the other.
The hundreds of successive “wash-and-scan” cycles that are required to
complete a sequencing-by-synthesis reaction are time consuming. A typical
sequencing reaction requires ∼2 days on a Roche FLX and >10 days on an
Illumina HiSeq 2000 instrument. One of the advantages of the Ion Torrent is that
measuring the pH in individual lanes is much faster than the collection of images
using a CCD camera. A typical run on an Ion Torrent machine requires only a
few hours.
(Table 1 summarizes the throughput that is typically achieved with the four main
sequencing-by-synthesis instruments that are presently available.)
While sequencing-by-synthesis has only become available in the last ten years,
the next generation of sequencing technologies already looms on the horizon.
These so-called third generation approaches target “single-molecule
sequencing”, hence obviating the need for the preparation of clonally amplified
templates. Helicos commercialized the first instrument performing
single-molecule sequencing. It used reversible terminators and shared several features
with the Illumina technology (except for the lack of a bridge PCR
step). A radically different approach, referred to as “Single-Molecule Real-Time
Sequencing” (SMRT), is being commercialized by Pacific Biosciences. In this
approach a single primed template molecule is captured by a DNA polymerase
that is fixed to the bottom of a tiny well using the biotin/streptavidin interaction.
The DNA polymerase is fed deoxynucleotides that carry a fluorescent reporter
attached to their tri-phosphate or equivalent moiety. When the DNA polymerase
incorporates a given nucleotide, the corresponding fluorophore remains at the
catalytic site for a period of milliseconds, generating a detectable signal when
combined with zero-mode waveguide (ZMW) technology (which restricts the
exciting laser beam to the bottom of the well). Incorporation of the nucleotide
into the growing chain releases the fluorophore while leaving a perfectly natural
base that can be further extended by the DNA polymerase. SMRT sequencing is
fast, allows read lengths of thousands to tens of thousands of base pairs, which is
a major advantage for the assembly of complex genomes (see hereafter), and
may allow for the identification of modified nucleotides, including methylated ones
(as methylation affects the kinetics of incorporation in a measurable way). However,
the error rate remains very high and the number of ZMWs that can be monitored
in parallel is presently limited to ∼150,000, hence limiting the throughput.
Other routes that are being explored for single-molecule sequencing include DNA
sequencing with nanopores, and direct imaging of DNA sequences using
tunneling and transmission-electron-microscopy-based approaches.
Shotgun sequencing and mate-pair libraries. Except for some viruses,
genome length is obviously much larger than achievable read length. To obtain
the complete sequence of a genome of interest one therefore typically applies
“shotgun sequencing”. In this, many copies of the genome of interest are
randomly fragmented to yield pieces of a size compatible with read length. As
multiple copies of the genome are subjected to fragmentation, resulting
fragments may overlap. This library of fragments is then “clonally amplified”. In
the time of Sanger sequencing, this was done by ligating the fragments in a
plasmid or phage vector, transforming or infecting competent bacterial cells, and
generating colonies each amplifying a distinct fragment. When using
sequencing-by-synthesis methods this is done using either emulsion or bridge
PCR (see above). In the end, one generates sequence reads from a large number
of random (hence the name “shotgun sequencing”) fragments derived from the
genome of interest. The complete genome sequence is then reconstructed “in
silico” by searching for overlapping sequences (corresponding to overlapping
fragments). Aligning all overlapping reads generates sequence contigs including
a “tiling path” that, in an ideal world, should span and hence correspond to the
entire genome (or to be more precise there should be as many tiling paths as
there are chromosomes). For the tiling paths to span the entire genome, i.e. not
to be interrupted by “gaps”, a large fraction if not the entire genome has to have
been sequenced at least twice. To achieve this, and accounting for random as
well as systematic (f.i. related to base pair composition) variations in sequence
depth, the genome has to have been sequenced at a depth of ∼100. Assuming a
genome size of 3 × 10^9 bp and a read length of 100 bp, this implies that one would
have to generate 100 × (3 × 10^9) / 100 = 3 × 10^9 reads (100-fold depth) rather than
(3 × 10^9) / 100 = 3 × 10^7 reads (1-fold depth).
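The arithmetic above can be reproduced with a short Python sketch. The function names are illustrative and not part of the original text, and the Poisson approximation for the uncovered fraction is an added assumption (reads landing uniformly at random), not a claim made in the notes.

```python
import math

def reads_needed(genome_size: int, read_length: int, depth: float) -> int:
    """Number of reads required to reach a target average sequence depth."""
    return math.ceil(depth * genome_size / read_length)

def expected_uncovered_fraction(depth: float) -> float:
    """Assumption: reads fall uniformly at random (Poisson model), so the
    fraction of bases covered by no read at all is exp(-depth)."""
    return math.exp(-depth)

genome_size = 3 * 10**9   # haploid human genome, in base pairs
read_length = 100         # base pairs per read

print(reads_needed(genome_size, read_length, depth=1))    # 1-fold depth
print(reads_needed(genome_size, read_length, depth=100))  # 100-fold depth
print(expected_uncovered_fraction(1.0))                   # idealized 1-fold case
```

Under this idealized model even modest depth leaves few uncovered bases, which is why the systematic biases mentioned above, rather than chance alone, motivate the much higher depth used in practice.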
Experience shows that even if sequencing at ∼100-fold depth, it is impossible to
reconstruct the entire genome using this approach only. In the draft sequence of
the human genome published in 2001, 50% of the bases resided in contigs with
minimum size of 826,000 bp (“N50”) (total number of contigs = 4,884). In the
so-called finished sequence of the human genome, published in 2004, N50 was
38,509,590 bp, and the average contig size was ∼41 Mb, hence corresponding to
approximately 350 contigs. Several factors account for the difficulty (if not
impossibility) of generating the complete genomic sequence of, for instance, a mammal
by means of shotgun sequencing only. The main factor is the high content
(∼50%; see above) in interspersed repetitive sequences. Finding the same
interspersed repeat in two reads obviously does not mean that they derive from
DNA fragments that overlap in the genomic sequence. Thus interspersed
repetitive sequences need to be ignored, or “masked”, before initiating the in
silico reassembly process. This will make it very difficult if not impossible to
reassemble regions that are rich in repetitive sequences, particularly when the
sequence reads are short as is the case with the “sequencing by synthesis”
methods. Thus, many gaps in the sequence correspond to interspersed repeat-rich
regions. Segmental duplications are another nightmare for genome
assemblers. The assembly process will often collapse segmental duplications
into one, characterized only by an unusually high (∼ doubled for a true
duplication) sequence depth when compared to the rest of the genome. Satellite
sequences corresponding to centromeres and telomeres are so difficult to
assemble that they are usually just ignored when reconstructing genome
sequences. A second factor complicating reassembly is the pervasive genetic
polymorphism. The DNA to sequence is typically extracted from cells of at least
one, sometimes several individuals. These individuals are diploid: they actually
contain two genomes, one inherited from the father and one from the mother.
Two genomes drawn “at random” from the population differ at least every 1,000
base pairs at so-called “polymorphic sites” (see hereafter). Regions of overlap
between fragments may thus in effect differ if originating from different
homologues (the paternal and the maternal). When two reads are
characterized by nearly identical sequences, the question thus becomes whether
they derive from different parts of the genome or from overlapping but “allelic”
fragments. To mitigate this issue, scientists select, whenever possible,
individuals that are as inbred as possible as a source of DNA to generate a
genomic sequence. Indeed the two alleles of inbred individuals have a higher
chance to be “identical-by-descent” than those of outbred individuals. In inbred
strains of mice for instance, there are virtually (with the exception of de novo
mutations; see hereafter) no polymorphisms that differentiate the paternal and
maternal genomes.
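The “N50” statistic quoted above for the 2001 and 2004 assemblies can be computed as follows. This is a minimal sketch: the function name and the toy contig lengths are illustrative and are not data from either assembly.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L together
    contain at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    # Walk through the contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty assembly")

# Toy assembly (lengths in Kb, hypothetical values).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the statistic is weighted by bases rather than by contig count, a handful of very long contigs can yield a high N50 even when many small contigs remain.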
Shotgun sequencing, even at high depth, hence typically generates many
unconnected sequence contigs, which is not very satisfying. To help order and
orient contigs, scientists have devised approaches based on “mate pairs” or
“paired ends”. The first versions of these approaches were implemented with
Sanger sequencing. Genomic libraries were constructed in plasmid (or
phagemid) vectors from large DNA fragments, for instance 5, 10 or even 50 Kb.
DNA was subjected to milder fragmentation treatments, and fragments in the desired
size range were selected by gel electrophoresis and ligated in cloning vectors. Clones
from the corresponding libraries were then sequenced with vector-specific
primers flanking both ends of the insert, yielding “mate pair” reads, i.e. two reads
known to correspond to the two ends of the same fragment. Large numbers of
such mate pair reads were then used to reassemble the genome based on
overlapping reads as described above to yield a number of disconnected
sequence contigs. The mate pair information was used at this stage to
identify mate pairs for which one read was part of one contig, and the other
read part of another contig. Such events would unambiguously indicate that the
corresponding contigs are neighbors, would determine their relative orientation,
and would size the gap separating them. This would establish
ordered and oriented sets of (previously unconnected) contigs, referred to as “sequence
scaffolds”. The draft sequence of the human genome published in 2001
comprised 2,191 such sequence scaffolds with N50 of 2,279 Kb. Moreover,
knowing the mate pairs connecting neighboring sequence contigs, the
corresponding clones could be used to complete the intervening sequence,
thereby closing the gaps to progressively lead to a truly “finished” sequence.
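The grouping step of scaffolding can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the procedure used for the human genome: it only groups contigs connected by mate pairs, and ignores their order, orientation and gap size. The `min_links` threshold and all identifiers are assumptions introduced for the example.

```python
from collections import Counter

def build_scaffolds(contigs, mate_pairs, min_links=2):
    """Group contigs into scaffolds when at least `min_links` mate pairs
    have one read in each contig (hypothetical support threshold)."""
    links = Counter()
    for contig_a, contig_b in mate_pairs:
        if contig_a != contig_b:          # pairs inside one contig add nothing
            links[frozenset((contig_a, contig_b))] += 1

    parent = {c: c for c in contigs}      # union-find forest

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for pair, count in links.items():
        if count >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())

# Each tuple gives the contigs in which the two reads of a mate pair landed.
pairs = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"),
         ("c3", "c4"), ("c1", "c1")]
print(build_scaffolds(["c1", "c2", "c3", "c4"], pairs))
```

Requiring more than one supporting mate pair is a common-sense guard against chimeric clones; the right threshold would depend on library quality.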
The “mate pair” approach was later adapted to “sequencing by synthesis”
approaches. In one such method, the DNA to be sequenced is gently fragmented
and pieces of desired size (f.i. 5 Kb) isolated by gel electrophoresis. Their ends
are then enzymatically repaired using a DNA polymerase and biotinylated dNTPs.
The resulting blunt-ended molecules are then circularized using ligases and
more aggressively refragmented to obtain pieces of ∼250 bp. The fragments
encompassing the ligation junctions are then affinity purified with streptavidin,
appended with Y-shaped adapters and both ends of the ensuing fragments
sequenced on next generation sequencers as described above.
Linkage maps, radiation hybrid maps and BAC contigs. It is presently possible
to generate a very decent reference sequence of the genome of any mammal
“just” by applying shotgun sequencing in combination with “mate pair” strategies
on “sequencing by synthesis” sequencers. However, additional information was
utilized to obtain the first reference sequence of the human genome, in a
“hierarchical sequencing approach”.
The “human genome project”, initiated in ∼1990 and aiming to produce the
first reference sequence of the human genome, started with the construction of a
series of “maps” as a preamble to the actual sequencing phase. The first maps to
be generated were “linkage maps”. These were composed of tens of thousands
of microsatellite markers that were positioned relative to each other by applying
the principles of linkage analysis established by Morgan in Drosophila. To that
end, three-generation families with as many children as possible (mainly
sampled in the Mormon population from Utah and known as the “CEPH
families”) were genotyped for all known microsatellites. As microsatellite
markers are often highly polymorphic, paternal and maternal alleles differ in
many individuals, which are therefore said to be heterozygous. When tracking
two or more such microsatellites jointly in such extended families, it is possible
to identify recombination events as instances where an offspring inherits f.i. an
allele tracing back to its paternal grandfather for one microsatellite and an allele
tracing back to its paternal grandmother for another microsatellite. For closely
“linked” microsatellites such recombination events (involving cross-overs) are
expected to be rare, while for microsatellites that are located on distinct
chromosomes or far from each other on the same chromosome, such
recombination events will be observed in half the children. As established by
Morgan, the observed recombination rate between two markers is thus a
measure of their distance, which is typically measured in “centimorgans” (one
centimorgan is the distance separating two markers for which the recombination
rate is 1%). This information can be used to identify groups of syntenic markers
(i.e. located on the same chromosome) and to order them on the basis of
observed cross-over events. Linkage maps of the human genome including tens
of thousands of microsatellite markers were hence constructed in the nineties.
In addition to forming a backbone for the human genome project, the motivation
to develop linkage maps lay in the opportunities that they created to locate genes
underlying inherited diseases.
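As a hedged illustration of how a recombination rate is estimated, the sketch below counts recombinant offspring for one pair of markers. It assumes that the grandparental origin of the paternal allele is known for every child at both markers; the labels "GF"/"GM" and the function names are invented for the example. The conversion to centimorgans simply applies the definition given above (1% recombination = 1 centimorgan), which is only a good approximation for closely linked markers.

```python
def recombination_fraction(origins):
    """origins: one (marker1, marker2) tuple per child, each entry giving the
    grandparental origin ('GF' or 'GM') of the transmitted paternal allele.
    A child is recombinant when the two origins differ."""
    recombinants = sum(1 for m1, m2 in origins if m1 != m2)
    return recombinants / len(origins)

def approx_centimorgans(theta):
    """Definition from the text: a recombination rate of 1% equals 1 cM."""
    return 100.0 * theta

# Hypothetical data: 20 children, 2 of them recombinant.
children = ([("GF", "GF")] * 9 + [("GM", "GM")] * 9
            + [("GF", "GM"), ("GM", "GF")])
theta = recombination_fraction(children)
print(theta, approx_centimorgans(theta))
```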
While tens of thousands of microsatellite landmarks spanning the genome may
seem a lot, the average distance separating such markers is still of the order of
100 Kb, which is still large. To increase the density of landmarks, geneticists
turned their attention to “radiation hybrids” (RH). RH are obtained by first
irradiating an euploid human cell line with X-rays. This generates chromosome
breaks at a rate that can be controlled by varying the applied dose of X-rays.
The number of breaks is typically so high that, if left on their own, all cells would
rapidly die. However, irradiated cells can be rescued by growing them in the
presence of a rodent cell line while favoring cell fusion by means of a virus (f.i.
Sendai virus) or chemical (f.i. polyethylene glycol). If one uses a rodent cell line
harboring a conditional defect (f.i. deficiency in thymidine kinase (TK)), and if
growing the cells in a selective medium, only interspecific hybrids maintaining
the human TK gene will survive. By doing so, one can select clones of
interspecific hybrids or heterokaryons. Such clones are typically characterized
by a full complement of rodent chromosomes having integrated fragments of
the human genome at multiple locations, amounting to ∼15% of the human genome.
Given the selection procedure, all of the surviving hybrid clones will have
integrated a fragment of the human genome containing the TK gene. But all other
fragments can be viewed as a random sample of the human genome. Each hybrid
clone will hence contain a different portion of the human genome, and this
feature allows a panel of RH to be used as a mapping resource. Such a panel
typically contains ∼100 independent clones. These are grown and DNA is
extracted from all of them. This DNA panel is then used to perform PCR
reactions for as many “amplicons” as possible. Such amplicons can correspond
to any short unique sequence in the genome and are sometimes referred to as
sequence tagged sites or STS. Contrary to the requirements of genetic markers in
linkage maps, the STS do not need to (although they can) be polymorphic. Any
sequence will do, as long as it can easily be amplified by PCR from genomic DNA.
When typing an RH panel for a given STS, approximately 15% of the clones will be
“positive” (i.e. the PCR amplification will yield the expected product), and 85% will
be negative. When typing an RH panel for two STS which map, for instance, to different
chromosomes, one expects ∼15%*15%=2.25% of the RH clones to be positive for
the two STS. To be more precise, the expected proportion of double-positive
clones will be r1*r2, where r1 and r2 are the observed retention rates for STS1
and STS2 respectively (r1 and r2 will be close to, but not exactly, 15%). Likewise,
the expected proportion of double-negative clones is (1-r1)*(1-r2), while the
expected proportion of clones that are positive for only one of the two STS is
r1*(1-r2)+(1-r1)*r2. If the two STS are located very close to each other in the genome, the
proportion of double-positive and double-negative clones will be higher than
expected, while the proportion of single-positive clones will be lower than
expected. Indeed, double-positives will occur at a rate of ∼ r1 ≈ r2 > r1*r2, and
double-negatives at a rate of ∼ (1-r1) ≈ (1-r2) > (1-r1)*(1-r2). A single-positive
clone can only be observed if a chromosome break occurred between STS1 and
STS2, and if one (and only one) of the fragments was retained in the RH clone.
The closer the two STS are located, the less likely this event becomes. Thus the
degree of excess of double-positives and double-negatives and the depletion of
single-positives is a measure of the distance between the two STS. Several RH
panels have been generated for the human genome and “genotyped” in factory
style for hundreds of thousands of STS, including the microsatellites forming the
basis of the linkage maps. By exploiting the principles outlined above,
“radiation hybrid maps” comprising as many STS were constructed, hence
augmenting the density of landmarks on the human genome. RH panels differ by
the dose of X-rays used (f.i. 5,000 vs 15,000 rads), providing mapping power and
resolution at different scales.
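The comparison between observed and expected clone proportions can be sketched as follows. The retention vectors are hypothetical (1 = PCR positive, 0 = negative, one entry per RH clone) and the function name is an assumption made for the example; a real RH mapping package would go on to turn the excess into a distance estimate, which this sketch does not attempt.

```python
def rh_pair_summary(sts1, sts2):
    """Compare observed clone proportions for two STS with the proportions
    expected if the two STS were retained independently (unlinked)."""
    n = len(sts1)
    r1 = sum(sts1) / n            # observed retention rate of STS1
    r2 = sum(sts2) / n            # observed retention rate of STS2
    pairs = list(zip(sts1, sts2))
    observed = {
        "double_positive": sum(a == 1 and b == 1 for a, b in pairs) / n,
        "double_negative": sum(a == 0 and b == 0 for a, b in pairs) / n,
        "single_positive": sum(a != b for a, b in pairs) / n,
    }
    expected = {
        "double_positive": r1 * r2,
        "double_negative": (1 - r1) * (1 - r2),
        "single_positive": r1 * (1 - r2) + (1 - r1) * r2,
    }
    return observed, expected

# Toy panel of 20 clones in which the two STS are almost always co-retained.
sts1 = [1, 1, 1] + [0] * 17
sts2 = [1, 1, 0] + [0] * 17
observed, expected = rh_pair_summary(sts1, sts2)
for key in observed:
    print(key, observed[key], expected[key])
```

In this toy panel the double-positive class is in excess and the single-positive class is depleted relative to independence, the signature of closely located STS described above.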
Finally, before initiating the actual sequencing stage, scientists constructed a
physical map of the human genome consisting of a series of “BAC contigs”
anchored on the linkage and/or the RH maps. BACs, or bacterial artificial
chromosomes, are plasmid-like bacterial cloning vectors. Their main feature is
that BACs can be used to maintain and propagate inserts as large as 250 Kb. This
is primarily due to the fact that their origin of replication maintains a low
number of copies per cell, thereby reducing the possibility for inter-plasmid
recombination (particularly when using recombination-deficient bacteria) that
would cause deletions and other rearrangements. Earlier efforts attempted to
use YACs or yeast artificial chromosomes, which have an even higher cloning
capacity. However, the high rate of rearrangements in YAC vectors has in
essence precluded their large-scale use. Thus BAC libraries, with a number of
independent clones corresponding to a very large number of genome equivalents,
were constructed for the human genome. In an approach that shares many
features in common with “shotgun sequencing” (yet not dependent at this stage
on actual sequencing), scientists devised strategies to very effectively identify
partially overlapping clones. The main strategies were “STS content mapping”
and “BAC fingerprinting”, which were both applied in factory mode in specialized
genome centers. In the “STS content mapping” approach, DNA extracted from
hundreds of thousands of independent BAC clones is tested for the
presence/absence of hundreds of thousands of STS. Different BAC clones that
are positive for the same STS must, by definition, be overlapping. By studying
the STS sharing between all pairs of BACs it is possible to generate what is called
a BAC contig, i.e. a collection of partially overlapping BAC clones that together
cover a large segment of the genome. To reduce the number of PCRs to conduct,
clever “pooling” strategies were used. Rather than testing BACs individually,
PCRs were conducted using pools containing DNA from many BACs. As every
BAC is assigned to several pools, it is possible, a posteriori, to determine which
BACs are positive in a given positive pool. In the “BAC fingerprinting” approach,
DNA was extracted from individual BACs, digested with a cocktail of restriction
enzymes, and the resulting fragments separated by polyacrylamide gel
electrophoresis. A subset of the fragments was then visualized using a variety of
strategies. Obviously if one generates such a BAC fingerprint twice for the same
BAC, it will be identical. If one compares the BAC fingerprints of two
non-overlapping BACs, their restriction patterns will be completely different; the
chance to observe two or more fragments of the same size is low. If two BACs
are partially overlapping, the degree of band sharing will be intermediate, hence
allowing overlapping BACs to be recognized. This approach was also used in factory
mode, and the ensuing information used to generate BAC contigs. Whether
generated using STS content mapping or BAC fingerprinting, the resulting
“physical maps” of the human genome comprised a large number of contigs
spanning on average ... Kb, with N50 of ... . BAC contigs containing microsatellites
or STS positioned on linkage and/or RH maps could be anchored (that is
positioned) and often ordered (if containing more than one mapped STS) with
respect to each other on the human genome.
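The logic of STS content mapping (two BACs that are positive for the same STS overlap, and chains of overlapping BACs form a contig) can be sketched as a connected-components search. The clone and STS names are hypothetical and the function is an illustration, not the software used by the genome centers.

```python
def bac_contigs(sts_content):
    """sts_content maps each BAC name to the set of STS it is positive for.
    BACs sharing at least one STS are treated as overlapping; contigs are
    the connected groups of overlapping BACs."""
    # Index the data the other way round: STS -> BACs positive for it.
    bacs_by_sts = {}
    for bac, sts_set in sts_content.items():
        for sts in sts_set:
            bacs_by_sts.setdefault(sts, set()).add(bac)

    # Two BACs are neighbours if they share an STS.
    neighbours = {bac: set() for bac in sts_content}
    for bacs in bacs_by_sts.values():
        for bac in bacs:
            neighbours[bac] |= bacs - {bac}

    # Depth-first search to collect each connected group.
    seen, contigs = set(), []
    for bac in sts_content:
        if bac in seen:
            continue
        seen.add(bac)
        stack, members = [bac], []
        while stack:
            current = stack.pop()
            members.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        contigs.append(sorted(members))
    return sorted(contigs)

library = {"B1": {"s1", "s2"}, "B2": {"s2", "s3"}, "B3": {"s3"}, "B4": {"s9"}}
print(bac_contigs(library))
```

B1 and B3 share no STS, yet they end up in the same contig because both overlap B2; B4 remains a singleton.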
To monitor how well the corresponding physical map covered the different
chromosomes, a subset of individual BACs were fluorescently labeled and
directly hybridized to metaphase chromosomes in experiments called
fluorescent in situ hybridization or FISH, to determine the exact position of the
corresponding BAC contig on the human genome.
Subsequently, scientists selected a “minimum tiling path” of BACs, i.e. the smallest
possible collection of BACs that would jointly cover as much as possible of the
human genome with minimum overlap, and these were then subjected to shotgun
sequencing as described above. This implied sub-cloning of the BAC DNA in a
regular plasmid vector, Sanger sequencing of a large number of randomly
selected plasmid clones for each BAC, and reassembly of the BAC’s sequence on
the basis of the observed overlap between sequence reads. The sequences of
neighboring BACs were then concatenated to progressively generate larger
“chunks” of the sequence of the human genome. The first reference sequence of
the human genome was largely generated by a publicly funded effort dominated
by American and British laboratories that used this “hierarchical sequencing
approach”. A private company (“Celera”) attempted to overtake the public effort
using a direct approach based only on shotgun sequencing combined with the
systematic use of mate pair libraries. The race was somewhat flawed, however, as
Celera had access to the data generated by the public effort and made readily
available in the public domain. Both versions of the human reference sequence,
generated respectively by the publicly funded consortium and Celera, were
published at the same time in Nature and Science, respectively. The generation
of reference genomes for other organisms, including 29 mammals today, is now
nearly entirely based on direct sequencing approaches.
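Selecting a “minimum tiling path” as described above is, in its simplest form, an interval-covering problem. The sketch below uses the classical greedy strategy and assumes that every BAC has already been placed at known start and end coordinates on a contig; those coordinates and clone names are hypothetical. It minimizes the number of clones only, and does not model the additional criteria (clone quality, amount of overlap) a genome center would weigh.

```python
def minimum_tiling_path(bacs, region_start, region_end):
    """bacs: list of (name, start, end) tuples on a common coordinate system.
    Greedy cover: among the clones starting at or before the current
    position, always pick the one reaching furthest to the right."""
    ordered = sorted(bacs, key=lambda bac: bac[1])
    path, position, i = [], region_start, 0
    while position < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= position:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= position:
            raise ValueError(f"gap in clone coverage at position {position}")
        path.append(best[0])
        position = best[2]
    return path

# Hypothetical clone placements (coordinates in Kb).
clones = [("A", 0, 150), ("B", 50, 200), ("C", 100, 300),
          ("D", 250, 400), ("E", 280, 420), ("F", 390, 500)]
print(minimum_tiling_path(clones, 0, 500))
```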
C. ANNOTATING THE REFERENCE SEQUENCE OF THE HUMAN GENOME
The reference sequence of the human genome is in essence 24 long strings (the 22
autosomes and the two gonosomes, X and Y) of As, Cs, Gs and Ts
amounting to ∼3 × 10^9 bases. Obviously, that is not very informative per se,
unless this sequence is “annotated”. This means that one would like to know the
position of the genes, including limits between exons and introns, that one would
like to know where the cis-acting regulatory elements are that control the
expression of the genes, that one would like to know the location and type of the
interspersed repetitive elements, that one would like to know where
centromeres and telomeres start, and – why not – that one would like to identify novel
essential features of our genome from the clever examination of its sequence.
Masking repetitive sequences. One of the first steps in annotating a genome is
to identify repetitive sequences. By definition, the basis for the recognition of
repetitive sequences is that these match multiple similar sequences in the
genome when used as query sequence. From such matches, consensus
sequences representative of distinct families and sub-families can be generated
in an iterative process; these can then serve as queries for the identification of as
many “homologous” repeats as possible. The RepeatMasker software
(http://www.repeatmasker.org) and the associated Repbase database of repetitive
elements are the preferred tools to fulfill this task. RepeatMasker will identify interspersed
repeats belonging to the four major classes as well as simple sequence repeats,
provide summary statistics about the number of elements found in each class
and what fraction of the genome they represent, and return a sequence in which
the repeats have been masked if so requested.
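What “masking” means in practice can be shown with a small sketch. It does not call RepeatMasker; it assumes that repeat coordinates have already been obtained from some annotation and simply applies them to a sequence. The function name, the 0-based end-exclusive coordinate convention and the toy sequence are assumptions made for the example.

```python
def mask_repeats(sequence, repeats, soft=False):
    """Mask annotated repeats in a DNA sequence.
    repeats: list of (start, end) intervals, 0-based, end-exclusive.
    Hard masking replaces repeat bases with 'N'; soft masking lowercases
    them so that the underlying sequence remains readable."""
    bases = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)

toy = "ACGTACGTAC"
print(mask_repeats(toy, [(2, 5)]))             # hard-masked
print(mask_repeats(toy, [(2, 5)], soft=True))  # soft-masked
```

Hard masking hides the repeat from downstream alignment altogether, whereas soft masking leaves the choice to the downstream program.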
Establishing a gene catalogue. Identifying our genes is obviously one of the
most important goals of the annotation process, as these are assumed to embody
the “raison d’être” of our genome.
The definition of what constitutes a gene is evolving, yet an undisputed common
feature of all our genes is that they are transcribed. Indeed it is either the
resulting RNA (non-coding genes) or the product of its translation (protein
coding genes) that fulfills their function. Thus, identifying all the regions of our
genome that are transcribed (i.e. our transcriptome) is probably the approach
that has been most informative in establishing our catalogue of genes. Note that
while all genes are transcribed, not all that is transcribed constitutes a
gene. In the early days, the mRNA sequence of many genes was determined
before their genomic sequence, particularly if abundantly expressed in a specific
tissue. cDNAs were used as probes to isolate and then sequence cognate
genomic fragments. This led to the recognition of the mosaic (i.e. comprising
exons and introns) structure of eukaryotic genes. In the nineties, large-scale
“brute force” sequencing of cDNAs was recognized as an effective strategy to
identify new genes. cDNA libraries were generated (usually from
polyadenylated RNA) from as many tissues (both healthy and diseased) and
developmental stages as possible. Clones were randomly picked from such cDNA
libraries and subjected to large-scale sequencing, initially using Sanger sequencing.
The resulting reads, corresponding typically to partial mRNA sequences, were
referred to as Expressed Sequence Tags (ESTs). Partially overlapping ESTs
were assembled into longer-length (if not full-length) mRNA sequences, often
revealing isoforms resulting from alternative splicing and polyadenylation.
“Blasting” the ensuing mRNA sequences against the reference sequence of the
human genome located the corresponding gene, revealed its exon-intron
structure, and allowed prediction of its most likely open reading frame (ORF).
This approach was applied systematically by several genome centers, both public
and private, including companies such as Celera. EST sequencing was
complemented by approaches that extracted short mRNA-specific cDNA tags,
which were concatenated prior to Sanger sequencing. “Serial Analysis of Gene
Expression” or SAGE extracted tags in the vicinity of poly-A tails, while “Cap
Analysis of Gene Expression” or CAGE extracted tags in the vicinity of the 5’ cap.
Counting the number of times a given tag was sequenced provided the first
“digital” measures of gene expression. Sanger sequencing-based approaches
were subsequently complemented with array-based approaches to
characterize the transcriptome. Tissue-specific RNAs were fluorescently labeled
and hybridized to arrays comprising millions of tiled oligonucleotides jointly
spanning the entire human reference genome. Transcribed segments of the
genome were recognized by virtue of the fluorescence of the RNA binding to the
corresponding probes. The intensity of the fluorescent signal provided
information about the relative abundance of distinct transcripts. These
experiments were amongst the first to reveal the pervasive transcriptional
potential of our genome. Indeed, in addition to the dominating signal of protein
encoding genes and a growing catalogue of long and short non-coding genes, lower
level hybridization (and hence transcription) appeared to be widespread,
encompassing as much as 75% of our genome. The biological relevance of this
pervasive transcription remains an essential, as of yet unanswered question.
More recently, “RNA-Seq” has become the method of choice for qualitative and
quantitative characterization of “transcriptomes”. Typically, RNAs are reverse
transcribed into cDNAs, which are then sequenced by synthesis. Strand-specific
methods have been developed to maintain information about the strand that is
actually transcribed. The structure of transcripts is reconstructed on the basis of
overlapping reads and using paired-end information when available, hence
informing about exon-intron gene organization and the occurrence of isoforms.
Sequence depth provides digital information about expression level. Sequencing
by synthesis still suffers from some drawbacks linked to the shortness of the sequence
reads, the necessity of reverse transcription and the need for target
amplification, which may introduce unwanted representational biases. In the
future, single molecule sequencing may overcome some of these limitations.
The picture that emerges from the application of these increasingly powerful
methods for the characterization of the human transcriptome is (i) the lower
than expected number of protein encoding genes, (ii) the abundance of non-coding
RNA genes including previously unknown small and long non-coding RNA
genes, with – for many of them – still uncharacterized function, (iii) the
unsuspected complexity of the “transcriptional units” corresponding to known
protein coding genes including pervasive alternative splicing, antisense
transcripts, sense and antisense transcripts associated with the 5’ end of genes
and antisense transcripts associated with the 3’ end of genes.
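The prediction of the “most likely open reading frame” from an assembled mRNA sequence, mentioned above, can be approximated by searching for the longest ATG-to-stop stretch in the three forward reading frames. This is a naive sketch under that assumption; real gene-finding tools use considerably richer models, and the toy sequence is invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(mrna):
    """Return (start, end) of the longest ORF on the given strand,
    0-based and end-exclusive, stop codon included; (0, 0) if none."""
    seq = mrna.upper().replace("U", "T")
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None                   # look for the next ORF
    return best

toy_mrna = "CCATGAAATTTGGGTAACC"
start, end = longest_orf(toy_mrna)
print(start, end, toy_mrna[start:end])
```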
While aligning RNA sequences with the sequence of the reference genome has
provided the bulk of the information used to assemble our gene catalogue, other
methods have provided essential complementary and/or confirmatory
information. Ab initio gene prediction programs based on machine learning
methods (particularly Hidden Markov Models) have been developed and
perform very well for the detection of protein coding genes. Coding exons are
subject to strong purifying selection (they evolve more slowly than the rest of
the genome, as most mutations in them are deleterious and hence eliminated by
natural selection; see hereafter). As a consequence, the sequences that can still
be recognized as homologous, when comparing the human genome with that of
birds or fish, are very strongly enriched in coding exons. Combined with the fact
that its genome is largely devoid of repetitive sequences (which greatly reduces
the magnitude of the effort), this observation was the primary motivation for the
determination of the complete genomic sequence of the pufferfish. Finally, the
transcriptional process imposes specific epigenetic marks on the cognate
chromatin (f.i. H3K36me3), and detecting these marks has become an effective
strategy to monitor transcription (see “CHIP-Seq” hereafter). As an example, many
long non-coding RNA genes have been identified using this approach.
Identifying cis-acting regulatory elements. Gene annotations typically
include the 5’UTR, coding exons and intervening introns, and the 3’UTR. They do
not include the regulatory elements that control the expression of the gene in cis
(whether at the transcriptional level or at a later stage). Yet, it is essential to
identify these “switches” as mutations in them might perturb gene function and
hence underlie phenotypic variation including disease. Several approaches have
been devised in the last decades that can be adapted for the large-scale
identification of cis-acting regulatory elements. These include the identification
of evolutionary constraints, chromatin immunoprecipitation combined with next
generation sequencing, and the identification of DNaseI hypersensitive sites by
means of next generation sequencing.
All mammals derive from a common ancestor species that roamed the earth ∼
225 million years ago. As new species emerged and evolved independently,
their genomes progressively diverged from that of the ancestor species and
hence from each other. This progressive divergence is largely driven by the
processes of mutation (that generates new variants), and random drift (that
leads to the fixation in the population of a small fraction of the neo-mutations;
see hereafter). In the absence of other forces (corresponding to “neutral”
evolution), the divergence evolves at a nearly constant rate of ∼10^-9 substitutions
per base and per year of evolution. However, this “background” rate applies
only to regions of the genome that are not functional, such as large proportions
of the interspersed repetitive elements. Genomic segments that fulfill an
essential function usually evolve more slowly: they are evolutionarily constrained.
This is due to the fact that mutations in them will usually be deleterious, hence
decreasing the reproductive capacity of the individuals that inherit them, hence
decreasing their chance of fixation in the population. This process of “purifying
selection” is easily recognizable from the comparison of the conservation of 1st,
2nd and 3rd positions in codons. Mutations at 3rd positions are more often
synonymous than those at 1st and 2nd positions, and 3rd positions are
correspondingly less conserved than 1st and 2nd positions. It was rapidly
recognized that cis-acting regulatory elements shared amongst species might be
effectively identified by virtue of their evolutionary constraints. As soon as the
human genome was completed, the main genome centers applied their
sequencing capacity to a growing number of species spanning the tree of life. If
a sufficient number of species were sequenced it would be possible to identify
individual bases that – within the groups of sequenced species – evolve more
slowly than expected in the absence of selection. Recently, the genome sequence
of the first 29 eutherian mammals was reported. These allowed for the reliable
identification of evolutionary constraints not at one but at 12-base pair
resolution. Millions of constrained elements were hence identified of which >
60% lay outside of gene boundaries defined as above. They amounted to ∼5%
of the human reference genome. They included known codons, newly discovered
codons, RNAs with conserved secondary structure, cis-acting regulatory
elements operating at the transcriptional and post-transcriptional level of which
hundreds of thousands were derived from mobile elements, etc. Mining the
sequences in evolutionarily constrained elements revealed hundreds of target motifs
for trans-acting factors including transcription factors and miRNAs.
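The statement that mutations at 3rd codon positions are more often synonymous than those at 1st and 2nd positions can be checked directly against the standard genetic code. The sketch below enumerates every single-base change in every sense codon; the compact encoding of the codon table (bases ordered T, C, A, G) is a common convention introduced here for the example, not something taken from these notes, and all changes are weighted equally, which ignores real mutation biases.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def synonymous_fraction_by_position():
    """For codon positions 1, 2 and 3, return the fraction of all possible
    single-base substitutions in sense codons that leave the amino acid
    unchanged."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":          # skip stop codons
            continue
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                total[position] += 1
                if CODON_TABLE[mutant] == amino_acid:
                    synonymous[position] += 1
    return [s / t for s, t in zip(synonymous, total)]

for position, fraction in enumerate(synonymous_fraction_by_position(), start=1):
    print(f"codon position {position}: {fraction:.3f} synonymous")
```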
This study identified elements whose function is conserved amongst eutherians
or placental mammals. Identifying regulatory elements which are unique to (and
hence define) for instance primates (or other branches of the mammalian
tree) will require the sequencing of more primates to achieve appropriate
statistical power. Initiatives to determine the sequence of 10,000 vertebrate
species are underway, with that very objective in mind
(http://genome10k.soe.ucsc.edu): to identify regulatory elements that are
specific for the different clades in the vertebrate tree. By sequencing a large
enough number of humans, one could likewise identify human-specific
regulatory elements by virtue of the observed paucity or depletion in
polymorphisms in specific residues.
As an alternative approach for the identification of cis-acting regulatory
elements, one can study the epigenetic marks that characterize and differentiate
such elements, i.e. the histone code. To approach this in a systematic way, one
increasingly uses chromatin immunoprecipitation combined with next
generation sequencing (“CHIP-Seq”). In this method, cells
representing a tissue type of interest (or if possible as many cell types as
possible) are first soaked in formaldehyde. This freezes the structure of the
chromatin close to its native state by cross-linking DNA and histones covalently,
albeit reversibly. The frozen chromatin is then extracted, fragmented by
sonication, and incubated with an antibody that recognizes a specific histone
modification. Specific antibodies are presently available for tens of such histone
modifications including methylation, acetylation and phosphorylation of
specific residues in the amino-terminal tail of the four nucleosomal histones
(H2A, H2B, H3 and H4). The antibodies are then precipitated and with them the
chromatin fragments to which they bind. The formaldehyde-induced cross-links
are reversed by heat treatment, the released DNA fragments recovered and
sequenced on next generation sequencers. By mapping the reads back to the
reference genome, one obtains a direct, quantitative reading of which segments
of the genome carried the corresponding modification. The experiment is
repeated with the different antibodies, yielding genome maps for the different
histone modifications. The same CHIP-Seq experiments can be conducted with
antibodies recognizing specific trans-acting transcription factors, RNA
polymerases, factors binding to insulators (f.i. CTCF), components of the
polycomb groups, proteins mediating the interaction between trans-acting
factors and the basal transcription machinery (f.i. p300), etc. The analysis of
the resulting CHIP-Seq maps (by means of Hidden Markov Models) reveals a
limited number of chromatin states, each with their own histone code,
corresponding (amongst others) to promoters (active, weak or poised),
enhancers (strong, poised), insulators, transcribed regions, polycomb-repressed
regions, heterochromatin, etc.
Cis-acting regulatory elements, including promoters, enhancers/silencers and
insulators, have been known for decades to share hypersensitivity to DNases as a
common, distinctive feature. Thus, if one treats nuclei with limiting amounts of
DNase, the genome will preferentially be cut at such cis-acting regulatory
elements. This property has been exploited to locate such elements for decades,
classically by means of Southern blotting targeting genomic regions of interest.
More recently, DNase hypersensitivity protocols have been adapted to next
generation sequencing, to allow for the systematic identification of regulatory
elements in the entire genome. Indeed, isolating small DNA fragments released
by DNase treatment (and hence enriched in DNA located in cis-acting elements),
subjecting these to next generation sequencing, and mapping the resulting reads
back to the reference genome identifies, at high resolution, the cis-acting
regulatory elements that were active in the corresponding cell type. While this
approach does not differentiate the different types of cis-acting regulatory
elements, it recognizes a large number of them in a single experiment.
It is increasingly recognized that cis-acting regulatory elements can act over long
distances (e.g. >250 Kb) and control the expression of multiple target genes. The
present working model is that a loop of DNA is formed, allowing the trans-acting
factors bound to the cis-acting element to physically interact (directly or via
mediator proteins such as p300) with the basal transcription machinery bound to
the proximal promoters, and thereby to stimulate it (in the case of enhancers) or
inhibit it (in the case of silencers). As a matter of fact, a gene may encompass
(within its introns) cis-acting regulatory elements controlling neighboring or
even more distant genes. It is therefore not straightforward to determine which
genes are controlled by the gene switches identified by ChIP-Seq or DNase
hypersensitivity. An indirect strategy to connect cis-acting regulatory elements
with their target genes is to look for a correlation between the activity of a gene
(as measured by the analysis of the transcriptome) and the activity of the
regulatory element (as measured by ChIP-Seq or DNase hypersensitivity) across
many cell types. A more direct, recently developed method is chromatin
conformation capture (3C, with extensions referred to as 4C, 5C and Hi-C). In
this method the chromatin is frozen by cross-linking with formaldehyde, and
fragmented by sonication, as for ChIP-Seq. The cross-linking is predicted to
freeze the loops that are formed by the transcription factor-mediated interaction
between enhancers/silencers and their target promoters. The resulting DNA ends
are then repaired using biotinylated nucleotides prior to ligation under dilute
conditions (to favor ligation between DNA ends brought together by
cross-linking). This should generate ligation products between DNA fragments in the
vicinity of the enhancers/silencers and DNA fragments in the vicinity of the
promoters. After ligation, the cross-linking is reversed, the trapped DNA is
released and refragmented by sonication, and the fragments encompassing ligation
points are enriched with streptavidin and then subjected to paired-end next
generation sequencing. Loops that are stabilized by specific mediator
proteins can be enriched (prior to reversing the cross-linking) by ChIP using
cognate antibodies. Interactions between cis-acting regulatory elements and
their target promoters are then recognized by paired-end reads of which one maps
to a proximal promoter and the other to a cis-acting regulatory element on the
same chromosome.
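The indirect, correlation-based strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`pearson`, `rank_candidate_targets`), the gene labels and all numerical values are hypothetical, and a real analysis would also need normalization and significance testing across many more cell types.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation across cell types: no evidence either way
    return sxy / sqrt(sxx * syy)

def rank_candidate_targets(element_activity, gene_expression):
    """Rank genes by how well their expression tracks one element's activity.

    element_activity: activity of the regulatory element, one value per cell type
    gene_expression:  dict mapping gene name -> expression, same cell-type order
    """
    scores = {gene: pearson(element_activity, expr)
              for gene, expr in gene_expression.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical values for five cell types (illustration only).
enhancer_signal = [0.1, 0.9, 0.8, 0.2, 0.7]
expression = {
    "geneA": [1.0, 9.5, 8.0, 2.0, 7.5],  # tracks the enhancer closely
    "geneB": [5.0, 5.1, 4.9, 5.0, 5.2],  # roughly constant
}
ranking = rank_candidate_targets(enhancer_signal, expression)
```

In this toy example the gene whose expression follows the enhancer signal ranks first. A high correlation only suggests a regulatory link; it does not demonstrate the physical contact that chromatin conformation capture aims to detect.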
The large-scale implementation of the methods described above, with the aim of
identifying all the functionally important elements in the human genome, has been
coordinated at the international level as part of the ENCODE project
(Encyclopedia of DNA elements; https://genome.ucsc.edu/ENCODE/).
Transcriptome, ChIP-Seq and DNase hypersensitivity data were generated for >
70 cell types, identifying millions of putative functional elements encompassing
as much as 80% of the reference sequence of the human genome. The
corresponding figures largely surpass the estimates based on evolutionary
constraints (∼5%). The meaning of this discrepancy remains to be resolved.
Other -omes.
Genome browsers.
2. Individual human genomes
A. GENETIC VARIANTS: SNPS, INDELS, SSRS, CNVS AND OTHER VARIANTS
Types of genetic variants. The human reference genome is often depicted as
being “our” genome. As a matter of fact, if we were to sequence our own,
individual genomes (something which will increasingly become reality in the
future), we would observe many differences between our own genome and the
human reference genome. It is now well established that if one sequences two
human genomes drawn at random from a population (for instance the genome
that you inherited from your mother and the one you inherited from your father),
these will differ from each other at millions of sites.
The first and numerically predominant such variable sites are Single Nucleotide
Polymorphisms or SNPs. As their name implies, these are differences between
two genomes involving a single base pair. SNPs comprise transitions (A=T
versus G≡C base pair), transversions (A=T vs T=A; A=T vs C≡G; G≡C vs C≡G),
and the insertion or deletion (“INDEL”) of a single base pair. Two human
genomes typically differ at approximately 3 million SNPs, or one SNP every 1,000
base pairs (see hereafter).
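As a small illustration of the distinction between transitions and transversions, the following Python sketch classifies a single-base substitution from the two bases observed on one strand. The function name `classify_substitution` is hypothetical; the only assumption is the standard grouping of bases into purines (A, G) and pyrimidines (C, T).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution.

    A transition exchanges two purines or two pyrimidines (A<->G, C<->T),
    i.e. an A=T base pair for a G≡C base pair. Any other change is a
    transversion.
    """
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

# SNP density implied by the figures quoted in the text.
snps_between_two_genomes = 3_000_000
genome_size_bp = 3_000_000_000
bp_per_snp = genome_size_bp / snps_between_two_genomes
```

The last three lines simply restate the arithmetic of the paragraph: 3 million differences over 3 billion base pairs corresponds to one SNP per 1,000 base pairs.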
SNPs are not the only kind of genetic polymorphism. Other well-known genetic
variants are Simple Sequence Repeats or SSRs, corresponding to micro- and
mini-satellites. These are tandem repetitions of sequence motifs, ranging from
one to five base pairs for microsatellites and up to fifty or more for
minisatellites (these limits are entirely arbitrary). The main feature of SSRs is
that the number of tandem repetitions often differs between individual genomes.
While SNPs are typically biallelic (i.e. only two alleles are usually observed at
appreciable frequencies in the population), SSRs are very often multi-allelic, i.e.
characterized by more than two (sometimes many) alleles. As a consequence,
as soon as one considers ten or more highly polymorphic SSRs, two
individuals virtually never have the same “composite” genotype (with the
exception of monozygotic twins). These “composite” genotypes have therefore
been called “DNA fingerprints” and have found widespread application in
forensics.
Another important class of genetic variants is Copy Number Variants or CNVs.
CNVs correspond to large segments of the genome (typically >100 Kb), which are
observed in individual genomes in different copy numbers. CNVs often coincide
with so-called “segmental duplications” (although not all segmental duplications
are polymorphic and hence CNVs). The multiple copies of a CNV can either
localize to the same chromosomal region (often in tandem or head-to-tail) or be
dispersed in the genome. It is estimated that more than 10% of our genome is
subject to copy number variation, including segments of the genome that contain
genes. As a consequence, individuals differ in the number of copies they have of
some genes. CNVs are therefore thought to have an important impact on the
individual’s phenotype, including disease (see hereafter).
In addition to these main classes of genetic variants, individual genomes may
differ by chromosomal variants including large deletions, insertions, inversions
and translocations.
Minor allele frequency, nucleotide diversity and number of polymorphic sites.
As mentioned above, sequencing and comparing two genomes (e.g. randomly
sampled in Northern Europeans) will typically reveal of the order of 3 million
SNPs. From this one can compute a classical measure of genetic polymorphism,
referred to as nucleotide diversity (π), corresponding to the average
heterozygosity per nucleotide site, and which is approximately 0.001 in
Northern Europeans. If one were to sequence and compare another pair of
randomly sampled genomes, one would obtain a very similar value of π.
However, if one now compares the four genomes jointly, the number of
polymorphic sites (i.e. the number of sites for which at least one of the four
genomes differs from the other ones) will be larger than 3 million. As a matter
of fact, every time one sequences a new individual, one will uncover new
polymorphic sites (yet the average value of π between all pairs of individual
genomes will remain at 0.001). If one were to sequence the whole of humankind, it
is possible that nearly every one of the 3 billion sites of our genome would be
found to differ in at least one individual. Thus, if one looks hard enough,
every site in our genome is potentially polymorphic. All these polymorphic sites
can be sorted according to their Minor Allele Frequency or MAF. Assume a
classical SNP characterized by two alleles, for instance A=T and G≡C. If one knew
the genotype of 1,000 genomes for this SNP, one could compute what proportion
of these carry the A=T allele and what proportion carry the G≡C allele. One
of these is likely to be less frequent than the other one. It is the minor allele, and
the corresponding proportion or frequency is the MAF of that SNP. By definition,
the MAF ranges from 0 to 0.5. There are approximately 7 million SNPs with MAF
> 0.05 in Northern European populations. These are said to be “common” SNPs.
There are probably an equal number of SNPs with 0.05 > MAF > 0.005. These are
referred to as “low frequency” SNPs. And finally there is an undetermined but
larger number of SNPs with MAF < 0.005 (again, these limits are arbitrarily
defined). These are referred to as “rare” SNPs. The frequency distribution of
MAFs is typically exponential, with many more rare than common variants.
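The quantities defined in this paragraph can be computed directly from aligned sequences. The sketch below is a minimal Python illustration; the function names and the toy sequences in the comments are hypothetical. Because the text does not say how to treat values exactly equal to 0.05 or 0.005, the code makes an arbitrary choice and assigns such boundary values to the lower-frequency class.

```python
from itertools import combinations

def minor_allele_frequency(alleles):
    """MAF of a biallelic site, given one allele per sampled chromosome."""
    counts = {}
    for allele in alleles:
        counts[allele] = counts.get(allele, 0) + 1
    if len(counts) > 2:
        raise ValueError("site is not biallelic")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / len(alleles)

def maf_class(maf):
    """Classify a site using the (arbitrary) thresholds quoted in the text."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf == 0.0:
        return "monomorphic"
    if maf > 0.05:
        return "common"
    if maf > 0.005:
        return "low frequency"
    return "rare"

def nucleotide_diversity(sequences):
    """Average proportion of differing sites over all pairs of sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned, non-empty, equal length")
    pairs = list(combinations(sequences, 2))
    differences = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return differences / (len(pairs) * length)
```

For instance, `minor_allele_frequency(list("AAAG"))` gives 0.25, which `maf_class` labels "common". Note that `nucleotide_diversity` averages over pairs, which is why π stays stable as more genomes are added, whereas the count of polymorphic sites keeps growing.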
Germline mutations, drift and natural selection. What is the origin of
genetic polymorphism and what explains the exponential MAF distribution?
Genetic variants are generated by the process of de novo mutation in the
germline. Every gamete carries of the order of 30 new variants that were
generated as a result of the inherent imperfections of the DNA replication and
repair machinery during the many cell divisions undergone by cells of the
germline between fertilization and gametogenesis. The average number of cell
divisions between fertilization and the production of an oocyte is 24, and this
number is not affected by the age of the mother (oocytes have reached meiosis I
before the birth of the female fetus). The average number of cell divisions
between fertilization and the production of a sperm cell is ∼30 + 23n + 5, where n
is the age of the father minus 15. The number of cell divisions needed to produce
a sperm cell is therefore larger than for an oocyte, which largely explains why
sperm cells on average carry more de novo mutations than oocytes, and why this
number increases with the age of the father. The process of de novo mutation
generates a new allele that is called the “derived allele”, while the original
sequence is called the “ancestral allele”.
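The arithmetic in this paragraph is easy to make explicit. The following Python sketch evaluates the formula quoted above for the number of germline cell divisions; the function name and the example ages are illustrative choices, not values from the text.

```python
OOCYTE_DIVISIONS = 24  # independent of maternal age, as stated in the text

def sperm_cell_divisions(father_age):
    """Approximate number of cell divisions leading to a sperm cell.

    Implements 30 + 23n + 5, where n is the father's age minus 15.
    """
    n = father_age - 15
    if n < 0:
        raise ValueError("formula is only defined for ages of 15 and above")
    return 30 + 23 * n + 5

# Compare paternal and maternal germlines at a few (arbitrary) paternal ages.
for age in (20, 30, 40):
    divisions = sperm_cell_divisions(age)
    ratio = divisions / OOCYTE_DIVISIONS
    print(f"age {age}: {divisions} divisions, {ratio:.1f}x the oocyte value")
```

The number of divisions grows linearly with paternal age, by 23 per year according to the formula, while the maternal value stays fixed.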
What happens to these tens of de novo mutations inherited by every
conceptus? The main factor that determines the fate of the majority of these de
novo mutations is “luck”, referred to in genetics as “random drift”. Imagine an
“isolated” population with 100 randomly breeding individuals. Consider one
locus in the genome and imagine that you can unambiguously distinguish and
track the 200 alleles of the 100 individuals. Let the individuals breed for one
generation and let us examine what happened to the 200 alleles observed in
generation “0”. It is very likely that, just by chance, some of the 200 alleles will
already be missing in generation “+1”, while some others will be present more
than once. Thus, just as a result of the stochastic process by which chromosomes
are sampled in generation “0” to produce generation “+1”, some alleles are lost
each generation while others see their frequency increase. This process repeats
itself each generation, leading to the inescapable outcome that at some point in
time, only one allele of the original 200 will still be present in the population. At
that time all the alleles in the population are said to be identical-by-descent, as
they all trace back to the same common ancestor allele. This process is called the
“coalescent”. It can be shown that the expected number of generations
separating all the alleles present in the population at one point in time from their
Most Recent Common Ancestor (MRCA) is ∼4N generations, where N is the size
of the population. A de novo mutation inherited by an individual living in a
population of size N has a probability 1/(2N) of becoming fixed in the entire
population (and hence completely replacing the ancestral allele) ∼4N generations
later. Thus we have on the one hand the process of de novo mutation that injects
new variants into the population, and on the other hand the process of random
drift that purges old and new variants from the population (while letting
some lucky new ones survive and sometimes even replace the old ones, which
explains why the genomes of different species progressively diverge from each
other). The result is a steady state equilibrium that is characterized by a
predictable nucleotide diversity
$$\pi = \frac{4N\mu}{4N\mu + 1}$$
(where µ corresponds to the mutation rate per generation), as well as by a
predictable exponential frequency distribution of MAF. The larger the
population size and the higher the mutation rate, the higher the nucleotide
diversity. The frequency of derived alleles is typically correlated with their age.
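Both results quoted above, the equilibrium diversity and the 1/(2N) fixation probability of a new neutral mutation, can be explored numerically. The Python sketch below is a minimal illustration and not a population-genetics package: it assumes a simple Wright-Fisher model (constant population size, random mating, non-overlapping generations), and the function names, seed and parameter values are all hypothetical.

```python
import random

def expected_diversity(n, mu):
    """Equilibrium nucleotide diversity, pi = 4*N*mu / (4*N*mu + 1)."""
    scaled = 4 * n * mu
    return scaled / (scaled + 1)

def fixation_fraction(n_individuals, replicates, seed=0):
    """Fraction of new neutral mutations that drift to fixation.

    Each replicate starts with one mutant copy among 2N alleles. Every
    generation, the 2N alleles of the next generation are drawn at random
    (with replacement) from the current one, until the mutant allele is
    either lost (0 copies) or fixed (2N copies).
    """
    rng = random.Random(seed)
    two_n = 2 * n_individuals
    fixed = 0
    for _ in range(replicates):
        count = 1
        while 0 < count < two_n:
            freq = count / two_n
            count = sum(1 for _ in range(two_n) if rng.random() < freq)
        if count == two_n:
            fixed += 1
    return fixed / replicates

# Illustrative values: N = 10,000 and mu = 1e-8 per site per generation
# (roughly 30 new mutations per gamete over 3e9 sites).
pi_example = expected_diversity(10_000, 1e-8)

# Small population so that the simulation runs quickly: expect about 1/(2N) = 0.05.
observed = fixation_fraction(n_individuals=10, replicates=5000, seed=1)
```

With a small population the simulated fixation fraction should fluctuate around 1/(2N); most new mutations are lost within a few generations, which is the “luck” described in the text.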
The description of the combined effects of de novo mutation and random drift on
the genetic polymorphism of populations is known as the neutral theory of
molecular evolution. Although these two factors combined account to a
considerable degree for the observed polymorphism, one obviously has to
include selective forces for a full description and understanding of genetic
variation. One typically distinguishes three categories of selective forces:
negative or purifying selection, positive selection, and balancing selection.
Purifying selection acts on genetic variants that have a deleterious effect on the
genetic fitness of the individuals that carry them. This increases the probability
that the variant will be purged from the population. As it is much more likely
that a de novo mutation in a functionally important element of the genome will
compromise its function rather than improve it, functionally important elements
undergo the effects of purifying selection and therefore evolve more slowly.
This property has been exploited to identify functionally important elements in
the human genome by comparing it with the genomes of other species (see above).
Another clear manifestation of purifying selection is the observation that first
and second codon positions are more often conserved amongst species than
third codon positions. This is due to the fact that mutations of the third codon
position are more likely to be synonymous (therefore not altering protein
function) than mutations of the first and second positions of codons, which are
nearly always non-synonymous. Sites undergoing purifying selection are
typically characterized by lower π values, and a shift of MAF towards lower
values as well. The latter reflects the fact that negative selection makes it harder
for (mildly) deleterious variants to increase in frequency in the population.
Exceptionally, de novo mutations generate variants that confer a selective
advantage to their carriers. Such variants will be subject to positive selection.
The advantage they confer increases their probability of fixation and accelerates
their rate of fixation. The resulting “selective sweep” may leave a detectable
signature of reduced local genetic variation, as the haplotype in which the
favorable mutation was embedded is fixed with the mutation. A very convincing
example of such a sweep in humans is the selection for regulatory mutations
near the lactase (LCT) gene that extend its intestinal expression, and hence lactose
tolerance, past weaning. Such mutations were independently selected in
populations that relied heavily on the consumption of dairy products in their diet,
including in Northern Europe and some Nilotic populations of Africa.
The third form of selection is referred to as balancing selection. In this case, a
genetic variant undergoes positive selection under some circumstances and
negative selection under other circumstances, potentially leading to a situation
in which the variant is maintained in the population at intermediate frequency
over long periods of time. One cause of balancing selection is heterozygote
advantage or overdominance, in which the heterozygotes have a higher fitness
than either homozygote. Overdominance underlies the unusually high
frequency of sickle cell anemia, thalassemia and G6PD deficiency in regions
where malaria is endemic, as the corresponding mutations in the α-, β-globin and
G6PD genes confer resistance to the parasite in heterozygotes.
B. GENOMIC RECONSTRUCTION OF OUR EVOLUTIONARY HISTORY
Amongst the most exciting outcomes of the accumulation of increasing amounts
of sequence information for humans and other organisms are the opportunities it
provides to explore our evolutionary history. Comparing the genomic
sequences of distant organisms and uncovering the remarkable commonalities in
terms of gene organization and content demonstrates, beyond any reasonable
doubt, that all organisms living on planet Earth descend from a common ancestor.
It has clarified the relationship between humans and their closest primate relatives,
the orangutan, gorilla and chimpanzees. It has demonstrated that the cradle
of humanity lies in Africa. More recently, it has shown that part of our genome
descends from other hominins with whom our ancestors cohabited. It allows for
the study of the relationships between populations, including the determination of
one’s geographical origin with remarkable precision.
The African Eve. For some time now, it has been possible to determine the
partial if not complete sequence of the mitochondrial genome from individuals
that represent as broad a panel of ethnic groups as possible. Using a variety of
methods (including neighbor joining, parsimony methods and maximum
likelihood methods), it is possible to generate a “gene tree” that attempts to
reconstruct the most likely evolutionary history of the corresponding sequences
and hence, indirectly, of the corresponding ethnic groups. Assuming that
substitutions accumulate at a constant rate (“molecular clock”), the time that has
elapsed since the divergence of the different “leaves” in the tree can be inferred
from the lengths of the branches that separate them. The molecular clock is
either calibrated indirectly using indications from the fossil record about the
time of specific splits in the tree, or directly using the rate of de novo mutation
estimated from sequencing pedigrees.
When applied to DNA samples from “anatomically modern humans” (AMH)
representing a broad panel of ethnic origins, the approach generated a tree for
which the majority of “thick” branches carried leaves corresponding to African
samples only. The deepest “split” separated the click-speaking San
hunter-gatherers from the other AMH. Its timing was recently re-evaluated at ∼
250,000-300,000 years ago. All non-African samples in essence clustered on one
“thick” branch only. When compared to samples originating from different parts
of Africa, the mitochondrial genomes of Europeans, Asians and Amerindians are
very similar to each other. The MRCA of all non-African samples was recently
re-estimated to have lived ∼100,000 years ago. These findings provided strong
support for the Recent African Origin (RAO) model for the origin of AMH.
According to this model, AMH first appeared on the African continent at least
300,000 years ago. Europeans, Asians and Amerindians derive from an
East African subpopulation that diverged from other Africans ∼100,000 years ago,
traversed the Horn of Africa and/or the Levantine corridor, and progressively
migrated into the rest of the world. This “exodus” was accompanied by a strong
reduction in genetic variation. The migrant population appears to have gone
through a “genetic bottleneck” of ∼10,000 individuals or fewer. The split
between the branches leading respectively to Europeans and Asians is now
dated at ∼60,000 years ago. The alternative (yet less well supported) model for
the origin of AMH is the model of multiregional evolution (MRE). According to
MRE, modern features emerged independently in different continents from
distinct populations of archaic humans that lived there (and for which there is
fossil and archeological evidence), and were then combined by interbreeding to
form AMH.
Admixture with archaic hominins: Neanderthal and Denisovan uncles. It has
become possible to extract DNA from fossils that are ∼100,000 years old and to
subject this DNA to NGS analysis. By doing so, the complete genomic sequence of
several Neanderthal individuals has recently been generated. Applying the same
approach to a small finger bone found in a cave in Denisova (Altai
Mountains, Russia) indicated that it belonged to a distinct species of archaic
hominins, hence called Denisovans. Comparing the Neanderthal and Denisovan
DNA with that of AMH indicated that Neanderthals and Denisovans split
∼200,000 years ago, while AMH split from Neanderthals and
Denisovans ∼500,000 years ago. Fossil and archeological records
clearly indicate that Neanderthals and probably Denisovans still roamed the
Eurasian continent ∼40,000 years ago. When spreading across Eurasia, the AMH
diaspora may have encountered these and maybe other archaic hominins. Hence,
the question that has intrigued paleontologists is whether these species interbred.
To address this question, geneticists have so far mainly used the test called the
“D statistic”. The D statistic compares the genomes of two AMH (e.g. H1 = African
and H2 = European individual), of an archaic hominin (X; can be Neanderthal or
Denisovan), and of an outgroup (e.g. C = chimpanzee). It counts the number of
“BABA” ($n_{BABA}$) and “ABBA” ($n_{ABBA}$) occurrences. Sites are selected where the
archaic genome (X) differs from the chimpanzee (C) genome and from one of the
AMH genomes. The residue shared by C and one of the AMH genomes is considered
ancestral (“A”), the other derived (“B”). Writing the alleles in the order H1, H2,
X, C, “BABA” sites are those where H1 shares the derived allele with X, and
“ABBA” sites those where H2 does. One then compares the number of occurrences
where it is H1 versus H2 that shares the derived allele with X using
$D = (n_{BABA} - n_{ABBA}) / (n_{BABA} + n_{ABBA})$. A priori, if H1 and H2 are equally related to X,
D should not deviate significantly from 0. However, what emerged from these
analyses is that Europeans and Asians are more closely related to Neanderthals
than are Africans, and that Melanesians are more closely related to Denisovans
than all other AMH. From this, it was inferred that 1-4% of the genome of
Europeans and Asians is derived, by interbreeding, from Neanderthals, and that
3.5% of the genome of Melanesians is derived, by interbreeding, from
Denisovans! Specific Neanderthal- and Denisovan-derived genome segments (or
haplotypes, see hereafter) present in our genome have now been identified and
confirm this admixture hypothesis. What phenotypic properties, if any, are
determined by the DNA that was inherited from archaic hominins remains
unknown, but is a topic of great interest. A scenario that now emerges is that
cohabitation with other hominins might have been a common condition for our
AMH ancestors, both in and out of Africa, and that interbreeding may have been a
part of life.
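The counting procedure behind the D statistic can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each site is represented as a tuple of alleles in the order (H1, H2, X, C), only biallelic sites are considered, the function name `d_statistic` and the toy data are hypothetical, and no significance test (which a real analysis would require) is included.

```python
def d_statistic(sites):
    """Compute D = (nBABA - nABBA) / (nBABA + nABBA).

    sites: iterable of (h1, h2, x, c) alleles, where h1 and h2 are two modern
    human genomes, x the archaic genome and c the outgroup (chimpanzee).
    The outgroup allele is taken as ancestral ("A"), the archaic one as
    derived ("B"); sites where x equals c are uninformative.
    Returns (D, nBABA, nABBA).
    """
    n_baba = n_abba = 0
    for h1, h2, x, c in sites:
        if x == c:
            continue  # archaic genome carries the ancestral allele
        if h1 == x and h2 == c:
            n_baba += 1  # H1 shares the derived allele with X
        elif h2 == x and h1 == c:
            n_abba += 1  # H2 shares the derived allele with X
        # other patterns (both or neither share it, or a third allele) are skipped
    total = n_baba + n_abba
    if total == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (n_baba - n_abba) / total, n_baba, n_abba

# Toy data (illustration only): 2 BABA sites, 6 ABBA sites, 2 uninformative sites.
toy_sites = (
    [("G", "A", "G", "A")] * 2    # BABA
    + [("A", "G", "G", "A")] * 6  # ABBA
    + [("G", "G", "G", "A")]      # both humans derived: skipped
    + [("A", "A", "A", "A")]      # no variation: skipped
)
d_value, n_baba, n_abba = d_statistic(toy_sites)
```

With the sign convention of the formula in the text, an excess of sites where H2 shares the derived allele with X yields a negative D, and an excess for H1 a positive D.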
C. THE HAPMAP AND 1,000 GENOMES PROJECTS
Linkage disequilibrium. When simultaneously considering two genetic variants,
say A and B, characterized respectively by alleles A, a and B, b, one can
distinguish four allelic combinations or “haplotypes”: AB, Ab, aB, and ab. Each
individual inherits two such haplotypes, one from the father, the other from the
mother, to generate a so-called “diplotype”. A double heterozygous individual
(genotypes Aa and Bb) can thus have two distinct diplotypes (AB/ab or Ab/aB,
where the “/” separates the two composite haplotypes) depending on the
haplotypes inherited from his or her parents. If the genotypes and diplotypes of a
sample of n individuals are known, one can measure allelic ($p_A$, $p_a$, $p_B$, $p_b$) as
well as haplotype ($p_{AB}$, $p_{Ab}$, $p_{aB}$, $p_{ab}$) frequencies in the sample. If the haplotype
frequencies do not differ significantly from the product of the corresponding
allelic frequencies (e.g. $p_{AB} \approx p_A p_B$), the two variants are said to be in linkage
equilibrium. If, on the contrary, the haplotype frequencies differ significantly
from the product of the corresponding allelic frequencies, the two variants are
said to be in linkage disequilibrium (LD) by an amount $D = p_{AB} - p_A p_B$. D fully
characterizes the LD between two bi-allelic variants, i.e.
$D = p_{AB} - p_A p_B = -(p_{Ab} - p_A p_b) = -(p_{aB} - p_a p_B) = p_{ab} - p_a p_b$ (Table 2).
LD is more often quantified using
“normalized” measures of linkage disequilibrium that are derived from D, but
have the useful feature that they range between 0 and 1, facilitating comparison
between pairs of genetic variants. $r^2$ corresponds to $D^2 / (p_A p_a p_B p_b)$. It has a value
of 1 when the frequency of two haplotypes is 0; a situation called perfect LD (e.g.
A is only associated with B, and a only with b). D’ corresponds to D divided by
the most extreme value (of the same sign) it could have given the corresponding
allelic frequencies. It reaches 1 when at least one haplotype frequency is 0; a
situation called complete LD (e.g. A is only associated with B, although a can also
be associated with B).
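The three LD measures defined above can be computed from the four haplotype frequencies. The Python sketch below is a minimal illustration; the function name and the example frequencies are hypothetical, and the bounds used for D’ are the standard ones implied by the requirement that no haplotype frequency be negative.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, r2, D_prime) for two biallelic variants.

    Arguments are the frequencies of haplotypes AB, Ab, aB and ab.
    """
    if abs(p_AB + p_Ab + p_aB + p_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab
    p_a = 1.0 - p_A
    p_B = p_AB + p_aB
    p_b = 1.0 - p_B
    denominator = p_A * p_a * p_B * p_b
    if denominator == 0:
        raise ValueError("both variants must be polymorphic")
    d = p_AB - p_A * p_B
    r2 = d * d / denominator
    # Most extreme value D could take, with the same sign, given allele frequencies.
    if d >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(d) / d_max
    return d, r2, d_prime

# Perfect LD: only AB and ab haplotypes exist.
perfect = ld_measures(0.3, 0.0, 0.0, 0.7)
# Complete but not perfect LD: A only occurs with B, but a occurs with B and b.
complete = ld_measures(0.2, 0.0, 0.3, 0.5)
# Linkage equilibrium: haplotype frequencies equal products of allele frequencies.
equilibrium = ld_measures(0.25, 0.25, 0.25, 0.25)
```

In the “complete” example D’ equals 1 while r² is well below 1, which illustrates why the two normalized measures are not interchangeable.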
When a mutation generating a new allele (say A derived from a) occurs, the new
A allele will only exist in association with the alleles at neighboring variants
characterizing the chromosome upon which the mutation occurred (say allele B,
for a variant with a B and a b allele). Thus, initially A only occurs on
chromosomes carrying B, although aB haplotypes obviously also exist in the
population. Initially, there is therefore complete LD between the A and B variants.
However, as the new AB haplotype segregates in the population it may
sometimes generate new Ab haplotypes, if paired with an ab haplotype at
meiosis, and provided that a recombination occurs between the two variants.
One can show that the LD, measured by D, will decay at every generation by an
amount $D\theta$, where $\theta$ is the recombination rate between the two variants. Thus,
recombination tends to break down the LD that resulted from the process of
mutation. For distant variants (large $\theta$), the LD will ultimately disappear. For
closely “linked” variants, however, “random drift” will counteract the effect of
recombination, resulting in a steady state degree of LD with expected equilibrium
value of:
$$r^2 = \frac{1}{4N_e\theta + 1}$$
As a consequence, variants that are very closely located in the genome tend to be
in LD with each other. The distance over which significant LD is detectable
depends on the effective population size (i.e. $N_e$). For species with large effective
population size (such as Drosophila), LD only extends over hundreds of base
pairs. For species with small effective population size (such as human), LD
extends over tens of thousands of base pairs.
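The decay of LD through recombination and its expected equilibrium value under drift can be evaluated numerically. The Python sketch below is a minimal illustration of the two relationships given above; the function names and parameter values are hypothetical, and the decay function deliberately ignores drift and new mutation.

```python
def ld_after_generations(d0, theta, generations):
    """LD remaining after a number of generations of recombination.

    Each generation D decreases by D * theta, so D_t = D_0 * (1 - theta)**t.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination rate must lie between 0 and 0.5")
    return d0 * (1.0 - theta) ** generations

def expected_r2(ne, theta):
    """Expected equilibrium LD, r2 = 1 / (4 * Ne * theta + 1)."""
    return 1.0 / (4.0 * ne * theta + 1.0)

# Illustrative comparison of a small and a large effective population size
# for the same recombination rate between two variants (values are arbitrary).
theta_example = 2.5e-5
r2_small_ne = expected_r2(10_000, theta_example)
r2_large_ne = expected_r2(1_000_000, theta_example)
```

For the same recombination rate, the expected r² is much lower in the larger population, which is the reason given in the text for LD extending over shorter distances in Drosophila than in humans.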
Haplotype blocks and recombination hotspots. LD has been measured
systematically for large numbers of variants (particularly SNPs; including as part
of the HapMap project, see hereafter) in different human populations. One of
the striking features that emerged from these studies is that LD doesn’t decay
monotonically with distance, as expected from the theory described above, but
rather in a “step-wise” fashion. The genome appears to be organized in blocks.
Variants within a block are in high LD with each other, but not (or much less)
with variants belonging to neighboring blocks. The typical size of such blocks in
humans is of the order of 25 Kb (thus quite similar in size to the average gene,
although there is no obvious coincidence between block and gene limits). A
block may span hundreds of variants. Although these could in theory form an
enormous number of haplotypes ($2^n$, where n is the number of SNPs in the block),
five to ten haplotypes per block typically account for >95% of the chromosomes
in most human populations. The corresponding blocks are typically referred to as
“haplotype blocks”.
The block-like structure of the human genome (as of the genomes of other
species) results from the fact that recombination events tend to cluster within
“recombination hotspots”, which mark the boundaries between adjacent
haplotype blocks. Recombination hotspots harbor short sequence motifs that
are recognized by the zinc finger domain of PRDM9, a master regulator of
meiotic recombination in many mammals. Recombination hotspots evolve
rapidly; as an example, the human and chimpanzee recombination landscapes
have completely diverged.
The HapMap project. In 2002, an international collaboration was launched to
identify all the common variants, as well as the haplotypes they form, in three
human populations: Africans, Asians and Europeans. ∼300 individuals were
genotyped for more than 5 million common variants, and linkage disequilibrium
patterns were analyzed across the genome. One of the main drivers of the HapMap
project was to provide the information needed to develop SNP arrays that would
effectively “tag” a large proportion of common variants. The private sector,
particularly Affymetrix and Illumina, exploited this information and rapidly
offered arrays allowing for cost-effective genotyping of hundreds of thousands to
millions of common SNPs. These have been used very extensively to perform
genome-wide association studies or GWAS, with the aim of detecting genetic risk
factors for nearly all common complex diseases (see hereafter). Many of these
efforts rested on the “Common Disease Common Variant” (CDCV) hypothesis.
According to this hypothesis, inherited predisposition to common complex
diseases is due to specific combinations of common risk alleles with individually
small effects at many risk loci. The alternative hypothesis, referred to as the
“Common Disease Rare Variant” (CDRV) hypothesis, postulates that inherited
predisposition to common complex diseases involves rare alleles with larger
effects at fewer loci (for a given individual).
The 1,000 Genomes project.
3. The morbid human genome
A. NEUTRAL VERSUS FUNCTIONAL VARIANTS – GERMLINE VERSUS SOMATIC MUTATIONS
As mentioned in the previous chapter, ∼10 million common genetic variants and
many more low frequency and rare variants segregate in the human population.
Most of these probably have no or virtually no (see hereafter) effect on
phenotype. However, a minority of them do. These include (i) “coding variants”
that affect the sequence of the protein that is encoded by the affected genes, as
well as (ii) “regulatory” variants that affect gene switches and thereby the
expression profile of the corresponding gene(s). Coding variants primarily
include missense, nonsense (“stop gains”), splice site and frame-shift variants, and
large insertion-deletions. Synonymous variants affect the coding parts of genes
without changing the amino-acid sequence. They will usually be harmless but
may sometimes affect splicing if they fall into “splicing enhancers”. Variants that
affect gene function and hence phenotype may exceptionally confer a selective
advantage to their carriers. Usually, however, they will have a negative impact
on phenotype and cause disease (they are therefore referred to as “causative
variants”). It is worth noting that some variants may be advantageous in some
circumstances and deleterious in others. As an example, variants that affect salt
retention may have been advantageous when salt was scarce but presently
cause hypertension in industrialized societies where salt is consumed in excess
(cf. the thrifty gene hypothesis for diabetes).
Genetic variants that are said to segregate in the population are inherited by
offspring from their parents via either the sperm cell or the oocyte (or both).
The transmitting parent(s) will typically have inherited the variant themselves
from their parents. Exceptionally, a variant may be transmitted by a parent to its
offspring via the sperm cell or the oocyte, while the transmitting parent did not
inherit the corresponding variant from either of its parents. This means that the
new variant appeared by the process of “de novo mutation” in the germ-line of
the transmitting parent. As mentioned before, sperm cells typically carry of the
order of 60 such de novo mutations, while oocytes carry of the order of 30.
Depending on when the de novo mutation occurred, the transmitting parent may be
characterized by variable levels of “mosaicism”. If the mutation occurred very
late during spermatogenesis, it may only be detected in very few sperm cells. If
one genotypes bulk sperm DNA, one will not detect the mutation as it represents
too small a fraction of the total sperm DNA. If the mutation occurred earlier in
the development of the germ-line, a substantial proportion of gametes may carry
the mutation, leading to detectable levels of germ-line mosaicism. If the
mutation occurred very early in the development of the parent (i.e. prior to the
formation of the primordial germ cells), the de novo mutation may not only be
detectable in the germ-line, but also in somatic cells. The individual will be both
a germ-line and a somatic mosaic for the corresponding mutation.
Errors in DNA replication and repair that occur during mitosis will also generate
de novo mutations in somatic cells that do not contribute to the germ-line. Such
somatic mutations will never be transmitted to the next generation. There are so
many cells and cell divisions that virtually all possible mutations must have
occurred somewhere in our body. If they occurred early enough during
development, or if they occur in an actively dividing tissue, they may generate
detectable levels of somatic mosaicism. Somatic mutations may affect the health
of cells, and contribute to disease. The best-known example of the pathogenic
effects of somatic mutations is cancer. Cancers are primarily due to the
accumulation of somatic mutations that perturb “cancer drivers”, whether
recessive loss-of-function mutations in anti-oncogenes, or dominant
gain-of-function mutations in oncogenes.
Inherited diseases can be subdivided into monogenic diseases, where variants at
one single gene explain all or nearly all phenotypic differences, and polygenic (also
called complex or multifactorial) diseases that depend on a large number of
genetic risk variants and environmental risk factors. An additional class of
oligogenic diseases is sometimes considered, involving a small number (> 1) of
genes. This terminology is sometimes used when studying modifier genes, which
may affect the expressivity (e.g. severity) or penetrance of the disease.
B. MONOGENIC DISEASES
Monogenic diseases are typically severe disorders that are individually rare
conditions but cumulatively affect ∼1% of the population and are therefore an
important health concern. They include autosomal recessive, autosomal
dominant, X-linked recessive and X-linked dominant diseases. There is an
overrepresentation of consanguineous marriages amongst families suffering
autosomal recessive conditions. Autosomal dominant conditions often involve
de novo mutations for which one of the parents (usually the father) may be
germ-line mosaic. Men are more often affected by X-linked recessive conditions,
while women are more often affected by X-linked dominant conditions.
Monogenic diseases may be characterized by incomplete penetrance (i.e. not all
individuals with an “affected” genotype will suffer from the disease), and variable
expressivity (i.e. not all individuals with an “affected” genotype will suffer an
equally severe form of the disease). The penetrance may be a function of age. As
an example, the penetrance of Huntington’s disease increases with age until
reaching ∼100% at 60 years of age. The notion of phenocopy implies that very
similar conditions may involve different genes or even non-genetic causes.
Recessive diseases are typically caused by recessive “loss-of-function” (LoF)
variants in essential genes. As their name implies, LoF variants destroy the gene
that they affect. LoF variants are strongly enriched in stop-gain, splice site and
frame-shift variants and in large deletions. Not all LoF variants will cause
disease in homozygotes. As a matter of fact, we all carry of the order of 120 LoF
variants, of which only ∼5 will cause disease. To cause a severe disease in
homozygotes, LoF variants need to damage essential genes. Approximately 30%
of our genes are thought to be essential, i.e. we need at least one intact copy of
such genes to survive. As the majority of LoF variants are very rare in the
population, homozygosity (or compound heterozygosity) for LoF variants at any
of our ∼7,500 or so essential genes is relatively rare (of the order of 1%), unless
the parents are closely related (i.e. consanguineous marriages). A majority
(∼4.5/5) of LoF variants in essential genes is thought to cause embryonic
lethality, while the remainder (∼0.5/5) will cause a “detectable” recessive defect
in at least some homozygous individuals. Recessive defects include the “inborn
errors of metabolism”, a term coined by Garrod. In humans, monogenic diseases are
typically characterized by allelic heterogeneity, with a few relatively common
and many rare alleles. Most monogenic diseases are very rare conditions, with
few exceptions. One of these is cystic fibrosis, for which 5% of Europeans are
carriers. It has been postulated that this may be due to “heterozygous
advantage”: carriers may be more resistant to diseases involving loss of body
fluids, typically diarrhea. Other well-known examples of “heterozygous
advantage” are the increased resistance to malaria of carriers of causative
variants for sickle-cell anemia and thalassemia.
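
The link between carrier frequency, disease frequency and consanguinity can be
sketched numerically. The sketch below is not part of the original argument: it
assumes Hardy-Weinberg proportions, pools all causative alleles of a gene into
one allele class of frequency q, and uses the standard inbreeding coefficient F
(1/16 for the child of first cousins). The function names are illustrative.

```python
import math

def allele_freq_from_carrier_freq(carrier_freq: float) -> float:
    """Solve 2*q*(1-q) = carrier_freq for the rarer allele frequency q."""
    # Smaller root of the quadratic 2q^2 - 2q + carrier_freq = 0.
    return (1.0 - math.sqrt(1.0 - 2.0 * carrier_freq)) / 2.0

def affected_freq(q: float, inbreeding: float = 0.0) -> float:
    """Expected frequency of homozygotes for an allele class of frequency q.

    inbreeding is the inbreeding coefficient F of the child
    (0 for unrelated parents, 1/16 for first-cousin parents).
    """
    return q * q + inbreeding * q * (1.0 - q)

# 5% carriers, the figure quoted above for cystic fibrosis in Europeans.
q = allele_freq_from_carrier_freq(0.05)
random_mating = affected_freq(q)
first_cousins = affected_freq(q, inbreeding=1 / 16)
print(f"allele frequency q = {q:.4f}")
print(f"affected, unrelated parents:    1 in {1 / random_mating:.0f}")
print(f"affected, first-cousin parents: 1 in {1 / first_cousins:.0f}")
```

Under these assumptions the homozygote frequency rises several-fold for
first-cousin parents, which is the enrichment of consanguineous marriages
mentioned for autosomal recessive conditions.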
Dominant diseases typically result from “gain-of-function” mutations (including
constitutively activated receptors) or large deletions. De novo germ-line
mutations (often with mosaicism) underlie a substantial proportion of cases with
unaffected parents. Some dominant conditions involve “first hit” germ-line
LoF variants that are transmitted by parents to affected offspring, combined with
“second hit” somatic mutations that lead to the loss of the remaining allele in the
affected tissue. The second hit may be a de novo somatic LoF mutation in the
remaining allele, or “loss-of-heterozygosity” that can result from various
chromosomal aberrations (mitotic recombination, chromosome loss, ...).
Examples of such disorders include childhood retinoblastoma and specific forms
of venous malformations, including multiple sporadic venous malformations.
Between ∼1985 and ∼2005, genes (and causative variants therein) underlying
monogenic diseases were identified by the tedious process of “positional
cloning”. The culprit genes were first mapped by linkage analysis using
genome-wide panels of microsatellite or SNP markers in affected families. Whenever
possible, the initial mapping was followed by fine-mapping using linkage
disequilibrium information (akin to association mapping; see hereafter). Cloning
and sequencing of the “critical interval” then led in some instances to the
identification of the causative gene and mutations. Despite the tedious nature of
this approach, the causative gene and mutations had been identified for ∼2,000
monogenic conditions by 2003.
The advent of Next Generation Sequencing, and the ensuing possibility to now
sequence the whole genome at ∼30-fold depth for ∼1,000 Euros, has dramatically
improved this situation. Sequencing the whole genome of a few affected
individuals and some relatives will, in ∼50% of cases, lead to the identification
of the causative gene and mutations. The key source of information is the finding
of rare, severe LoF variants in the same gene in sufficient independent cases such
that the probability of such a finding occurring by chance alone becomes very low,
making the finding statistically significant. The corresponding statistics have to
account for the unexpected finding that all of us carry LoF variants in
approximately 120 genes (with both alleles affected for 20 of these), hence an
order of magnitude more than the expected number of “lethal equivalents” (∼5)
carried by each individual (vide supra). By now, causative mutations in ∼3,000
genes underlying > 4,000 monogenic conditions have been discovered, and these
numbers are increasing rapidly.
Discovering the causative genes and mutations underlying monogenic diseases
informs about gene function, sometimes allows for the development of targeted
therapies, and always offers the somewhat ethically controversial but widely
applied solution of prenatal diagnosis.
C. POLYGENIC OR COMMON COMPLEX DISEASES
Most heritable phenotypes in humans are not monogenic. These include traits
like height, weight and IQ, but also predisposition to common complex diseases
such as hypertension, diabetes (type I and II), cancer, inflammatory bowel
disease, schizophrenia, autism, etc. How do we know that these phenotypes are
heritable? First indications come from the “unusual” resemblance between
parents and offspring. For quantitative traits with continuous Gaussian
distribution (including height, weight, IQ and blood pressure), the “regression”
of the phenotype of the offspring on the mean phenotype of the parents is a
measure of the heritability (h²), i.e. the proportion of the trait variance that is
due to genetic differences between individuals. For binary traits, such as
common complex diseases (involving cases and controls), the relative risk for
relatives of affected individuals (e.g. sibs) when compared to the incidence in the
general population is used as a measure of the importance of genetics. A
problem in human genetics is that relatives (e.g. parents and their offspring) not
only share genes but also a common environment. It is therefore difficult to
evaluate to what extent the observed resemblance between relatives results
from the sharing of genes rather than environment. To overcome this issue,
human geneticists study adoptees or twins, or rely on novel molecular
approaches. By studying the resemblance of parents and offspring that have
been raised in a different environment one can to some extent overcome the
problem of the confounding of genes and environment. An estimate of
heritability can also be obtained from the increase in the rate of concordance
between monozygotic twins when compared to dizygotic twins. One can also
estimate the heritability from the higher phenotypic correlation observed for
sibs that have a higher degree of genome sharing (full-sibs share on average 50%
of their alleles, but there is variation around this mean). To estimate their
heritability, binary traits are often modeled as reflecting the distribution of an
unobserved, underlying “liability” with Gaussian distribution, and a threshold
value such that individuals with a liability value above the threshold are affected
and others not. It appears from these studies that traits such as height, weight
and IQ, as well as the liability for many complex diseases, are unexpectedly
“heritable”. The heritability of height for instance is estimated to be > 85%,
while the heritability of Crohn’s disease is > 50%.
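
Three of the estimators described above can be sketched in a few lines. This is
an illustration under stated assumptions, not the procedure used in any
particular study: the parent-offspring estimate is taken as the least-squares
slope of offspring on mid-parent values, the twin estimate uses Falconer's
classical formula h² = 2(r_MZ - r_DZ) on correlations (the text speaks of
concordance rates, which require the liability model to be converted), and the
liability threshold is placed on a standard normal scale. All numbers are
invented.

```python
from statistics import NormalDist, mean

def regression_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def h2_midparent(midparent, offspring):
    """Heritability estimated as the slope of offspring on mid-parent values."""
    return regression_slope(midparent, offspring)

def h2_falconer(r_mz, r_dz):
    """Falconer's twin estimate: twice the excess of the MZ over the DZ correlation."""
    return 2.0 * (r_mz - r_dz)

def liability_threshold(prevalence):
    """Threshold on a standard normal liability scale such that a fraction
    `prevalence` of individuals lies above it (i.e. is affected)."""
    return NormalDist().inv_cdf(1.0 - prevalence)

# Toy data, invented for illustration only (heights in cm).
midparent = [160.0, 165.0, 170.0, 175.0, 180.0]
offspring = [162.0, 165.5, 170.0, 173.5, 179.0]
print("h2 from mid-parent regression:", round(h2_midparent(midparent, offspring), 2))
print("h2 from twin correlations:    ", round(h2_falconer(r_mz=0.90, r_dz=0.50), 2))
print("liability threshold for 1% prevalence:", round(liability_threshold(0.01), 2))
```

None of these estimators removes the confounding of genes and shared
environment discussed above; they only formalize the resemblance between
relatives.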
Because these traits, including predisposition to common complex
disease, are at least in part heritable, it is possible to use “positional
cloning” strategies to identify at least some of the underlying causative genes and
variants. There are at least two reasonable motivations to do this. The first is to
gain a better understanding of the molecular mechanisms underlying the
corresponding diseases. This may lead to the identification of new targets for
drug development or to the “repositioning” of existing drugs, at a time when the
pharmaceutical industry faces major difficulties in identifying new effective
drugs. The second is the perspective of “predictive medicine”, i.e. the
identification, prior to disease onset, of individuals that are at higher risk of
developing the disease, in order to encourage them to more actively adopt
preventive measures to reduce the risk.
Initial attempts to find genes underlying common complex diseases based on
specific implementations of family-based approaches (such as the affected sib
pair method) met with very little if any success. An alternative approach,
dubbed Genome Wide Association Study (or GWAS), was proposed and
predicted to have increased detection power. It was based on performing
association studies in case-control cohorts using large numbers of genetic
variants covering the entire genome. An association study in a case-control
cohort is extremely simple, in principle, and consists in comparing the allelic
frequency of a given variant between a group of cases and a group of properly
matched (especially for ethnicity) controls. Finding a difference that is
statistically significant suggests either that the variant is a causative variant for
the studied disease, i.e. that it perturbs a gene thereby increasing disease risk
(= direct association), or, which is more likely, that the interrogated variant is
“associated with” or in linkage disequilibrium (LD) with one or more causative
variants in its vicinity (= indirect association). In humans, LD typically extends
over distances of the order of tens of Kb (cf. Chapter 2). Ideally, one would want
to interrogate all the existing genetic variants in the genome to perform GWAS,
i.e. sequence the genome of all cases and all controls. This will remain
impractical for some time. In the meantime, geneticists involved in GWAS typically
interrogate 300,000 to 1 million common SNPs that are scattered across the genome
and jointly “tag” the majority of common haplotypes. Their hope is that this set of
SNPs will allow them to detect common causative variants mostly by indirect
association. Interrogating 1 million SNPs at once in a cost-effective manner has
become possible thanks to the development of “SNP microarrays” mainly by
Affymetrix first, and then by Illumina. Information about state-of-the-art
technologies underlying SNP genotyping with microarrays can be found at
http://www.illumina.com/technology/beadarray-technology/infinium-hdassay.html.
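
The comparison of allelic frequencies between cases and controls described
above can be sketched as a chi-square test on a 2x2 table of allele counts.
This is a minimal illustration with invented counts; real GWAS pipelines
typically use regression models with covariates (for instance to correct for
population structure), which this sketch does not attempt.

```python
import math

def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square statistic (1 df) and p-value for a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical counts: 1,000 cases and 1,000 controls (2,000 alleles each),
# with the tested allele at 30% in cases and 25% in controls.
chi2, p = allelic_chi2(600, 1400, 500, 1500)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Each individual contributes two alleles, so the table holds twice as many
observations as there are individuals; treating them as independent assumes
Hardy-Weinberg proportions in both groups.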
A first wave of GWAS was published starting in 2006. These studies would typically
involve of the order of 1,000 cases and 1,000 controls. The results would be
summarized as “Manhattan plots”, showing, for the entire genome at once, the
strength of the association (expressed as log10(1/p), where p is the probability of
obtaining the observed association by chance alone) as a function of the genomic
position of the interrogated SNPs. To avoid false positive reports, the field
imposed very stringent significance thresholds on itself. To account for the
large number of tests performed, the threshold to declare statistical significance
was set at p = 10⁻⁸ or log10(1/p) = 8. Moreover, publication in high profile journals
required replication of the initial finding (in the discovery cohort) in an
independent replication cohort of at least equal size. Initial GWAS typically
reported 1 to 2 confirmed, genome-wide significant risk loci, which was
considered a major breakthrough. However, the odds ratios (used as a surrogate
for relative risk)¹ of the most strongly associated variants were typically of the
order of 1.2-1.5 only, which is very low. As a matter of fact, the power to detect
such small effects was shown to be very low given the size of the utilized cohorts,
indicating that many more effects were likely to exist (if you detect one signal for
which you have a detection power of 10%, it suggests that ∼10x more such effects
went undetected).
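
The quantities used in this paragraph can be made concrete with a short sketch.
The definitions of relative risk and odds ratio follow the footnote; the
genotype-specific risks are invented, and the helper names are illustrative
only.

```python
import math

def relative_risk(p_sick_ab, p_sick_aa):
    """Risk of disease for genotype AB divided by the risk for genotype AA."""
    return p_sick_ab / p_sick_aa

def odds_ratio(p_sick_ab, p_sick_aa):
    """Odds of disease for genotype AB divided by the odds for genotype AA."""
    odds_ab = p_sick_ab / (1.0 - p_sick_ab)
    odds_aa = p_sick_aa / (1.0 - p_sick_aa)
    return odds_ab / odds_aa

def is_genome_wide_significant(p_value, threshold=1e-8):
    """Compare log10(1/p) with the threshold quoted in the text (p = 1e-8)."""
    return math.log10(1.0 / p_value) >= math.log10(1.0 / threshold)

def expected_total_effects(n_detected, power):
    """If each true effect is detected with probability `power`, about
    n_detected / power true effects are expected behind n_detected signals."""
    return n_detected / power

# Hypothetical risks: 1.3% for AB carriers versus 1.0% for AA individuals.
print("relative risk:", round(relative_risk(0.013, 0.010), 3))
print("odds ratio:   ", round(odds_ratio(0.013, 0.010), 3))
print("p = 3e-9 significant?", is_genome_wide_significant(3e-9))
print("effects expected behind 1 signal at 10% power:",
      round(expected_total_effects(1, 0.10)))
```

For a rare disease the two measures are nearly identical, which is why the odds
ratio, the quantity a case-control design can estimate directly, serves as a
surrogate for the relative risk.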
This first wave of individual GWAS was therefore logically followed by a second
wave of “meta-analyses” in which individual case-control cohorts for a given
disease were merged into larger cohorts to thereby increase statistical power.
GWAS with tens to hundreds of thousands of cases and an equivalent
number of matched controls are now common. These led to the discovery of tens
of risk loci for nearly all studied diseases. As an example, more than 200 risk loci
have now been reported for Inflammatory Bowel Disease. An updated catalogue
reporting the results of all published GWAS can be found at:
http://www.ebi.ac.uk/gwas/.
It is important to realize that, with few exceptions, the causative genes and
variants remain unknown for the vast majority of risk loci (across all
diseases). The only causative genes that are known are those with strongly
associated LoF variants, which are mostly exceptions. It is indeed believed that
the majority of causative variants for common complex diseases are regulatory
variants, which are more difficult to identify, as are the genes whose expression
they perturb. GWAS typically map risk loci to chromosome regions of ∼250 Kb
encompassing ∼5 genes on average (range: 0 to > 50) and thousands of genetic
variants. Much effort is presently devoted to fine-mapping the identified risk loci
(to identify the actual causative variants) and to identifying the causative genes.
Fine-mapping relies on the use of sophisticated “multivariate” statistical
approaches. In these, the effect of each variant is evaluated conditional on the
effects of the other variants. Identifying the causative genes relies on the
integration of multiple sources of genomic information. Examples include the
search for “networks” of connected genes amongst those that map to
GWAS-identified risk loci, the use of eQTL information, and the use of epigenetic
information including HiC data. The first approach tries to find sets of genes
that are connected by co-citation, co-expression or interactome data amongst the
genes within risk loci. Various statistical approaches can be used to measure the
significance of the identified networks (i.e. what is the probability of finding such
a large and highly interconnected network by chance alone given the number of
tested genes). If such a network is found, it increases the probability that the
genes composing it are the actual causative genes. The second approach uses
eQTL information, where eQTL stands for “expression QTL”. Transcriptome
data are generated for large numbers of individuals in multiple cell types. The
same individuals are genotyped for SNPs spanning the entire genome
¹ The Relative Risk corresponds to the ratio between the probability (risk) to be sick when
having genotype AB and the probability (risk) to be sick when having genotype AA, where A is
the reference allele. The Odds Ratio corresponds to the ratio between two odds. The first
«odds» is the probability to be sick when having genotype AB divided by the
probability to be healthy when having genotype AB (which is 1 minus the probability to be sick). The
second «odds» is the probability to be sick when having reference genotype AA divided by the
probability to be healthy when having reference genotype AA.
(same SNP arrays as those used for disease GWAS). One can then, for each gene,
test whether the genotype (say AA, AG and GG) at a given SNP affects its
expression level. In other words, does the mean expression level of the tested
gene differ between individuals sorted by SNP genotype? One typically first tests
the effect of SNPs on genes that are located in their vicinity, say not further than
500 Kb apart. The underlying hypothesis is that the interrogated SNP, or more
likely a SNP in LD with it, affects a genetic switch controlling that gene. If such an
effect is found, it is dubbed “cis-eQTL”. One can also test whether SNPs affect the
expression of genes that are located further away or even on other chromosomes.
The idea underlying this test is that the SNP or a variant in LD with it affects a
trans-regulator of the target gene such as a transcription factor or a miRNA. If
such an effect is found, one talks about a “trans-eQTL”. To find causative genes in
GWAS-identified risk loci one searches for cis-eQTL that are determined by
causative variants for the disease. If the same variant is independently
associated with both the disease and a cis-eQTL effect, it is reasonable to speculate
that the effect on disease predisposition is mediated by the observed effect on
gene expression. The third approach relies on HiC data. HiC data in essence
identify a number of “gene switches” and the genes that they regulate in a
specific cell type. If a causative disease variant identified by fine-mapping maps
to such a switch, the corresponding gene stands out as a strong candidate
causative gene. Final proof of variant and gene causality is difficult to obtain.
Introducing the candidate variants in cells using for instance CRISPR/Cas9
technology and reproducing effects on gene expression is strong evidence that
the tested variant is indeed “functional”. Proof of gene causality is typically
sought by performing a “burden test”. This is typically done by sequencing the
exons of the corresponding gene in large case-control cohorts and searching for
a differential burden of rare disruptive mutations in cases and controls. As an
example, when sequencing the NOD2 gene in ∼300 cases of Crohn’s disease and
∼300 healthy controls, Hugot et al.² observed that approximately 15% of the cases
carried rare non-synonymous variants in the NOD2 gene, while this was only
observed in 5% of healthy controls. This was very strong confirmatory evidence
that the NOD2 gene was indeed a causative gene for Crohn’s disease. The
information that is sought is independent of the primary GWAS signal, which
typically relied on the association with disease of common variants. The burden
test extracts information from low frequency and rare variants. This implies that
there is allelic heterogeneity at each one of the causative genes, which is a
reasonable, yet unproven assumption.
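
Two of the tests described in this paragraph can be sketched as follows. The
eQTL test is written as a least-squares regression of expression on allele
dosage (0, 1 or 2 copies), and the burden test as a one-sided Fisher exact test
on carrier counts. Both are simplified stand-ins chosen for illustration, not
the methods of the cited studies; the expression values are invented, and the
carrier counts (45 of 300 cases, 15 of 300 controls) are only a rounding of the
approximate percentages quoted above.

```python
from math import comb
from statistics import mean

def eqtl_slope(dosages, expression):
    """Least-squares change in expression per copy of the alternate allele.
    dosages are 0, 1 or 2 (e.g. genotypes AA, AG, GG)."""
    md, me = mean(dosages), mean(expression)
    sxy = sum((d - md) * (e - me) for d, e in zip(dosages, expression))
    sxx = sum((d - md) ** 2 for d in dosages)
    return sxy / sxx

def burden_fisher_p(case_carriers, n_cases, ctrl_carriers, n_ctrls):
    """One-sided Fisher exact p-value: probability of seeing at least
    case_carriers carriers among the cases if carriers of rare variants
    were distributed at random between cases and controls."""
    total_carriers = case_carriers + ctrl_carriers
    n = n_cases + n_ctrls
    upper = min(total_carriers, n_cases)
    favourable = sum(
        comb(total_carriers, k) * comb(n - total_carriers, n_cases - k)
        for k in range(case_carriers, upper + 1)
    )
    return favourable / comb(n, n_cases)

# Invented expression values for six individuals sorted by genotype.
slope = eqtl_slope([0, 0, 1, 1, 2, 2], [1.0, 1.2, 1.9, 2.1, 3.0, 3.2])
print("expression change per allele copy:", round(slope, 2))

# Carrier counts rounded from the approximate figures quoted in the text.
p_burden = burden_fisher_p(45, 300, 15, 300)
print(f"burden test p-value: {p_burden:.1e}")
```

A slope different from zero would still have to be tested for significance and
corrected for the number of SNP-gene pairs examined, which this sketch leaves
out.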
As mentioned above, GWAS have allowed for the identification of tens to
hundreds of risk loci for nearly all studied diseases, and this is a major
achievement. However, the relative risk conferred by individual risk alleles is
typically of the order of 1.1-1.2. This is very low. As a matter of fact, some
people have questioned the value of the GWAS findings because of these small
effects. It is important to remember, however, that it is not because the effects of
risk variants that segregate in the population are small, that pharmacological
² Hugot JP, Chamaillard M, Zouali H, Lesage S, Cézard JP, Belaiche J, Almer S, Tysk C, O'Morain CA,
Gassull M, Binder V, Finkel Y, Cortot A, Modigliani R, Laurent-Puig P, Gower-Rousseau C, Macry J,
Colombel JF, Sahbatou M, Thomas G. Association of NOD2 leucine-rich repeat variants with
susceptibility to Crohn's disease. Nature. 2001 May 31;411(6837):599-603.
intervention on the same gene or pathway cannot cause a large effect. Indeed,
the common variants that underpin the initial association have had to withstand
the effects of natural selection. Variants with large effects may have been wiped
out of the population by selection. Accordingly, rare variants in the same gene
often have much larger effects or relative risks. As an example, variants
disrupting the IL23R gene have been found by GWAS to protect against IBD but the
corresponding relative risks are small, of the order of 1.5. Yet targeting the
IL23R pathway pharmacologically appears to have a large therapeutic effect.
If one considers all identified risk variants, their corresponding relative risks and
frequencies in the population, one can calculate which fraction of the heritability
they account for. When performing this calculation for all studied diseases, what
comes out systematically is that the identified risk variants only explain a
relatively small proportion of the heritability or inherited risk, typically of the
order of 25%. This raises the question of the molecular nature of the “missing
heritability”. Several hypotheses have been proposed to account for the missing
heritability. Today, the one that is best supported by data is the
“quasi-infinitesimal” architecture of common complex diseases. This model posits
that the genetic part of the underlying liability for common complex diseases is
determined by a very large number of variants with individually very small
effects. The risk variants identified by GWAS would only be the “tip of the
iceberg”. To detect the remaining ones, one would need to study even larger
case-control cohorts. As a matter of fact, increasing the sample size always
results in the identification of more risk loci, consistent with the model. Also,
“polygenic models”, which estimate a genetic distance between pairs of
individuals based on SNP information across the entire genome and correlate
this distance with phenotypic distance, indeed support the validity of the
quasi-infinitesimal model.
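
The calculation of the fraction of heritability accounted for can be sketched
under simplifying assumptions that are not spelled out in the text: each
variant is biallelic, acts additively with an effect expressed in phenotypic
standard deviations on the liability scale, is in Hardy-Weinberg proportions
and is independent of the others. Under these assumptions a variant of
frequency p and effect beta contributes a variance of 2p(1-p)beta². The
conversion of a relative risk into such an effect is not shown, and the
architecture below is invented.

```python
def variance_explained(freq, beta):
    """Variance contributed by one biallelic additive variant: 2p(1-p)*beta^2."""
    return 2.0 * freq * (1.0 - freq) * beta ** 2

def fraction_of_heritability(variants, h2):
    """variants: iterable of (allele frequency, effect in phenotypic SD units).
    Assumes independent variants and a total phenotypic variance of 1."""
    return sum(variance_explained(p, b) for p, b in variants) / h2

# Hypothetical architecture: 100 known variants, each with frequency 0.3 and an
# effect of 0.05 SD on the liability scale, for a trait with h2 = 0.5.
known = [(0.3, 0.05)] * 100
print("fraction of heritability explained:",
      round(fraction_of_heritability(known, h2=0.5), 3))
```

Even a hundred variants of this size leave most of the heritability
unexplained, which is the situation that the quasi-infinitesimal model
accounts for with many more variants of still smaller effect.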
Another noteworthy finding from GWAS is the overlap between risk loci for
different diseases that were previously thought to be independent. There thus
appears to be a lot of pleiotropy amongst risk loci for common complex diseases.
The effects are sometimes concordant (i.e. the allele that increases risk for
disease A also increases risk for disease B), but equally commonly discordant.
If we assume that we knew all the causative risk variants for a given
common complex disease, what would the diagnostic value of such a set of
variants be? To gain some feeling for this we performed simulations using
first height as an example. Height is viewed as the paradigmatic example of a
polygenic trait in humans. Height has a heritability of ∼85%, which is
determined by thousands of genes. We assumed a population with a mean height
of 165 cm and a standard deviation of 10 cm. Thus, if you have no information
about the height of a specific individual from that population your best guess is
the mean (i.e. 165 cm), but you will on average be off by 10 cm. Imagine that
you knew all the genes that control height and knew the genotype of the
individual for these genes: you would be able to make a better guess. For height,
with its heritability of 85%, the standard error of the guess becomes ∼2.5 cm,
not bad! However, that assumes that you know all the genes involved. What if
you only knew half of the genes involved? Your standard error would then rise
from ∼2.5 to 8 cm, not good! Thus, if your grasp of the genetics of the trait
drops, your predictive ability drops even more to rapidly become poor. For most
complex diseases, the heritability is lower and the understanding of the genetics
still poor (≤ 25% of the heritability). The predictive ability is therefore very poor.
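
The shape of this argument can be reproduced with a closed-form sketch rather
than a simulation. It assumes a purely additive model in which a genetic
predictor capturing a fraction f of the heritability h² leaves a residual
standard deviation of sd × sqrt(1 - f × h²). The settings of the simulations
mentioned above are not given, so the values returned here (about 3.9 cm with
full knowledge and about 7.6 cm with half of it) are of the same order as, but
not identical to, the figures quoted in the text.

```python
import math

def prediction_sd(trait_sd, h2, fraction_known=1.0):
    """Standard deviation of the prediction error when a genetic predictor
    captures `fraction_known` of the heritability h2 (additive model)."""
    explained = h2 * fraction_known
    return trait_sd * math.sqrt(1.0 - explained)

# Height example from the text: standard deviation 10 cm, heritability 85%.
for known in (0.0, 0.5, 1.0):
    print(f"fraction of genetics known = {known:.1f} -> "
          f"prediction error SD = {prediction_sd(10.0, 0.85, known):.1f} cm")
```

Because of the square root, losing half of the genetic information costs far
more than half of the gain in precision, which is the point made above.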
Imagine that one would want to rank individuals based on their genetically
determined liability using SNP information. One could then imagine restricting
reimbursement of expensive diagnostic procedures (or imposing these tests?) to
the 10% of individuals that are predicted, on the basis of such genetic tests, to
be the most at risk. The utility of such tests can be evaluated from their
sensitivity as well as positive predictive value (or precision). The sensitivity
measures the proportion of the individuals that will develop the disease that are
found positive for the test. The precision measures the proportion of individuals
found positive for the test that will ultimately develop the disease. One can
easily show that with the present understanding of the genetic architecture of
common complex diseases, sensitivity and precision are abysmal: many
individuals that will suffer from the disease will be missed by the test, and many
of the individuals found positive would never have developed the disease. Does
that mean that such tests have no value and therefore no future? Maybe not.
Even if the predictive ability appears very poor when considered from the point
of view of the individual patient, the cost savings that may be achieved at the
population level by using this information may under certain circumstances be
considerable. Weighing the benefits for the individual against those for the
community is likely to be critical when considering the use of genomic testing
in the clinic.
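
The sensitivity and precision of such a screen can be explored with a small
simulation under the liability threshold model. Everything below is a sketch
under assumed values: a disease prevalence of 1%, a genetic score that captures
12.5% of the liability variance (25% of a heritability of 50%), and a screen
that flags the top 10% of scores. The function name and parameters are
illustrative.

```python
import random
from statistics import NormalDist

def simulate_screen(prevalence, r2, top_fraction=0.10, n=100_000, seed=1):
    """Return (sensitivity, precision) of flagging the top_fraction of
    individuals ranked by a genetic score explaining r2 of liability variance."""
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)
    people = []
    for _ in range(n):
        score = rng.gauss(0.0, r2 ** 0.5)           # known genetic part
        rest = rng.gauss(0.0, (1.0 - r2) ** 0.5)    # unknown genetics + environment
        people.append((score, score + rest > threshold))
    people.sort(key=lambda t: t[0], reverse=True)
    n_flagged = int(top_fraction * n)
    true_pos = sum(affected for _, affected in people[:n_flagged])
    n_affected = sum(affected for _, affected in people)
    return true_pos / n_affected, true_pos / n_flagged

sens, prec = simulate_screen(prevalence=0.01, r2=0.125)
print(f"sensitivity = {sens:.2f}, precision = {prec:.3f}")
```

With these assumed values most future patients fall outside the flagged group
and the large majority of flagged individuals never develop the disease, even
though the flagged group is enriched in cases relative to the population.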
D. SOMATIC VERSUS GERMLINE MUTATIONS: CANCER
The prevailing view of cancer is that it is a “genetic” disease in the sense that
what fundamentally differentiates cancerous from normal cells are
accumulations of mutations in the genome of the former.
In some rare familial cancers, such mutations may be inherited from one of the
parents. Well-known examples of such inherited cancers are familial
retinoblastoma involving mutations in the RB1 gene, hereditary breast and
ovarian cancer involving mutations in the BRCA1 and BRCA2 genes, familial
adenomatous polyposis involving mutations in the APC gene, and hereditary
non-polyposis colon cancer involving mutations in mismatch repair genes.
Familial cancers are typically transmitted in an autosomal dominant manner.
Even in these rare familial instances, cancer progression requires the
accumulation of additional somatic mutations including (but not only) in the
non-mutated allele of the corresponding genes (Knudson’s two-hit hypothesis).
Most cancer cases are “sporadic”: the relative risk of relatives is not increased.
Such sporadic cancers are thought to result from the accumulation - in specific
cell lineages - of somatic mutations in “cancer driver genes”. Each new cancer
driving mutation is thought to confer a selective advantage to the corresponding
clone in the sense that it causes the clone to either proliferate more aggressively
or disseminate in the body (metastasis). Cancers typically accumulate
mutations in “mutator” genes, which cause the somatic mutation rate to increase.
In addition to the driving mutations in oncogenes, cancers therefore typically
accumulate a large number of passenger mutations. Thus, not all mutations
detected in tumors contribute to the cancer phenotype. The genomes of
thousands of tumors have now been sequenced as part of large collaborative
projects. The comparison of the healthy genome of the patient (obtained by
sequencing DNA extracted from healthy tissue) with that of the tumor allows for
the identification of somatic mutations that have accrued in the tumor. By
comparing the lists of somatic mutations across large numbers of tumors of the
same tissue, one can identify genes that are more often affected by such
mutations than expected if the mutations occurred at random in the genome.
Thus one attempts to identify genes that are preferentially affected by somatic
mutations in tumors. This enrichment is assumed to reflect that mutations in
these genes contribute to cancer progression and that the corresponding genes
are “cancer drivers”. Hundreds of cancer drivers have been identified using this
approach. These studies have revealed the remarkable heterogeneity of cancers:
there are plenty of distinct ways for a healthy cell to become cancerous; each
tumor is nearly unique. It is possible that in the near future, the genome of
tumors will be systematically and fully sequenced in the hope of identifying the
driver genes that are “activated” by mutations in the corresponding tumor. This
will in some instances allow for personalized therapies. Examples of such
targeted therapies already exist. They include (i) trastuzumab (Herceptin),
targeting the human epidermal growth factor receptor 2 protein (HER-2), which
is expressed at high levels as a result of mutations in some breast and stomach
cancers, (ii) vemurafenib (Zelboraf), targeting a mutated form of BRAF (V600E)
which is found in some metastatic melanomas, and (iii) imatinib mesylate
(Gleevec), a tyrosine-kinase inhibitor targeting the BCR-ABL fusion protein found
in some leukemias.
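
The search for genes that are mutated more often than expected by chance can
be sketched as a binomial test. The sketch assumes a single uniform background
mutation rate per base pair and per tumor, which is a strong simplification:
actual driver-detection methods model gene-specific and context-specific
background rates. The gene length, mutation rate and tumor counts are invented.

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total

def driver_enrichment_p(n_mutated_tumors, n_tumors, gene_length_bp, mutation_rate_per_bp):
    """Probability that at least n_mutated_tumors of n_tumors carry one or more
    somatic mutations in a gene if mutations fell at random in the genome."""
    # Chance that a single tumor hits the gene at least once by chance alone.
    p_hit = 1.0 - (1.0 - mutation_rate_per_bp) ** gene_length_bp
    return binomial_tail(n_mutated_tumors, n_tumors, p_hit)

# Hypothetical gene of 3,000 coding bp, background rate of 1e-6 per bp per tumor,
# found mutated in 25 of 500 tumors.
p_driver = driver_enrichment_p(25, 500, 3000, 1e-6)
print(f"p-value for enrichment: {p_driver:.1e}")
```

In practice the test is repeated for every gene, so the resulting p-values
have to be corrected for the number of genes examined.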
Somatic mutations are not the only conceivable mechanism that may lead to
heritable perturbation of oncogenes. There is considerable evidence that
oncogenes may also be affected in tumors by “epimutations”. In such cases, their
DNA sequence is not altered but their methylation profile or local histone code
may be modified. This may affect the expression of the corresponding gene,
which may contribute to cancer progression. Another possibility, which is
generally neglected today, is the activation of positive feedback loops. What
differentiates lytic from lysogenic lambda phages is not their genome (which is
identical), but the expression or not of the lambda repressor protein. Once the
lambda repressor is expressed it will block the expression of the genes needed
for lysis while activating its own expression. It is conceivable that tumor cells
also differ from normal cells by the inherited activation or downregulation of
equivalent positive feedback loops.
Perturbations of the stroma, i.e. changes of the cellular and extracellular
microenvironment in which the tumor develops, are increasingly considered as
playing an important role in tumor progression. The most likely hypothesis is
that tumor cells accumulate mutations that allow them to send signals to their
surroundings, inducing stromal changes that in turn promote cancer growth. A
subclone of the tumor may harbor such mutations, which may ultimately benefit
other subclones without these mutations “in trans”. One can also imagine that
the signaling induces heritable epigenetic changes in the stromal cells, such that
the tumor-promoting effect of the stromal cells is maintained even in the
absence of the original inducing subclone.