Genome Biology and
Biotechnology
3. The genome structures of vertebrates
Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent
International course 2005
The Genome Sequences of vertebrates
¤ Fish genomes: “compact” vertebrate genomes
– Fugu rubripes (2002)
– Tetraodon nigroviridis (2004)
¤ Bird genome: Interesting evolutionary intermediate
– Chicken - Gallus gallus (2004)
¤ Rodent genomes: the model organism for the human
– Mouse - Mus musculus (2002)
– Rat – Rattus norvegicus (2004)
¤ Primate genomes: our closest relatives
– Chimpanzee
¤ Human genome
– Draft genome sequence (2001)
– Finished genome sequence (2004)
[Figure: vertebrate evolution tree with divergence times of 310 MY and 450 MY]
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Whole-Genome Shotgun Assembly and Analysis
of the Genome of Fugu rubripes
Aparicio et al., Science, 297, 1301-1310 (2002)
¤ Paper presents
– Low quality draft genome sequence of Fugu rubripes
– the sequence provided a valuable reference for annotating the
human and mouse genomes
• Small genome (350 Mb versus 3000 Mb)
• low repetitive DNA content
The Fugu Genome Sequence
¤ The draft sequence covers a total of 332.5 Mb
– Highly fragmented sequence (~30 Mb unassembled sequences)
– The total genome size is estimated at ~365 Mb
¤ The number of predicted genes: 31,059
– similar to the number of human genes predicted from the draft
sequence
¤ Repetitive sequences
– Density of <15% far below the 35 to 45% observed in mammals
• Transposable elements are still very active
Reprinted from: Aparicio et al., Science, 297, 1301-1310 (2002)
Protein-coding genes
¤ The gene-containing fraction is ~108 Mb (30%)
– The average gene density: one gene per 10.9 kb
– The Fugu genome is compact because introns are shorter than in
the human genome
• Genome contains ~500 large introns (> 10 kb) compared with > 12,000
large introns in human
• Genes are scaled in proportion to the compact genome size
– The number of introns is roughly the same as in human
• Both gain and loss of introns in the Fugu lineage are observed
¤ The compactness of the Fugu genome is accounted for by
– Low abundance of repeated sequences
– The small size of introns and intergenic regions
Reprinted from: Aparicio et al., Science, 297, 1301-1310 (2002)
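A back-of-the-envelope check of the figures on this slide can be sketched as follows. The slide does not say whether the percentage and the gene spacing are relative to the assembled 332.5 Mb or to the estimated ~365 Mb, so both denominators are computed here as an assumption; the function names are illustrative only.

```python
# Figures quoted on the Fugu slides (Aparicio et al., 2002).
ASSEMBLED_MB = 332.5       # length of the draft sequence
ESTIMATED_MB = 365.0       # estimated total genome size
GENIC_MB = 108.0           # gene-containing fraction
PREDICTED_GENES = 31_059


def percent(part, whole):
    """Share of `whole` taken up by `part`, in percent."""
    return 100.0 * part / whole


def kb_per_gene(genome_mb, n_genes):
    """Average spacing between genes, in kb."""
    return genome_mb * 1000.0 / n_genes


genic_pct_estimated = percent(GENIC_MB, ESTIMATED_MB)
genic_pct_assembled = percent(GENIC_MB, ASSEMBLED_MB)
spacing_assembled = kb_per_gene(ASSEMBLED_MB, PREDICTED_GENES)
spacing_estimated = kb_per_gene(ESTIMATED_MB, PREDICTED_GENES)

print(f"genic share: {genic_pct_estimated:.1f}% (estimated size), "
      f"{genic_pct_assembled:.1f}% (assembled size)")
print(f"spacing: {spacing_assembled:.1f} kb (assembled), "
      f"{spacing_estimated:.1f} kb (estimated)")
```

With the assembled length the spacing comes out near 10.7 kb per gene, and with the estimated genome size near 11.8 kb, which brackets the quoted 10.9 kb.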
Comparison of Fugu and Human Proteomes
¤ 75% of predicted human proteins have a strong match
to Fugu
Reprinted from: Aparicio et al., Science, 297, 1301-1310 (2002)
Genome duplication in the teleost fish Tetraodon
nigroviridis reveals the early vertebrate protokaryotype
Jaillon et al., Nature 431, 946 - 957 (2004)
¤ Paper presents
– High quality draft genome sequence with long-range linkage and
chromosome anchoring of Tetraodon nigroviridis
• freshwater puffer fish with the smallest known vertebrate
genome
The Tetraodon genome sequence
¤ The draft genome sequence (8.3x) spans 342 Mb
– Largest scaffolds were mapped onto the chromosomes
• Draft is much less fragmented than that of Fugu
¤ Genome landscape
– Transposable elements are very rare (<4000 copies)
• Fewer than Fugu (15% of the genome)
¤ Estimated 20,000–25,000 protein coding genes
– Very similar to the recent (2004) human gene count
– Much lower than reported for Fugu (current Fugu is also lower)
– Gene ontology (GO) classification shows only subtle differences
between fish and mammals
• Improved fish gene catalogue aids human gene predictions
Reprinted from: Jaillon et al., Nature 431, 946 - 957 (2004)
Evidence For Whole-genome Duplication
¤ Duplicated genes cluster on paralogous chromosomes
– paralogous chromosomes arising from whole-genome duplication
each contain one member of duplicated gene pairs in the same
order
Reprinted from: Jaillon et al., Nature 431, 946 - 957 (2004)
Evidence For Whole-genome Duplication
¤ Blocks of doubly conserved synteny
– The synteny map typically associates two regions in Tetraodon
with one region in human
Tni: Tetraodon
Hsa: human
Reprinted from: Jaillon et al., Nature 431, 946 - 957 (2004)
Ancestral genome of bony vertebrates
¤ The patterns of doubly conserved synteny are consistent with
– 12 ancestral chromosomes which have rearranged to form
– the present day chromosomes of human and fish
[Figure: human and fish chromosomes derived from the ancestral chromosomes]
Reprinted from: Jaillon et al., Nature 431, 946 - 957 (2004)
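The idea of doubly conserved synteny can be illustrated with a small sketch: walk along one human chromosome and group consecutive genes whose orthologues fall on the same pair of Tetraodon chromosomes. This is a simplified greedy heuristic written for illustration, not the method used by Jaillon et al.; the gene names, chromosome labels and the `min_genes` threshold are hypothetical.

```python
def dcs_blocks(human_genes, ortholog_chrom, min_genes=4):
    """Group consecutive human genes into candidate DCS blocks.

    human_genes: gene names in order along one human chromosome.
    ortholog_chrom: gene name -> Tetraodon chromosome of its orthologue.
    A block grows while its genes map to at most two Tetraodon
    chromosomes; it is kept if it has at least `min_genes` genes and
    uses exactly two chromosomes.
    """
    blocks = []
    current, chroms = [], set()
    for gene in human_genes:
        chrom = ortholog_chrom.get(gene)
        if chrom is None:          # no orthologue known: skip the gene
            continue
        if chrom not in chroms and len(chroms) == 2:
            # A third chromosome appears: close the current block.
            if len(current) >= min_genes:
                blocks.append((current, sorted(chroms)))
            current, chroms = [], set()
        current.append(gene)
        chroms.add(chrom)
    if len(current) >= min_genes and len(chroms) == 2:
        blocks.append((current, sorted(chroms)))
    return blocks


# Hypothetical toy data: nine genes along one human chromosome.
genes = [f"g{i}" for i in range(1, 10)]
orthologs = dict(zip(genes, ["Tni2", "Tni3", "Tni2", "Tni3", "Tni2",
                             "Tni1", "Tni15", "Tni1", "Tni15"]))
blocks = dcs_blocks(genes, orthologs)
for members, pair in blocks:
    print(members[0], "to", members[-1], "->", pair)
```

In this toy input the first five genes interleave between two Tetraodon chromosomes and the last four between a different pair, which is the two-to-one pattern expected after a whole-genome duplication.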
Sequence and comparative analysis of the
chicken genome provide unique perspectives on
vertebrate evolution
International Chicken Genome Sequencing Consortium, Nature 432, 695 - 716
(2004)
¤ Paper presents
– a draft genome sequence of the red jungle fowl Gallus gallus
– The first genome of non-mammalian amniote
• provides a new perspective on vertebrate genome evolution
– The evolutionary distance between chicken and human provides
an excellent signal-to-noise ratio to detect functional elements
• 310 MY since the divergence of birds and mammals
The chicken genome sequence
¤ The draft genome sequence (6.6x) spans 1,050 Mb
– Draft represents ~96% of the euchromatic part of the genome
– 23,212 chicken mRNAs and 485,000 ESTs
¤ Chicken genome is 3x smaller than mammalian
genomes reflecting substantially fewer
– interspersed repeats
• transposable elements make up <9% of the genome, markedly
lower than the 40–50% observed in mammalian genomes
– Pseudogenes
• 51 retrotransposed genes vs. > 15,000 in mammalian genomes
– segmental duplications
• Limited to very small (<10kb) intrachromosomal duplications
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Gene content of the chicken genome
¤ Protein-coding genes
– Predict 20,000 to 23,000 protein-coding genes
– Matches the current (2004) estimate for mammalian genomes
¤ Non-coding RNA genes
– 571 ncRNA genes from >20 gene families
• Fewer than in human: many ncRNA genes are pseudogenes
– Syntenic relationships for non-coding RNA genes differ from
those of protein-coding genes
• implies a novel mode of evolution for some ncRNA genes
• Only certain ncRNA genes are in regions of conserved synteny
– microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs)
found in introns of protein-coding genes
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Evolutionary conservation of gene components
¤ Sequence conservation of chicken and human orthologs
– highest in protein-coding exons
– minimal in introns
– Significant in the 5' and 3' flanking and untranslated regions
[Figure: conservation across gene regions labelled 5' UTR, exon and 3' UTR]
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Conservation of vertebrate protein content
¤ 60% of chicken genes have
a single human orthologue
– also have a single orthologue in
the Fugu genome
– Represent a conserved core
present in most vertebrates
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Conserved core orthologues in vertebrates
¤ Core orthologues conserved in vertebrates have
– Highly conserved protein sequences indicating that
• They have been subject to purifying selection
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Expansion of multigene families
¤ Expansion and contraction of multigene families were
– major factors in the independent evolution of mammals and birds
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Chromosomal dynamics in the vertebrates
¤ Maps of conserved synteny: orthologous chromosomal
segments with conserved gene order show
– slow rate of rearrangement in the human lineage
• 3-fold higher rate in the rodent lineage
– The human genome is closer to the chicken in terms of synteny
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Ancestral mammalian genome
¤ Long blocks of conserved chicken–human synteny
– Entire chromosomes
¤ Genome rearrangements
– Many intrachromosomal rearrangements
– Few translocations between chromosomes
– Chicken has a number of micro-chromosomes
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Conserved sequences in chicken and human
¤ High substitution rates between human and chicken
– Can be used to detect functionally conserved sequences
¤ 70 Mb (2.5%) of human sequence aligns with chicken
– 44% is in protein-coding regions (exons)
– 56% is non-coding: intronic (25%) and intergenic (31%)
¤ Conserved non-coding segments occur clustered and
far from genes
– Identified 57 segments with average length of 1.1 Mb
• gene poor, G+C poor and have no interspersed repeats
– the functional significance of these sequences is completely
unknown
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
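The breakdown of the aligned sequence can be checked with a few lines of arithmetic. The sketch below takes the 70 Mb total and the exon, intron and intergenic shares from the slide and converts them to megabases; note that the intronic and intergenic shares sum to 56%. The implied human genome size is derived here and is not a figure stated in the source.

```python
ALIGNED_MB = 70.0                 # human sequence aligning with chicken
ALIGNED_SHARE_OF_HUMAN = 0.025    # the "2.5%" quoted on the slide

shares = {
    "coding exons": 0.44,
    "intronic": 0.25,
    "intergenic": 0.31,
}

# Megabases of aligned sequence in each class.
aligned_mb = {name: ALIGNED_MB * share for name, share in shares.items()}
noncoding_share = shares["intronic"] + shares["intergenic"]
implied_human_mb = ALIGNED_MB / ALIGNED_SHARE_OF_HUMAN

for name, mb in aligned_mb.items():
    print(f"{name}: {mb:.1f} Mb")
print(f"non-coding share: {noncoding_share:.0%}")
print(f"implied human genome size: {implied_human_mb:.0f} Mb")
```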
Conclusion
¤ The chicken genome sequence
– is a key resource for comparative genomics
• to distinguish derived or ancient features of mammalian
biology
– mammalian innovation and adaptation
• conserved non-coding sequences in particular
– Provides a framework for discovering the functional
polymorphisms underlying
• interesting quantitative traits to further exploit the genetic
potential of the chicken
Reprinted from: ICGSC, Nature 432, 695 - 716 (2004)
Initial sequencing and comparative analysis
of the mouse genome
Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
¤ Paper presents
– Draft genome sequence of the mouse
– comparative analysis of the mouse and human genomes
• 75 MY since the divergence of rodents and primates
• The two genome sequences diverge by nearly one substitution for
every two nucleotides
– the insights that can be gleaned from the two sequences
The Mouse Genome Project
¤ The laboratory mouse is an experimental model system
– for studying human disease and mammalian biology
¤ The Mouse Genome Project
– International collaboration of centres in the US and the UK
– Adopted mixed strategy for the draft genome sequencing
• a BAC-based physical map of the mouse genome
• The initial draft genome sequence was generated by
– WGS sequencing to ~7-fold coverage
– Hierarchical shotgun sequencing of BAC clones
– The finished sequence should be completed in 2005
• using the BAC clones for directed finishing
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
The draft mouse genome sequence
¤ The euchromatic mouse genome is estimated at ~2.5 Gb
– The draft genome sequence covers ~96% of the genome
¤ Generation of the draft genome sequence
– Sequencing
• 41.4 million paired-end sequence reads derived from various clone types
– Assembly
• represents ~7.7-fold sequence coverage
• 224,713 sequence contigs
• total of 7,418 supercontigs
• The 200 largest supercontigs span more than 98% of the assembled
sequence
– Anchoring to chromosomes
• Anchored all supercontigs >500 kb with the mouse genetic map
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
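The read count, genome size and fold coverage on this slide are tied together by the usual relation coverage = (number of reads × mean read length) / genome size. The sketch below inverts that relation to get the mean usable read length implied by the quoted numbers. The result is a rough derived value, not a figure from the source, and it ignores reads discarded during assembly.

```python
def fold_coverage(n_reads, mean_read_bp, genome_bp):
    """Average number of times each base is sequenced."""
    return n_reads * mean_read_bp / genome_bp


def implied_read_length(coverage, genome_bp, n_reads):
    """Mean read length needed to reach `coverage` with `n_reads` reads."""
    return coverage * genome_bp / n_reads


N_READS = 41.4e6      # reads quoted on the slide
GENOME_BP = 2.5e9     # euchromatic mouse genome
COVERAGE = 7.7        # fold coverage quoted on the slide

mean_read_bp = implied_read_length(COVERAGE, GENOME_BP, N_READS)
print(f"implied mean read length: {mean_read_bp:.0f} bp")
print(f"round trip: {fold_coverage(N_READS, mean_read_bp, GENOME_BP):.1f}x")
```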
The draft mouse genome sequence
¤ The euchromatic mouse genome is estimated at ~2.5 Gb
– The draft genome sequence covers ~96% of the genome
¤ Comparative analysis of human and mouse genomes
– The mouse genome is about 14% smaller than the human genome
– High degree of synteny
• >90% of the two genomes can be partitioned into
corresponding regions of conserved synteny
– At the nucleotide level, approximately 40% of the human genome
can be aligned to the mouse genome.
• represent orthologous sequences conserved from the common
ancestor
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
Synteny between mouse and human
¤ Regions containing orthologous sequence pairs define
– Syntenic segments as regions in which
• Orthologous sequence pairs are in the same order on a chromosome
in both species
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
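The definition above translates fairly directly into code. The following is a minimal sketch, assuming each orthologous landmark is given as a tuple of human chromosome, human position, mouse chromosome and mouse position; the tuple layout and the toy coordinates are hypothetical. It only checks that consecutive landmarks along the human genome stay on one mouse chromosome in a consistent direction, and does not check that no other mouse landmark intervenes.

```python
def syntenic_segments(pairs):
    """Split orthologous landmark pairs into syntenic segments.

    pairs: iterable of (human_chrom, human_pos, mouse_chrom, mouse_pos).
    A segment is a maximal run of landmarks, consecutive along the human
    genome, that lie on one mouse chromosome and whose mouse positions
    are all increasing or all decreasing.
    """
    ordered = sorted(pairs, key=lambda p: (p[0], p[1]))
    segments, current, direction = [], [], 0
    for pair in ordered:
        if current:
            prev = current[-1]
            same_chroms = pair[0] == prev[0] and pair[2] == prev[2]
            # +1 if the mouse position increases, -1 if it decreases.
            step = (pair[3] > prev[3]) - (pair[3] < prev[3])
            if same_chroms and step != 0 and direction in (0, step):
                direction = step
                current.append(pair)
                continue
            segments.append(current)
            current, direction = [], 0
        current.append(pair)
    if current:
        segments.append(current)
    return segments


toy_pairs = [
    ("1", 100, "4", 500), ("1", 200, "4", 600), ("1", 300, "4", 700),
    ("1", 400, "2", 900), ("1", 500, "2", 800), ("1", 600, "2", 700),
    ("2", 100, "2", 100),
]
segments = syntenic_segments(toy_pairs)
print([len(s) for s in segments])
```

The second toy segment runs in the opposite direction on the mouse chromosome, which corresponds to an inverted but otherwise conserved segment.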
Synteny between mouse and human
¤ Conservation of orthologous sequence pairs shows
– Each genome can be parsed into a total of 342 conserved
syntenic segments.
• The segments vary greatly in length, from 303 kb to 64.9 Mb
• In total, about 90.2% of the human genome and 93.3% of the mouse
genome reside within conserved syntenic segments
– The segments can be aggregated into a total of 217 conserved
syntenic blocks
¤ The syntenic block and segment sizes are
– consistent with the random breakage model of genome evolution
– the minimal number of rearrangements needed to 'transform'
one genome into the other is 295 rearrangements
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
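One simple way to get a feel for rearrangement counts is to count breakpoints: adjacencies between syntenic blocks that are present in one genome but not in the other. Since a single inversion can remove at most two breakpoints, half the breakpoint count gives a lower bound on the number of inversions. The sketch below handles a single chromosome written as a signed permutation; it is an illustration of the lower bound only, not the procedure used to obtain the figure quoted on the slide, and the example permutation is hypothetical.

```python
import math


def breakpoints(signed_blocks):
    """Count breakpoints of a signed permutation relative to +1 +2 ... +n.

    Each block is an integer 1..n; a negative sign means the block is
    inverted. The adjacency (a, b) is conserved exactly when b - a == 1,
    which also covers inverted runs such as -3 -2.
    """
    n = len(signed_blocks)
    extended = [0] + list(signed_blocks) + [n + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)


def inversion_lower_bound(signed_blocks):
    """Each inversion removes at most two breakpoints."""
    return math.ceil(breakpoints(signed_blocks) / 2)


example = [1, -3, -2, 4]   # blocks 2 and 3 inverted as a unit
print(breakpoints(example), inversion_lower_bound(example))
```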
Blocks of conserved synteny in the human and
mouse genomes
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
Repetitive sequences in human and mouse
¤ The most prevalent feature of mammalian genomes is
their high content of repetitive sequences
– Most of which are interspersed repeats representing 'fossils' of
transposable elements
¤ The repetitive sequences in mouse and human differ
– Only 37.5% of the mouse genome is transposon-derived
– ~46% of the human genome is transposon-derived
– Insertions of transposable elements occurred in the last 150–200
million years
• The most notable difference is the rate of transposition over time
– in mouse the rate has remained fairly constant
– in human the rate increased to a peak at ~40 Myr, and
then plummeted
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
Age distribution of interspersed repeats in the mouse
and human genomes
[Figure: separate panels for mouse and human]
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
Protein-coding genes in mouse and human
¤ Human and mouse gene catalogues
– The current human gene catalogue (Ensembl build 29) contains
22,808 predicted genes
– The current mouse gene catalogue contains 22,011 predicted genes
¤ Comparative analysis of protein coding genes shows
– 80% of the mouse genes have orthologues in the human genome
• The proportion of mouse/human genes without any homologue is < 1%.
– Many local gene family expansions have occurred in the mouse
lineage
• Most seem to involve genes for reproduction, immunity and olfaction
– The rate of protein evolution
• Most proteins evolve at fairly constant rate
• Certain proteins evolve much more rapidly: positive selection
– Proteins implicated in reproduction, host defence and immune
response seem to be under positive selection, which drives
their rapid evolution
Reprinted from: Mouse Genome Sequencing Consortium, Nature 420, 520 - 562 (2002)
Conclusions
¤ The mouse genome provides a powerful resource to
unravel the secrets of the human genome
– Demonstrates the power of comparative genomics in identifying
relevant genetic elements
– These findings inspired additional animal genome sequencing
projects to fully exploit the power of comparative genomics
• As illustrated in: Thomas et al., Nature 423, 788 - 793 (2003)
– The sequence provides a comprehensive framework for
functional genomics approaches to unravel gene functions in both
human and mouse
Genome sequence of the Brown Norway rat
yields insights into mammalian evolution
RGSPC, Nature 428, 493 - 521 (2004)
¤ Paper presents
– a high-quality 'draft' sequence covering > 90% of the genome
– a three-way comparison with the human and mouse genomes to
study the mammalian genome evolution
• Rat - mouse common ancestor: 12–24 Myr
• Rodent - human common ancestor: 75 Myr
The rat genome sequence
¤ The draft genome sequence covers 2.75 Gb
– A 'combined' sequencing strategy using
• WGS sequencing and light sequence coverage of BACs
• Sequential assembly of 'enriched BACs' (eBACs) joined into bactigs,
superbactigs and ultrabactigs
[Figure: assembly of an enriched BAC (eBAC)]
Reprinted from: RGSPC, Nature 428, 493 - 521 (2004)
Rat – mouse – human genome sequences
¤ Sequence elements in
human, mouse and rat
genomes
– 40% align in all 3 species
• 'ancestral core' of 1 Gb
• 95% of the exons and
regulatory regions
– 28% aligns only with mouse
• rodent-specific repeats
– 29% does not align
• rat-specific repeats
Reprinted from: RGSPC, Nature 428, 493 - 521 (2004)
Evolution of genes
¤ Estimate that 90% of rat genes possess
– strict orthologues in both mouse and human
• Intronic structures are well conserved
– Most of the non-orthologous genes
• Arose by expansions of gene families in the different lineages
• Rapidly evolving genes
– Rat-specific genes comprise novel genes for “life style”
• pheromones, immunity, chemosensation, detoxification, proteolysis
Reprinted from: RGSPC, Nature 428, 493 - 521 (2004)
Rat – mouse – human synteny
¤ orthologous chromosome
segments
– 105 mouse–rat segments
– 278 human-rat segments
– 280 human-mouse segments
Reprinted from: RGSPC, Nature 428, 493 - 521 (2004)
Rat – mouse – human genome rearrangements
¤ Reconstruction of the ancestral mammalian genome
– Identified a total of 353 rearrangements
• 247 between the murid ancestor and human
• 50 from the murid ancestor to mouse
• 56 from the murid ancestor to rat
– much higher (3x) rearrangement rate in the rodent than in the
human lineage
[Figure: lineage tree annotated with the rearrangement counts 247, 50 and 56]
Reprinted from: RGSPC, Nature 428, 493 - 521 (2004)
The Human Genome
¤ The human genome project was launched in 1990
– Phase I: generation of genetic and physical maps (1990-1995)
• Demonstration that large scale sequencing is feasible: yeast, worm
– Phase II: large scale sequencing (1995-2005)
• Pilot phase: finished sequence with 99.99% accuracy and no gaps of
the human chromosomes 21 and 22 (published in 2000 and 1999, respectively)
• Draft phase: draft sequence covering >90% of the genome
completed in June 2000 (published in 2001) – took ~1 year
• Finishing phase: “finished” covering 99% of the genome sequence,
completed in spring 2004 – took ~3 years
• Aftermath: no end point projected
– closing the last couple of hundred gaps
– Sequencing the centromeres
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
The Human Genome Sequences
¤ Draft genome sequences (2001)
– International Human Genome Sequencing Consortium
(Collaboration of 20 public sequencing centers in 6 countries)
• Used a hierarchical shotgun sequencing strategy
• Sequence published in: Nature 409, 860 (2001)
– Celera Genomics - private initiative
• Used a whole-genome shotgun approach
• Assembly of the sequence combined their whole-genome shotgun
data and the public genome sequence data
• Sequence published in: Venter et al., Science, 291, 1304 (2001)
¤ Finished genome sequence (2004)
– International Human Genome Sequencing Consortium
• Sequence published in: Nature 431, 931 - 945 (2004)
Finished Human Genome Sequence
¤ Finishing process: complex iterative process
– Resolving problematic sequences
• From single nucleotide errors and gaps to the integrity of whole
chromosomes
– The finishing process involved two distinct components
• producing finished maps consisting of continuous and accurate paths
of overlapping large-insert clones
• producing finished clone sequences, consisting of continuous and
accurate nucleotide sequences for each clone
– generated shotgun sequence of ~59,000 BACs comprising a
total sequence (redundant) length of 5.8 Gb
– Assembled sequences of ~46,000 BACs
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
Finished Human Genome Sequence
¤ Finished genome sequence
– Build 35 comprises 2,851 Mbp
– Interrupted by “only” 341 gaps
• 308 gaps in the euchromatic sequence: totalling ~28 Mb
• 33 heterochromatic gaps (including 24 centromeres) : total ~198 Mb
– The total human genome size is estimated at ~3,080 Mb
¤ Comparison with draft sequence
– Substantially fewer gaps (341 versus 147,821)
– More accurate and complete sequence: error rate ~1 per 10^5 bases
• Confirmed local order and orientation of the sequences
• Corrected artefactual duplications resulting from mixups
• Verified most of the sequence with
– BAC cloned overlap sequence, paired end sequence reads from
fosmids, draft chimpanzee genome sequence
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
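The sequence and gap figures on this slide can be cross-checked with simple addition. The sketch below sums the Build 35 sequence and the two gap classes, and computes what fraction of the euchromatic genome is covered; the variable names are illustrative.

```python
BUILD35_MB = 2851              # finished sequence in Build 35
EUCHROMATIC_GAPS_MB = 28       # 308 euchromatic gaps
HETEROCHROMATIC_GAPS_MB = 198  # 33 heterochromatic gaps

total_genome_mb = BUILD35_MB + EUCHROMATIC_GAPS_MB + HETEROCHROMATIC_GAPS_MB
euchromatic_mb = BUILD35_MB + EUCHROMATIC_GAPS_MB
euchromatic_coverage = BUILD35_MB / euchromatic_mb

print(f"total genome: ~{total_genome_mb} Mb")
print(f"euchromatic coverage: {euchromatic_coverage:.1%}")
```

The sum of 3,077 Mb agrees with the ~3,080 Mb estimate, and the euchromatic coverage of about 99% matches the coverage quoted for the finished sequence.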
Finished Human Genome Sequence
¤ Importance of a completely finished genome sequence
– Accurate reference for identifying genetic variation in the human
population
• Error rate of 10^-5 << SNP frequency of 10^-3
– Identification of segmental duplications
• Estimated to cover >5% of the genome sequence
– Located primarily in the pericentromeric and subtelomeric regions
– much higher than in mouse and rat
• Great medical interest: predisposes to deletion or rearrangement
– Williams syndrome region (7q)
– Charcot–Marie–Tooth region (17p)
– DiGeorge syndrome region (22q)
• Many remaining gaps involve unresolved segmental duplications
– Correct identification of all protein-coding gene structures
• ~60% of the gene models were corrected compared to the draft
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
Estimates of the Number of Human Genes
¤ Reassociation kinetics (60s and 70s)
– Early studies estimated the mRNA complexity of typical
vertebrate tissues to be 10,000–20,000, and were extrapolated to
suggest around 40,000 genes for the entire genome
¤ Estimates from approximate gene and genome sizes
– Calculation based on the size of a typical gene (3 × 10^4 bp) and the size of
the genome (3 × 10^9 bp) yielded 100,000 genes (W. Gilbert, pers. comm.)
¤ Number of CpG islands associated with known genes
– An estimate of 70,000–80,000 genes was made
¤ Estimates based on ESTs
– Estimates based on ESTs varied widely, from 35,000 to 120,000 genes
– Discrepancy results from contaminating genomic sequences and multiple
ESTs from single genes
¤ Whole-genome shotgun sequence from the pufferfish
– Suggested around 30,000 human genes
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Identification of Protein Coding Genes
¤ Draft genome sequence
– Initial estimate: 30,000–35,000 genes
¤ Finished genome sequence
– The human gene catalogue contains 22,287 gene loci consisting of
19,438 known genes and 2,188 predicted genes
– Current estimate: 20,000–25,000 genes
• 25,000 is an upper limit, the actual number may be ~23,000
• Consistent with gene counts in other vertebrates: fish and chicken
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
Basic Characteristics of Gene Structures
¤ Mean and median values of gene structures
– Based on the draft sequence
• In particular, the UTRs in the RefSeq database are incomplete
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Protein Coding Genes
¤ General features of human genes
– average coding length of about 1,400 bp
• Similar to all eukaryotic organisms
– average genomic extent of about 30 kb
• Much larger than in lower eukaryotes
– The variation in gene and intron size
• GC-rich regions: gene-dense with many compact genes
• AT-rich regions: gene-poor with genes containing large introns
¤ Known and predicted exons: ~231,000
– 1.2% of the human genome
– Average of 10.4 exons per gene
¤ Pseudogenes
– Current estimates: 20,000 processed and unprocessed
pseudogenes
• The total number of pseudogenes is thus likely to exceed the total
number of functional genes
• Only those of recent origin can be identified with confidence
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
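The gene statistics quoted in this part of the lecture can be checked against each other. The sketch below reproduces the naive genome-size-over-gene-size estimate mentioned earlier and compares it with what the finished-sequence figures imply. It assumes the ~3,080 Mb genome size and the 22,287 gene loci from the preceding slides; the derived fractions are rough values computed here, not figures from the source.

```python
GENOME_BP = 3.08e9            # ~3,080 Mb total genome size
GENE_LOCI = 22_287            # gene loci in the finished catalogue
MEAN_GENE_EXTENT_BP = 30_000  # average genomic extent of a gene
MEAN_CODING_BP = 1_400        # average coding length
EXONS = 231_000               # known and predicted exons
EXONS_PER_GENE = 10.4


def naive_gene_count(genome_bp, gene_bp):
    """Genome size over gene size; assumes genes tile the whole genome."""
    return genome_bp / gene_bp


naive = naive_gene_count(3e9, 3e4)
genic_fraction = GENE_LOCI * MEAN_GENE_EXTENT_BP / GENOME_BP
coding_fraction = GENE_LOCI * MEAN_CODING_BP / GENOME_BP
genes_from_exons = EXONS / EXONS_PER_GENE

print(f"naive estimate: {naive:,.0f} genes")
print(f"genome spanned by genes: {genic_fraction:.0%}")
print(f"genome that is coding: {coding_fraction:.1%}")
print(f"genes implied by exon count: {genes_from_exons:,.0f}")
```

Genes spanning only about a fifth of the genome is one way to see why the naive estimate of 100,000 genes overshoots; the exon count divided by exons per gene lands close to the catalogue size.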
Basic Characteristics of Gene Structures
¤ High variation in overall intron size
– distribution has very long tails
• Many genes are over 100 kb long
– Largest gene: dystrophin gene (DMD) 2.4 Mb
– longest known coding sequence: titin gene 80,780 bp, 178 exons
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Comparison with fly, worm and yeast
¤ Apparent homologues of human proteins
– 40% to 60% of the yeast, worm and fly proteomes
¤ Human genes differ from those in worm and fly
– Spread out over much larger regions of genomic DNA
– Have a substantially larger number of exons
• 4.5 to 5 in fly and worm compared to 10.4 in human
– Are used to construct more alternative transcripts
• Larger number of proteins in human than in the worm or fly
¤ Increased complexity of the proteome
– Complexity of the human proteome is a consequence of large-scale protein innovation
• Multi-domain proteins with multiple functions, and domain
architectures
Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)
Protein Coding Gene Evolution in Human
¤ Gene birth in the human lineage
– gene duplications that arose after divergence from the mouse
• Identified 1,183 gene clusters containing 3,300 recently duplicated
genes ( with a peak 3–4 million years ago ) enriched in genes with
– immune function
– olfactory function
– reproductive functions
– Duplicated genes are the raw material for adaptive evolution:
• extra copies are able to undergo functional divergence in response to
positive selection
¤ Gene death in the human lineage
– Recently inactivated genes include genes in olfactory function
Reprinted from: International Human Genome Sequencing Consortium Nature 431, 931 - 945 (2004)
Future Perspectives
¤ Vertebrate genome sequencing projects ongoing or
planned (currently totaling 25)
– Fish: zebrafish, salmon, tilapia, stickleback and Japanese medaka
– Amphibians: Xenopus laevis and X. tropicalis
– Birds: turkey
– Mammals: ~15 additional species
• cow, pig, cat, dog, horse, rabbit, guinea pig, elephant, kangaroo,
shrew, …
– Primates: chimp, orangutan, baboon and rhesus monkey
¤ Source: GOLD™ Genomes OnLine Database
– http://www.genomesonline.org/
Genome Biology and
Biotechnology
4. The variable human genome
International course 2005
Summary
¤ Sequence Variations in the Human Genome
¤ Haplotype structure of the sequence variations in the
human genome
– linkage disequilibrium in the human genome
– haplotype blocks in the human genome
¤ The haplotype map of the human genome
– Map of all the genetic variations in the human population
Sequence Variations in the Human Genome
¤ Most human sequence variation (>90%) results from
– SNPs (single nucleotide polymorphisms)
– SNPs are the result of very rare replication errors in which a
wrong base remains incorporated in the newly synthesized
strand
¤ Human sequence variation is responsible for
– Phenotypic variation between individuals
– Influencing the risk of common human diseases
Root causes of common human diseases
¤ Causes of human diseases are largely unknown
– preventative measures are generally inadequate
– available treatments are seldom curative
¤ Family history is one of the strongest risk factors
for nearly all diseases
– cardiovascular disease, cancer, diabetes, autoimmunity,
psychiatric illnesses and many others
– inherited genetic variation has an important role in the
pathogenesis of disease
¤ Identifying the causal genes and variants represents
an important step towards
– improved prevention, diagnosis and treatment of disease
The International HapMap Consortium, Nature 437, 1299 (2005)
Heritable human diseases
¤ Rare highly heritable 'mendelian' disorders
– > 1000 genes have been identified
– variation in a single gene causes disease
¤ Common human diseases
– are thought to be due to the combined effect of
• many different susceptibility DNA variants
• interacting with environmental factors
– have proven much more challenging to study
The International HapMap Consortium, Nature 437, 1299 (2005)
Common human diseases
¤ Studies of common diseases: 2 broad classes
– family-based linkage studies across the entire genome
• linkage analysis has low power except when
– a single locus explains a substantial fraction of disease
– population-based association studies of candidate genes
• association studies examine only a small fraction of the 'universe' of
sequence variation in each patient
¤ Comprehensive search for genetic influences
– examining all genetic differences in a large number of affected
individuals and controls
• complete genome resequencing
• systematically test common genetic variants
The International HapMap Consortium, Nature 437, 1299 (2005)
Common genetic variants
¤ Common genetic variants
– explain much of the genetic diversity in our species
– a consequence of the historically small size and shared ancestry
of the human population
¤ Common variants with an important role in disease
• HLA: autoimmunity and infection
• APOE4: Alzheimer's disease, lipids
• Factor V Leiden: deep vein thrombosis
• PPARG (encoding PPARγ): type 2 diabetes
• KCNJ11: type 2 diabetes
• PTPN22: rheumatoid arthritis and type 1 diabetes
• CTLA4: autoimmune thyroid disease, type 1 diabetes
• NOD2: inflammatory bowel disease
• complement factor H: age-related macular degeneration
• RET: Hirschsprung disease
The International HapMap Consortium, Nature 437, 1299 (2005)
Sequence Variations in the Human Genome
¤ Most human sequence variation (>90%) results from
– SNPs (single nucleotide polymorphisms)
– SNPs occur on average every 1,000 bases when the sequences
of two human individuals are compared
– Remainder of the human sequence variation is attributable to
• insertions or deletions of one or more bases
• repeat length polymorphisms
• Rearrangements
¤ SNPs are well suited to automated, high-throughput
and low cost genotyping
– SNPs are binary and can thus easily be typed
– SNPs have a low rate of recurrent mutation
– SNPs are present at sufficient density for comprehensive
genetic analysis
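The "one SNP per 1,000 bases" figure refers to differences found when two aligned sequences are compared position by position. A minimal sketch of that comparison is shown below, using made-up toy sequences; real data would also need to handle insertions and deletions, which this sketch ignores. The genome-wide extrapolation assumes the ~3,080 Mb genome size quoted earlier in the lecture.

```python
def snp_sites(seq_a, seq_b):
    """Return 0-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def differences_per_kb(seq_a, seq_b):
    """Number of single-base differences per 1,000 aligned bases."""
    return 1000.0 * len(snp_sites(seq_a, seq_b)) / len(seq_a)


# Toy example: two 1,000-base sequences differing at one position.
toy_a = "ACGTACGTAC" * 100
toy_b = toy_a[:499] + "A" + toy_a[500:]   # C -> A at position 499

rate = differences_per_kb(toy_a, toy_b)
expected_genome_differences = 3.08e9 * rate / 1000.0

print(snp_sites(toy_a, toy_b), rate)
print(f"extrapolated to the genome: {expected_genome_differences:,.0f}")
```

At this rate two human genomes would differ at roughly three million single-base positions.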
High throughput SNP Genotyping Methods
¤ Primer extension
– Primer designed adjacent to the SNP, extended and the
extension product analyzed
• Fluorescence
• Mass spectrometry
¤ Oligonucleotide ligation
– Ligation requires perfect base pairing of the terminal
nucleotides
¤ Array-based hybridization
– high density Affymetrix microarrays
• 25-mer oligonucleotides are perfectly suited to discriminate SNP
alleles
• Latest product: 500,000 SNP array
Haplotype structure of the sequence variations
¤ Human genetic diversity appears to be
– Limited: a small number of common polymorphisms explain the
bulk of the observed variation,
• i.e. are found in most individuals in the population
– Structured: specific combinations of alleles – haplotypes – are
observed at closely linked sites
[Figure: Haplotype 1, Haplotype 2 and Haplotype 3, with SNP positions and a recombination event marked]
Haplotype Structure of the Sequence Variations
¤ At a macroscopic scale (chromosome)
– recurrent recombination results in complete linkage equilibrium
• random combinations of SNPs
[Figure: 1st generation to N generations; N recombination events lead to a random assortment of SNPs]
Haplotype Structure of the Sequence Variations
¤ At a microscopic scale (gene)
– Non-random recombination results in linkage disequilibrium
• Non-random combinations of SNPs: haplotype blocks
[Figure: from the 1st generation to N generations, SNPs remain grouped in haplotype blocks]
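The contrast between the chromosome scale and the gene scale can be illustrated with the standard population-genetic decay relation D_t = D_0 (1 - r)^t, where r is the recombination fraction between two sites per generation. The sketch below is only an illustration: the uniform rate of 1e-8 per base pair per generation and the number of generations are assumed values, not figures from the slides, and drift, selection and recombination hotspots are ignored.

```python
def recombination_fraction(distance_bp, rate_per_bp=1e-8):
    """Approximate recombination fraction between two sites.

    rate_per_bp is an ASSUMED genome-average rate (about 1 cM per Mb).
    The value is capped at 0.5, the limit for unlinked sites.
    """
    return min(0.5, distance_bp * rate_per_bp)


def ld_remaining(distance_bp, generations, rate_per_bp=1e-8):
    """Fraction of the initial disequilibrium D_0 left after `generations`."""
    r = recombination_fraction(distance_bp, rate_per_bp)
    return (1.0 - r) ** generations


if __name__ == "__main__":
    generations = 1000  # hypothetical number of generations
    for distance in (1_000, 100_000, 50_000_000):
        left = ld_remaining(distance, generations)
        print(f"{distance:>11,} bp apart: {left:.3f} of D_0 remains")
```

Under these assumptions, sites a kilobase apart keep almost all of their initial association, while sites tens of megabases apart lose it entirely, which is the pattern the two slides describe.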
Linkage disequilibrium in the human genome
Reich et al., Nature 411, 199 (2001)
¤ Landmark paper presenting
– a systematic analysis of the extent of linkage disequilibrium in
the human genome
– a large-scale experiment to measure linkage disequilibrium (LD)
in 19 randomly selected genomic regions in
• United States population of north-European descent
• Nigerian population
Experimental Approach
¤ Selected 19 high-frequency or common SNPs in genes
as core SNPs
– High-frequency SNPs tend to be common in all populations,
facilitating cross-population comparisons
– Linkage disequilibrium around common alleles can be measured
with a modest sample size of 80–100 chromosomes
– Linkage disequilibrium around common alleles represents a
'worst case' scenario
• Such alleles are generally old and there has been ample historical
opportunity for recombination to break down ancestral haplotypes
Reprinted from: Reich et al., Nature 411, 199 (2001)
Experimental Approach
¤ High frequency SNPs were identified at various
distances from the core SNPs
– Re-sequenced regions of ~ 2 kb at 0, 5, 10, 20, 40, 80 and
160 kb from the core SNP in 44 unrelated individuals from Utah
• Identified a total of 272 'high frequency' polymorphisms
– Measured linkage disequilibrium between two SNPs using the
classical statistic D'
• D' = observed linkage/maximal linkage: D/Dmax, where D = P_AB − P_A·P_B
and Dmax is the largest value of D possible for the given allele frequencies
[Figure: positions of the re-sequenced regions (up to 160 kb) relative to the core SNP]
Reprinted from: Reich et al., Nature 411, 199 (2001)
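As a worked illustration of the D' statistic, here is a minimal Python sketch using the textbook definition: D is the difference between the observed haplotype frequency and the product of the allele frequencies, and D' scales D by its maximum attainable value. The function name and the example frequencies are hypothetical, and the absolute value |D'| is returned, as is commonly reported.

```python
def d_prime(p_ab, p_a, p_b):
    """Return (D, |D'|) for two biallelic SNPs.

    p_ab: frequency of haplotypes carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        return d, 0.0  # at least one site is monomorphic
    return d, abs(d / d_max)


# Hypothetical frequencies: only AB and ab haplotypes exist (complete LD)
print(d_prime(0.5, 0.5, 0.5))
# Hypothetical frequencies: alleles combine at random (linkage equilibrium)
print(d_prime(0.25, 0.5, 0.5))
```

In the first call the two alleles always travel together, so |D'| is 1; in the second the haplotype frequency equals the product of the allele frequencies, so D and |D'| are both 0.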
Observed Linkage Disequilibrium
¤ Linkage disequilibrium has a half-length of ~ 60 kb
– linkage disequilibrium extends much (10-fold) further than
previously predicted
Reprinted from: Reich et al., Nature 411, 199 (2001)
Why does linkage disequilibrium extend so far?
¤ Long-range linkage disequilibrium can be explained by
– an extreme founder effect or population bottleneck
• A period when the population was so small that a few
ancestral haplotypes gave rise to the present day haplotypes
¤ Linkage disequilibrium in different populations
– short-range linkage disequilibrium is general in sub-Saharan
African populations
– long-range linkage disequilibrium is typical for northern
Europeans
• a severe bottleneck in the European population could have generated
the linkage disequilibrium
Reprinted from: Reich et al., Nature 411, 199 (2001)
Origin of linkage disequilibrium?
¤ The bottleneck could be specific to northern Europe
– Europe was substantially depopulated during the Last Glacial
Maximum (30,000–15,000 years ago), and subsequently
recolonized by a small number of founders
• Long-range linkage disequilibrium would be absent in other non-African populations
¤ The bottleneck is more global
– Result of the dispersal of the modern humans from Africa
50,000 years ago
• Long-range linkage disequilibrium would then be present in a variety
of non-African populations
Reprinted from: Reich et al., Nature 411, 199 (2001)
High-resolution Haplotype Structure in the
Human Genome
Daly et al., Nature Genet. 29, 229 (2001)
¤ Landmark paper presenting
– High-resolution analysis of the haplotype structure across 500
kb region on chromosome 5q31
• Genotyped 103 common SNPs in 129 trios from a European-derived
population
– Low marker density of 1 SNP roughly every 5 kb
– First high-resolution picture of the patterns of genetic variation
across a large genomic region
Block-like Haplotype Diversity at 5q31
¤ The common SNPs are arranged in haplotype blocks
– span up to 100 kb
– contain multiple (five or more) common SNPs
– have only a few (2–4) haplotypes, which
• account for the majority of chromosomes (>90%) in the sample
• show no evidence of being derived from one another by
recombination
Reprinted from: Daly et al., Nature Genet. 29, 229 (2001)
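The statement that a few haplotypes account for most chromosomes in a block can be checked by simple counting. The sketch below tallies haplotypes within one block and reports the share of chromosomes covered by the most common ones. The 20 chromosomes and their allele strings are invented toy data, not data from the paper.

```python
from collections import Counter


def haplotype_coverage(chromosomes, top_n):
    """Return the top_n most common haplotypes and the fraction of
    chromosomes they account for."""
    counts = Counter(chromosomes)
    common = counts.most_common(top_n)
    covered = sum(n for _, n in common)
    return common, covered / len(chromosomes)


# Toy block of five SNPs typed on 20 chromosomes (hypothetical data)
block = ["GGACA"] * 9 + ["AATTC"] * 6 + ["GATCA"] * 4 + ["AGTCC"] * 1

common, fraction = haplotype_coverage(block, top_n=3)
for haplotype, n in common:
    print(haplotype, n)
print(f"top 3 haplotypes cover {fraction:.0%} of chromosomes")
```

With this toy sample the three most common haplotypes cover 19 of the 20 chromosomes.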
Block-like Haplotype Diversity at 5q31
¤ The haplotype blocks are separated by intervals
– in which several independent historical recombination events
seem to have occurred
¤ The historical recombination events are clustered
– multiple exchanges between most blocks
– little or no recombination within blocks.
– The clustering of recombination events is suggestive of local
hotspots of recombination
Historical recombination events
Reprinted from: Daly et al., Nature Genet. 29, 229 (2001)
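One simple way to look for traces of historical recombination between two biallelic SNPs is the four-gamete test: if recurrent mutation is ruled out, observing all four allele combinations at a pair of sites implies at least one recombination event between them. The sketch below applies this test to toy haplotypes. It is a stand-in illustration and not necessarily the analysis used in the paper; the haplotype strings are hypothetical.

```python
from itertools import combinations


def four_gametes_present(haplotypes, i, j):
    """True if all four allele combinations occur at biallelic sites i and j."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


def recombinant_pairs(haplotypes):
    """List site pairs (i, j) that show evidence of historical recombination."""
    n_sites = len(haplotypes[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if four_gametes_present(haplotypes, i, j)
    ]


# Toy data: sites 0-2 form one block, sites 3-5 another (alleles coded 0/1)
haplotypes = ["000000", "111111", "000111", "111000"]
print(recombinant_pairs(haplotypes))
```

In this toy example no pair of sites within a block shows all four gametes, while every pair spanning the two blocks does, mirroring the picture of little recombination within blocks and multiple exchanges between them.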
Implications of Haplotype blocks
¤ Once the haplotype blocks are identified
– they can be treated as alleles in genome-wide association studies
to find medically relevant variation
• Holy grail of pharmacogenetics
– a subset of SNPs, the haplotype tag SNPs (htSNPs), can be used to
uniquely distinguish the common haplotypes in each block
• A subset of all the SNPs is sufficient for whole-genome association
analysis
Reprinted from: Daly et al., Nature Genet. 29, 229 (2001)
Blocks of Limited Haplotype Diversity Revealed
by High-Resolution Scanning of Human
Chromosome 21
Patil et al., Science, 294: 1719 (2001)
¤ Landmark paper presenting
– the haplotype structure of chromosome 21
– Used high-density oligonucleotide arrays, in combination with somatic cell
genetics
• To identify the common SNPs on human chromosome 21
• To directly observe the haplotype structure defined by these SNPs
• This structure reveals blocks of limited haplotype diversity in which
more than 80% of a global human sample can typically be
characterized by only three common haplotypes
Experimental Approach
¤ Discovered chr 21 SNPs and determined the
haplotype structure using
– ultra high-density oligonucleotide arrays
– in combination with somatic cell genetics
¤ SNP discovery
– Using a public panel of 24 ethnically diverse individuals
• African, Asian, and Caucasian
– Physically separated the two chr 21 copies from each individual
• using a rodent-human somatic cell hybrid technique
– Analyzed 20 independent copies of chromosome 21
¤ Since SNPs are characterized on haploid copies
– they directly reveal haplotypes
– The SNPs of chromosome 21 reveal numerous haplotype blocks
Reprinted from: Patil et al., Science, 294: 1719 (2001)
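Why haploid copies reveal haplotypes directly can be seen by contrast with diploid genotypes: a genotype that is heterozygous at k SNPs is compatible with 2^(k-1) different haplotype pairs, so the phase is ambiguous whenever k is 2 or more. The sketch below enumerates those pairs for a toy genotype. The function name and the genotype are hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele1, allele2) tuples, one per SNP.
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        hap1 = [a for a, _ in genotype]
        hap2 = [b for _, b in genotype]
        # Swap the alleles at the chosen heterozygous sites
        for site, swap in zip(het_sites, choice):
            if swap:
                hap1[site], hap2[site] = hap2[site], hap1[site]
        pairs.add(frozenset(("".join(hap1), "".join(hap2))))
    return pairs


# Toy diploid genotype: heterozygous at SNPs 1 and 2, homozygous at SNP 3
genotype = [("A", "G"), ("C", "T"), ("G", "G")]
for pair in compatible_haplotype_pairs(genotype):
    print(sorted(pair))
```

For this genotype two pairs are possible (ACG with GTG, or ATG with GCG), whereas a physically separated chromosome copy gives one unambiguous haplotype.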
Haplotype Block Defined by 14 Common SNPs
[Figure: haplotype blocks 1 to 6 along chromosome 21, showing major and minor alleles for a block of consecutive common SNPs; label: 15/20 individual chromosomes; axis: nucleotide position on chromosome 21]
Reprinted from: Patil et al., Science, 294: 1719 (2001)
Haplotype Block: selection of tag SNPs
[Figure: haplotype patterns 1 to 4 (the 4 common haplotypes) and the SNPs selected for genotyping]
Reprinted from: Patil et al., Science, 294: 1719 (2001)
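The idea of selecting tag SNPs for a block can be sketched as a search for the smallest set of SNP positions whose alleles still tell all common haplotypes apart. The code below uses a plain exhaustive search, which is only practical for the handful of SNPs in a single block. It is an illustration of the principle, not the algorithm used in the paper, and the four toy haplotypes are hypothetical.

```python
from itertools import combinations


def select_tag_snps(haplotypes):
    """Return the smallest tuple of SNP positions that distinguishes every
    haplotype, or None if two haplotypes are identical."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for positions in combinations(range(n_sites), size):
            signatures = {tuple(h[p] for p in positions) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return positions
    return None


# Four common haplotypes over four SNPs (alleles coded 0/1, toy data).
# SNPs 0 and 1 always carry the same allele, as do SNPs 2 and 3.
common_haplotypes = ["0000", "0011", "1100", "1111"]
print(select_tag_snps(common_haplotypes))
```

Here the search returns positions (0, 2): one SNP from each correlated pair is enough, so only two of the four SNPs need to be genotyped.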
Inventories of human genome sequence variation
¤ The first inventory of SNPs was made by
– The public Human Genome Project (HGP)
• 971,077 candidate SNPs were identified as sequence differences in
regions of sequence overlap between large-insert clones
– The SNP Consortium (TSC) – a public/private consortium
• Discovered using a publicly available panel of 24 ethnically diverse
individuals
• 1,023,950 candidate SNPs identified by shotgun sequencing of
genomic fragments and aligning to the genome sequence
¤ First inventory (2001) comprised 1.4 million SNPs
– Average density of one SNP every 1.91 kb
– SNPs primarily in regions surrounding genes
• estimate 60,000 exonic SNPs in the collection
The International SNP Map Working Group, Nature 409, 928 (2001)
Human genome sequence variation
¤ It is estimated that in the world's human population
– about 10 million “common” SNPs
• With a minor allele frequency of 1% or more
• one variant per 300 bases on average
– these 10 million common SNPs constitute 90% of the variation in
the world population
– The remaining 10% of the variation is due to
• a large number of SNPs that are rare in the population
• these may represent another 30 million SNPs
¤ Next frontier in the human genome
– Complete inventory of the common SNPs
– Complete map of the common SNPs: The HapMap project
The International SNP Map Working Group, Nature 409, 928 (2001)
The International HapMap Project
The International HapMap Consortium, Nature 426, 789 - 796 (2003)
¤ The goal of the International HapMap Project
– determine the common patterns of DNA sequence variation in
the human genome and
– make this information freely available in the public domain.
¤ The HapMap will
– allow the discovery of sequence variants that affect common
disease
– facilitate development of diagnostic tools
– enhance our ability to choose targets for therapeutic
intervention
The International HapMap Project
¤ Determine haplotype patterns across the genome
– 5 million common sequence variants
• genotyped in 270 DNA samples from populations of Africa, Asia and Europe
– Common SNPs are found in all populations
• Project includes several populations from different geographic
locations
– Yoruba, Japanese, Chinese individuals and individuals with
ancestry from Northern and Western Europe
¤ Genotyping strategy
– Phase I
• initial round of genotyping of 1,000,000 SNPs in the 270 DNA samples
– completed December 2004
– Phase II
• genotyped 5 million SNPs at ~ 1-kilobase intervals in 270 individuals
– Completed November 2005
The International HapMap Project
¤ The extent of association between nearby markers
– varies dramatically across the genome
– the patterns of association must be empirically determined for
efficient selection of tag SNPs.
¤ On the basis of empirical studies it is estimated that
– most of the information about genetic variation represented by
the 10 million common SNPs in the population could be provided
• by genotyping 200,000 to 1,000,000 tag SNPs across the genome
– Thus, a substantial reduction in the amount of genotyping can be
obtained with little loss of information, by
• using knowledge of the LD present in the genome.
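The size of the saving can be worked out directly from the numbers on this slide. The short sketch below computes the fold reduction in genotyping when only tag SNPs are typed; the function name is hypothetical and the inputs are the estimates quoted above.

```python
def fold_reduction(total_snps, tag_snps):
    """How many times fewer assays are needed when only tag SNPs are typed."""
    return total_snps / tag_snps


COMMON_SNPS = 10_000_000  # estimated number of common SNPs quoted on the slide

for tag_count in (200_000, 1_000_000):
    fold = fold_reduction(COMMON_SNPS, tag_count)
    share = 100 * tag_count / COMMON_SNPS
    print(f"{tag_count:,} tag SNPs: {fold:.0f}-fold fewer assays "
          f"({share:.0f}% of the common SNPs)")
```

The estimated range of tag SNPs therefore corresponds to a 10-fold to 50-fold reduction in genotyping effort.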
Perspectives
¤ For the full potential of the HapMap to be realized
– The genotyping technology must
• become more cost efficient, and the analysis methods must be improved
– Pilot studies with other populations must be completed
• to confirm that the HapMap is generally applicable
¤ Genome-wide association projects must establish
– carefully phenotyped sets of affected and unaffected individuals for
many common diseases in a way that
• preserves confidentiality
• retains detailed clinical and environmental exposure data
¤ Careful attention must also be paid to the ethical issues that
– will be raised by the HapMap and the studies that will use it
– include the challenge of avoiding misinterpretation or misuse of results from studies
that use the HapMap
Whole-Genome Patterns of Common DNA
Variation in Three Human Populations
Hinds et al., Science. 307: 1072-1079 (2005)
¤ Paper presents
– Whole-genome patterns of common human DNA variation by
genotyping 1,586,383 SNPs in 71 Americans of European,
African, and Asian ancestry
– Different approach to represent the structure of genetic
variation
• LD bins: clusters of tightly linked SNPs
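One common way to build clusters of tightly linked SNPs is a greedy grouping on the pairwise correlation r² = D² / (p_A(1 - p_A) p_B(1 - p_B)). The sketch below is an assumption-laden illustration: the r² threshold of 0.8, the greedy rule (pick the SNP with the most partners above the threshold, then remove its bin) and the toy haplotypes are all chosen for the example and are not taken from the slide.

```python
def r_squared(haplotypes, i, j):
    """Squared allelic correlation between biallelic sites i and j
    (alleles coded 0/1 on phased haplotypes)."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # a monomorphic site carries no correlation
    return (p_ij - p_i * p_j) ** 2 / denom


def greedy_ld_bins(haplotypes, threshold=0.8):
    """Group SNPs into bins; returns a list of (tag_snp, members)."""
    remaining = set(range(len(haplotypes[0])))
    bins = []
    while remaining:
        partners = {
            i: {j for j in remaining
                if j != i and r_squared(haplotypes, i, j) >= threshold}
            for i in remaining
        }
        # SNP with the most partners; ties go to the lowest index
        tag = max(sorted(remaining), key=lambda i: len(partners[i]))
        members = {tag} | partners[tag]
        bins.append((tag, sorted(members)))
        remaining -= members
    return bins


# Toy phased haplotypes: SNPs 0 and 1 are correlated, as are SNPs 2 and 3
haplotypes = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
print(greedy_ld_bins(haplotypes))
```

Unlike haplotype blocks, bins built this way need not be contiguous along the chromosome, since membership depends only on pairwise correlation.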
Extended LD bin and haplotype block structure
around the CFTR gene
Reprinted from: Hinds et al., Science. 307: 1072-1079 (2005)
Conclusion
¤ The 1.5 million SNPs capture
– most common human genetic variation as a result of linkage
disequilibrium
• strong correlation among common SNP alleles that define haplotypes
¤ Strong correlation between
– extended regions of linkage disequilibrium
– functional genomic elements
¤ First generation haplotype map provides a tool for
– exploring the causal role of common human DNA variation in
complex human traits
– investigating the nature of genetic variation within and between
human populations.
Reprinted from: Hinds et al., Science. 307: 1072-1079 (2005)
A haplotype map of the human genome
The International HapMap Consortium, Nature 437, 1299 (2005)
¤ Paper presents
– A map of >1 million SNPs for which accurate and complete
genotypes have been obtained in 269 DNA samples from four
populations
– The data document the generality of
• recombination hotspots
• a block-like structure of linkage disequilibrium
• low haplotype diversity
• substantial correlations of SNPs with many of their neighbours
Number of SNPs in dbSNP over time
¤ Public database dbSNP
(http://www.ncbi.nlm.nih.gov/SNP/)
– October 2005: 10.4 million RefSNP clusters
– 4.8 million validated SNPs
The International HapMap Consortium, Nature 437, 1299 (2005)
Genealogical relationships among haplotypes
The International HapMap Consortium, Nature 437, 1299 (2005)
Length of LD spans
The International HapMap Consortium, Nature 437, 1299 (2005)
Conclusions
¤ The phase I haplotype map documents the generality
of
– block-like structure of linkage disequilibrium
– low haplotype diversity
– recombination hotspots
– substantial correlations of SNPs with many of their neighbours
¤ Important applications of the HapMap data are to
– make possible comprehensive, genome-wide association studies
– identify the root causes of common diseases
The International HapMap Consortium, Nature 437, 1299 (2005)