PANGENOME - Department of Human Molecular Genetics

Chromosome
- the most informative
molecule of a cell
- and the most variable?
Genome organisation and
evolution
PAN-GENOME concept: bacteria and
human models
Jan Sadowski
Department of Biotechnology
Institute of Molecular Biology and Biotechnology
Faculty of Biology, Adam Mickiewicz University
Genome projects of massively parallel sequencing
(re-sequencing)
• The 1000 Genomes Project
(www.1000genomes.org)
• The 1001 Genomes Project – 1001 lines of Arabidopsis
thaliana (http://1001genomes.org)
• The Drosophila 1000 genomes – 1000 individuals/lines
from Africa and Eurasia
• Genome 10K – 10 000 species of vertebrates
(www.genome10k.org)
• Numerous projects for prokaryotic organisms
PAN-GENOME definition
Tettelin et al. (2005) Proc Natl Acad Sci USA 102:13950-13955
(pan = whole)
„the full complement of genes within a bacterial
species”
Wide definition of GENOME
(as PANGENOME)
„complement of genes at selected taxonomic level”
species-based PANGENOME
genus-based PANGENOME
family-based PANGENOME
total/universal PANGENOME
lineage-specific gene set (=unique)
Beginning of „pangenome” studies
Comparative analysis of 8 bacterial strains of Streptococcus agalactiae
(Tettelin et al., 2005)
Pangenome is composed of:
„a core genome”
containing genes present in all strains
„a dispensable genome”
composed of genes absent from one or more strains
and genes that represent part of genome called
„unique genome”
specific to an individual strain
„Surprisingly, unique genes were still detected after eight genomes were sequenced, and
mathematical extrapolation predicts that new genes will still be found after sequencing
many more strains. Thus, the genomes of multiple, independent isolates are required to
understand the global complexity of bacterial species.”
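The core / dispensable / unique split defined above can be illustrated with a minimal Python sketch. It assumes each strain is already reduced to a set of gene (family) identifiers; the function name `partition_pangenome`, the strain names and the gene labels are hypothetical toy values, not data from Tettelin et al.

```python
from collections import Counter


def partition_pangenome(strain_genes):
    """Split the genes of several strains into pan, core, dispensable and unique sets.

    strain_genes maps a strain name to the set of gene identifiers found in it.
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene occurs.
    counts = Counter(g for genes in strain_genes.values() for g in genes)
    pan = set(counts)                                        # every gene seen in any strain
    core = {g for g, c in counts.items() if c == n_strains}  # present in all strains
    dispensable = pan - core                                 # absent from one or more strains
    unique = {g for g, c in counts.items() if c == 1}        # specific to a single strain
    return {"pan": pan, "core": core, "dispensable": dispensable, "unique": unique}


# Hypothetical toy data: three strains, six gene families.
toy_strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g3", "g6"},
}

parts = partition_pangenome(toy_strains)
for name in ("core", "dispensable", "unique"):
    print(name, sorted(parts[name]))
```

In this sketch the unique genes are treated as a subset of the dispensable genome, which matches the wording of the slide. In practice the hard step is deciding which genes in different strains count as "the same" gene; that clustering step is outside the scope of this example.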
Wide pangenome of bacteria
Gene complement analysis
Lapierre and Gogarten (2009) Trends Genet.
Bacterial pangenome for 293 species
Core genes
~250 gene families (translation, replication and energy
homeostasis)
Character genes
~7900 gene families (colonization, survival in special
environmental niches / symbiosis, photosynthesis)
Accessory genes
a gene set of indefinite number and unknown functions
(„serves” to distinguish lines and serotypes)
Mechanisms and evolutionary constraints
Three categories of genes that compose each genome have been identified: the extended
core, the character genes and the accessory pool of genes.
Genes in these categories evolve under different constraints and rules. The extended core is under
high selective pressure, and only minute changes at the sequence level are allowed.
Although many instances of gene transfer have been documented, these genes mainly spread
in populations through vertical inheritance.
Gene duplication and domain shuffling are the preferred mode of evolution of
the character genes. This set of genes enables organisms to quickly adapt to changing
conditions and to exploit new niches. Of the three sets of genes, the character genes are
the most likely to be transferred between organisms.
The last category of genes, the accessory gene pool, consists of genes with low levels of
conservation, which are scattered at low frequencies throughout the bacterial domain.
This accessory pool of genes might represent in part genes that had previous functions
in genomes but that are now stripped of selective pressure (now pseudogenes). These
fast evolving genes, perhaps residing in phage genomes most of the time, explore
sequence spaces and, occasionally, a new useful protein fold might arise from this pool
and spread through populations.
Species                    Tested strains   Relevant features
Helicobacter pylori        15               56% of strain-specific genes are “ORFans”
Escherichia coli O157:H7   31               Even within a single serotype, 1751 ORFs were variable
Bartonella henselae        11               Genomic islands mediate genomic rearrangements
Streptococcus mutans       9                Accessory genome = 20%; half shows signs of HGT
Campylobacter jejuni       11               Largest fraction of accessory genes (19%) related to cell envelope
Salmonella enterica        25               Core genome was only 54%
Bacillus anthracis         19               Variation in strains ranges 8-34% of reference genome
Vibrio cholerae            9                Core genome was 97%
S. agalactiae              19               Extensive variation recently confirmed by sequencing
E. coli - Shigella         22               E. coli backbone estimated at 2,800 ORFs
S. pneumoniae              20               Variability within strains < 2.1%. Overall variability < 10%
Francisella tularensis     27               Regions specific to highly virulent strains were identified
Bacterial genomes – undefined and/or unlimited?
Non-asymptotic curves were obtained
Mira et al. (2010) Int. Microbiol. vol. 13
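What a "non-asymptotic curve" means here can be made concrete with a small accumulation-curve sketch in Python: genomes are added one at a time, and the running pan-genome size (union) and core-genome size (intersection) are recorded and averaged over every order of addition. This is only an illustration under stated assumptions, not the extrapolation method of the cited papers; the function name `accumulation_curves` and the toy strains are hypothetical.

```python
from itertools import permutations


def accumulation_curves(strain_genes):
    """Average pan- and core-genome sizes after adding 1, 2, ..., n genomes.

    Averages over all orders of addition, so it is only practical for a
    handful of strains; larger data sets would need random sampling of orders.
    """
    names = list(strain_genes)
    n = len(names)
    pan_totals = [0] * n
    core_totals = [0] * n
    n_orders = 0
    for order in permutations(names):
        pan, core = set(), None
        for i, name in enumerate(order):
            genes = strain_genes[name]
            pan |= genes                                        # union grows or stays flat
            core = set(genes) if core is None else core & genes  # intersection shrinks or stays flat
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
        n_orders += 1
    pan_curve = [t / n_orders for t in pan_totals]
    core_curve = [t / n_orders for t in core_totals]
    return pan_curve, core_curve


# Hypothetical toy data: three strains, six gene families.
curve_strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g3", "g6"},
}

pan_curve, core_curve = accumulation_curves(curve_strains)
print("pan-genome size per number of genomes: ", pan_curve)
print("core-genome size per number of genomes:", core_curve)
```

If the pan-genome curve keeps rising as genomes are added instead of levelling off, the gene repertoire of the species has not been exhausted by the sampled strains, which is the situation raised by the question "undefined and/or unlimited?".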
Gene complement-based PANGENOME
in Eukaryota
HUMANS
PANGENOME of humans
Population-specific or individual-specific DNA sequences (genes)
contributing to human genetic variation, that is, the nonredundant
collection of all human DNA sequences (genes) present in the
entire human population
Objects of the study
The Asian and African complete individual genome sequences were
assembled de novo and compared to the NCBI reference human
genome. Findings showed that human genomes contain a large
amount of novel sequence that is both
population- and individual-specific
Additional analyses made it possible to investigate the amount of
sequence variation that is expected to exist between any two
individuals as well as obtain information about the presence of
potentially functional genetic elements within these novel sequences.
Genome re-sequencing project
Humans
Li et al. (2010) Nature Biotechnology 28:57-62
Characterization of novel sequences
General
The length of individual-specific sequence between a random pair of human
individuals would range between 1.8 Mb and 4 Mb; including the
composition differences from SNPs, it would range from 4.2 Mb to 8.0 Mb.
Estimating the size of the pan-genome (calculated for 6 billion people), we should
include an additional 19-40 Mb of novel sequence over the reference genome
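The ranges above can be put in proportion with a few lines of Python arithmetic. Two assumptions are made that are not on the slide: the lower bounds of the two pairwise ranges are paired with each other (and likewise the upper bounds) to get the implied SNP contribution, and the reference genome is taken as roughly 3,100 Mb. All names in the sketch are illustrative.

```python
# Ranges quoted on the slide, in megabases (Mb).
NOVEL_PAIR_MB = (1.8, 4.0)      # individual-specific sequence between two individuals
WITH_SNP_PAIR_MB = (4.2, 8.0)   # same, including composition differences from SNPs
PAN_EXTRA_MB = (19.0, 40.0)     # novel sequence added to the reference by the pan-genome

# Assumption (not from the slide): approximate size of the human reference genome.
REFERENCE_MB = 3100.0


def implied_snp_contribution(novel, with_snp):
    """Difference between the two ranges, pairing lower with lower and upper with upper."""
    return tuple(round(w - n, 1) for n, w in zip(novel, with_snp))


def percent_of_reference(extra, reference_mb=REFERENCE_MB):
    """Express an amount of sequence (Mb) as a percentage of the reference genome."""
    return tuple(round(100 * e / reference_mb, 2) for e in extra)


snp_mb = implied_snp_contribution(NOVEL_PAIR_MB, WITH_SNP_PAIR_MB)
extra_pct = percent_of_reference(PAN_EXTRA_MB)
print("implied SNP contribution (Mb):", snp_mb)
print("pan-genome novel sequence (% of reference):", extra_pct)
```

Under these assumptions, the novel sequence predicted for the whole population amounts to roughly one percent of the reference genome or less.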
Number and functions of genes identified in novel sequences
Asian (YH) – 72 novel genes
African (NA18507) – 69 novel genes
30% - members of highly variable gene families (mucins 2, major
histocompatibility complex HLA)
50-60% - unknown functions
Over 100 human genomes will saturate their pan-genome
Wu et al. (2014) Human Genetics
Concluding remarks
Full genomic sequences let us appreciate many forces acting
on genome evolution.
The earlier view of genomes as very stable sequence-storing
structures gave way to a dynamic view in which genomes gain
and lose genes along the way.
This constant invasion of exogenous genetic material into
genomes - from a cloud of frequently transferred genes - enhances
the chance of survival of species by introducing variability in the
population.