Genome Biology and Biotechnology
2. The genome structures of invertebrates
Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent
International course 2005
Sequenced genomes of invertebrates
¤ Nematodes
– Caenorhabditis elegans (1998)
– Caenorhabditis briggsae (2003)
¤ Insects
– Drosophila melanogaster – fruit fly (2000)
– Drosophila pseudoobscura – fruit fly (2005)
– Anopheles gambiae – mosquito (2002)
– Bombyx mori – silkworm (2004)
¤ Tunicates: ancestral vertebrate genome
– Ciona intestinalis (2002)
Phylogeny of the invertebrates
[Figure: phylogenetic tree with divergence times marked at ~800 MY, >1000 MY and 550 MY]
Genome Sequence of the Nematode C. elegans
The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
¤ Paper presents
– The first complete genome sequence of a multicellular organism
• The initial sequence covered 97 Mbp (6 gaps)
• The complete sequence (June 2003) comprises 100.2 Mbp without gaps
Protein coding Genes
¤ First large-scale genome sequence annotation
– Gene structure predictions were based on EST and protein
similarities
• Only 40% of the predicted genes had a confirming EST match
¤ The first annotation predicted 19,099 genes
– An average density of 1 predicted gene per 5 kb
– 27% of the genome resides in predicted exons
– Each gene has an average of five introns
– WormBase: updated and manually curated gene set
• Currently contains 18,808 genes
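The density figure above follows from the genome size and the gene count. A minimal sketch of that arithmetic, assuming the initial 97 Mbp sequence and the 19,099 predicted genes quoted on these slides (the function and variable names are illustrative, not taken from the paper):

```python
# Figures taken from the slides; names are illustrative assumptions.
GENOME_BP = 97_000_000     # initial C. elegans sequence (~97 Mbp)
PREDICTED_GENES = 19_099   # first annotation
EXON_FRACTION = 0.27       # share of the genome in predicted exons


def kb_per_gene(genome_bp: int, n_genes: int) -> float:
    """Average genome span per predicted gene, in kilobases."""
    return genome_bp / n_genes / 1_000


def exon_bp_per_gene(genome_bp: int, n_genes: int, exon_fraction: float) -> float:
    """Average amount of predicted exon sequence per gene, in base pairs."""
    return genome_bp * exon_fraction / n_genes


density = kb_per_gene(GENOME_BP, PREDICTED_GENES)
exon_bp = exon_bp_per_gene(GENOME_BP, PREDICTED_GENES, EXON_FRACTION)
print(f"about 1 gene per {density:.1f} kb")
print(f"about {exon_bp:.0f} bp of predicted exon per gene")
```

Both results are rough averages; they only show that the quoted numbers are mutually consistent.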
Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
RNA genes and repetitive sequences
¤ RNA genes
– rRNA genes: occur in long tandem arrays
– tRNA genes: 659 tRNA genes occur widely dispersed
– Noncoding RNA genes: in dispersed multigene families
– Micro RNA genes (miRNA)
• ~100 identified to date
¤ Repetitive Sequences
– Dispersed repeat sequences
• Most of them are associated with C. elegans transposons that are
probably no longer active in the genome
– Local repeat sequences
• Tandem, inverted, or simple sequence repeats
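To make the notion of a tandem repeat concrete, the sketch below scans a sequence for runs of back-to-back copies of a short unit. It is only an illustration: the toy sequence is invented, TTAGGC is used as an example unit, and the function name `tandem_runs` is hypothetical. It is not the repeat-finding method used in the paper.

```python
import re


def tandem_runs(seq: str, unit: str, min_copies: int = 2):
    """Return (start, copies) for each run of >= min_copies adjacent copies of unit."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_copies},}}")
    return [(m.start(), len(m.group()) // len(unit)) for m in pattern.finditer(seq)]


# Invented toy sequence: a 4-copy run and a 2-copy run of TTAGGC.
toy = "ACGT" + "TTAGGC" * 4 + "GATTACA" + "TTAGGC" * 2
print(tandem_runs(toy, "TTAGGC"))
```

Real repeat annotation also has to tolerate mismatches and indels between copies, which this exact-match sketch ignores.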
Chromosome Structure and Organization
¤ The genome structure is remarkably uniform
– Gene density is fairly constant across the chromosomes
– No localized centromeres
• The chromosomes are holocentric, in contrast to most other eukaryotes
¤ Differences between the central portion and the arms
of the chromosomes
– The conserved eukaryotic genes are in the central portion
– Repetitive DNA is more prevalent in the arms
– Meiotic recombination is much higher on the chromosome arms
– This suggests that DNA in the arms might be evolving more rapidly
than in the central regions
Distribution of sequence elements on Chromosome I
[Figure: tracks along chromosome I (arm, central part, arm) showing TTAGGC repeats, tandem repeats, inverted repeats, yeast similarities, EST matches and predicted genes]
Conclusions
¤ The complete sequence of the C. elegans genome has
– provided a basis for the discovery of all the genes of a
multicellular eukaryotic organism
• First inventory of eukaryotic genes
¤ C. elegans is a very effective model organism for
– eukaryotic gene analysis: widely used for functional genomics
– human disease gene research
– nematode pest control research
The Genome Sequence of Caenorhabditis briggsae:
A Platform for Comparative Genomics
Stein et al., PLoS Biol 1: 166-192 (2003)
¤ Paper presents
– high-quality draft (> 10-fold coverage) sequence of C. briggsae
– Comparative genome analysis of C. briggsae and C. elegans
• The two species diverged ~ 100 million years ago
• morphologically indistinguishable
• same chromosome number (6: five autosomes and X) and similar genome size (104 and 100 Mb)
– Comparison of the genomes of related species allows
• More precise annotation of protein-coding genes
• Discovery of noncoding genes, regulatory sequences and “unknown”
functional elements
Collinearity of the C. briggsae and C. elegans Genomes
¤ Alignment of sequences
– ~80% collinearity
• inversions and translocations
– blocks of synteny
• orthologous genes
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Annotation of Protein-Coding Genes
¤ Concordance of gene predictions refines gene models
– C. elegans gene annotation improvement
• >6,000 (30%) genes with exon additions, deletions or alterations
• 1,300 new genes
• 18,808 protein-coding genes in C. elegans
• 19,507 protein-coding genes in C. briggsae
Comparison of Protein-Coding Genes
¤ ~65% are orthologs in C. briggsae/C. elegans
– gene pairs with a one-to-one correspondence in the two species
• have a common ancestor
• have similar gene and coding sequence lengths
• show ~80% identity at the protein level
¤ ~25% are paralogs in C. briggsae/C. elegans
– proteins with multiple BLASTP matches in the other species
• Evolving gene families
¤ ~5% are orphans in C. briggsae/C. elegans
– proteins that have no BLASTP matches in the other species
• 807 genes in C. elegans and 1,061 in C. briggsae
• Novel genes or pseudogenes?
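The three classes above can be read as a simple decision rule on match counts. The sketch below is a simplified illustration of that rule, assuming the BLASTP results have already been reduced to two lookup tables. The tables, gene names and the function `classify` are hypothetical, and the actual pipeline in the paper is more involved.

```python
def classify(gene, hits_ab, hits_ba):
    """Classify one species-A gene from match tables.

    hits_ab maps each A gene to the set of B genes it matches;
    hits_ba is the same table in the other direction.
    """
    partners = hits_ab.get(gene, set())
    if not partners:
        return "orphan"          # no match in the other species
    if len(partners) == 1:
        (partner,) = partners
        if hits_ba.get(partner, set()) == {gene}:
            return "ortholog"    # one-to-one correspondence
    return "paralog"             # multiple matches in either direction


# Toy match tables (hypothetical gene names, not real data).
hits_ab = {"a1": {"b1"}, "a2": {"b2", "b3"}, "a3": set()}
hits_ba = {"b1": {"a1"}, "b2": {"a2"}, "b3": {"a2"}}

for g in ("a1", "a2", "a3"):
    print(g, classify(g, hits_ab, hits_ba))
```

Requiring the match to be unique in both directions is what separates a one-to-one ortholog pair from a member of an expanding gene family.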
Conservation of Operon Structure
¤ C. elegans is unusual among animals in having operons
– co-transcribed genes that make a polycistronic pre-mRNA
• subsequently separated into single-gene mRNAs by trans-splicing
– ~15% of C. elegans genes are encoded in ~1000 operons
• contain 2–8 genes
– 96% of the operons are preserved intact in C. briggsae genome
¤ C. elegans operons comprise
– co-regulated genes encoding proteins with related functions
– specific functional classes of genes
• Transcription
• RNA splicing
• Translation
• RNA degradation
Repetitive sequences
¤ The different genome sizes result from
– Differences in repeat content
• 23.3 Mbp of the C. briggsae genome (104 Mbp)
• 16.5 Mbp of the C. elegans genome (100.3 Mbp)
¤ Repeated DNA families
– comprise DNA transposons or tandem arrays
– Not orthologous between the two genomes
• suggests that most repeat elements in the two genomes postdate
the divergence of the two species
– Accumulation of new repetitive elements is balanced by
deletions so that
• genome sizes remain similar
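The repeat figures above can be turned into percentages and non-repeat sizes with a few lines of arithmetic. This is only a sketch using the numbers quoted on the slide; the dictionary layout and function names are illustrative.

```python
# Sizes in Mbp, as quoted on the slide; names are illustrative.
genomes = {
    "C. briggsae": {"total": 104.0, "repeats": 23.3},
    "C. elegans": {"total": 100.3, "repeats": 16.5},
}


def repeat_percent(genome: dict) -> float:
    """Repeat content as a percentage of total genome size."""
    return 100 * genome["repeats"] / genome["total"]


def non_repeat_mbp(genome: dict) -> float:
    """Genome size left after subtracting repeats, in Mbp."""
    return genome["total"] - genome["repeats"]


for name, genome in genomes.items():
    print(f"{name}: {repeat_percent(genome):.1f}% repeats, "
          f"{non_repeat_mbp(genome):.1f} Mbp non-repeat")
```

With these inputs the repeat difference (6.8 Mbp) is larger than the difference in total size (3.7 Mbp), which is at least consistent with repeat accumulation being partly offset by deletions elsewhere.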
Chromosome Structure and Organization
¤ The centers contain orthologous (1) and essential genes (2)
– Very long synteny blocks
¤ The arms contain orphan genes (3) and repetitive elements (4)
– Short synteny blocks
– The arms of the chromosomes are evolving more rapidly than the
centers
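[Figure: chromosomal distributions of (1) orthologous genes, (2) essential genes, (3) orphan genes and (4) repetitive elements]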
Conclusions
¤ C. briggsae/C. elegans comparison shows that
– despite large differences at the genomic level, C. briggsae and C.
elegans are morphologically almost indistinguishable
– Many protein families are very dynamic
• ~200 families have expanded or contracted by > 2-fold
• several hundred families are either novel or have diverged
extensively
– The two species share only ~50% of their non-coding sequence
¤ Sequencing of additional species is necessary to
– identify candidate cis-regulatory elements based on sequence
conservation
• the noise level in a two-way comparison is too high
The Genome Sequence of Drosophila melanogaster
¤ Draft sequence – (2000)
– Whole-genome shotgun sequencing
• Sequence contained 128 physical gaps and 1630 sequence gaps
– Some regions were of poor sequence quality
– Demonstrated that whole-genome shotgun sequencing can be
used for large eukaryotic genomes
• Adams et al., Science, 287, 2185 (2000)
¤ Finished sequence – (2002)
– BAC clone sequencing and gap filling
– Sequence contains 7 physical gaps and 37 sequence gaps
– Very accurate sequence: error rate of < 1/100,000
• Celniker et al., Genome Biol. 3: research 0079.1–0079.14 (2002)
The Drosophila Genome
¤ The (female) Drosophila genome is ~176 Mb in size
– Euchromatic part: 117 Mb completely sequenced
– heterochromatic part: partly (~20Mb) sequenced (unassembled)
• Female: estimated at ~59 Mb
• Male: the 40Mb Y chromosome is completely heterochromatic
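The size estimates above, together with the error rate quoted for the finished sequence, can be cross-checked with a short calculation. The sketch below uses only the slide figures; the constant names are illustrative, and the error count is an upper bound implied by "< 1/100,000", not a measured value.

```python
# Figures from the slides; names are illustrative assumptions.
EUCHROMATIN_MB = 117             # completely sequenced
HETEROCHROMATIN_FEMALE_MB = 59   # estimate for the female genome
ERROR_RATE_DENOMINATOR = 100_000 # "fewer than 1 error per 100,000 bases"

total_female_mb = EUCHROMATIN_MB + HETEROCHROMATIN_FEMALE_MB
heterochromatin_share = HETEROCHROMATIN_FEMALE_MB / total_female_mb
max_expected_errors = EUCHROMATIN_MB * 1_000_000 // ERROR_RATE_DENOMINATOR

print(f"female genome: ~{total_female_mb} Mb")
print(f"heterochromatin share: {heterochromatin_share:.1%}")
print(f"upper bound on errors in the euchromatic sequence: {max_expected_errors}")
```

So roughly a third of the female genome is heterochromatic, and the finished euchromatic sequence should contain at most on the order of a thousand base errors.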
Reprinted from: Adams et al., Science, 287, 2185 (2000)
Euchromatin and Heterochromatin
¤ Euchromatin
– Gene rich portion of the genome
– Condenses during mitosis and decondenses thereafter
– Portion of the genome that can be cloned stably in BACs
¤ Heterochromatin
– Consists mainly of simple sequence repeats (satellite DNAs),
transposable elements, and tandem arrays of rRNA genes
– Remains condensed after mitosis
– Gene poor portion of the genome
– Contains elements required for centromere function
¤ Euchromatin - heterochromatin transition
– is gradual at the molecular level
Euchromatic Genome Sequence
[Figure: map of the euchromatic genome sequence showing transposons and the centromere]
Reprinted from: Celniker et al., Genome Biol. 3: research 0079.1–0079.14 (2002)
Gene Content of the Drosophila Genome
¤ Annotation of the draft genome sequence
– Predicted 13,601 genes
• >10,000 genes (>75%) supported by EST and protein matches
• This annotation was incomplete
– Large number of sequence gaps and sequencing errors
¤ Annotation of the finished genome sequence
– Predicted a similar number of genes: 13,676
• Majority (85%) of the gene models revised
– Improved using a collection of 250,000 ESTs and full-length cDNAs
– Found only 17 pseudogenes (far fewer than in C. elegans)
– Heterochromatic part may contain ~500 genes
• The 20 Mb sequenced contains ~300 protein-coding genes
– Reannotation reveals many complex gene models
• genes that do not fit the simple 5’UTR – exons – 3’UTR model
Complex Gene models
¤ Alternative splicing or alternative polyadenylation
– At least ~20% of genes have >1 predicted transcript
• 65% encode two or more protein products
• 35% differ in the UTRs - most have different 5’UTRs: alternative
promoters
Reprinted from: Misra et al., Genome Biology, 3: research 0083.1-0083.22 (2002)
Complex Gene models
¤ Dicistronic genes: 2 non-overlapping coding regions on one mRNA
– 31 dicistronic gene pairs were found; this is probably an underestimate
Complex Gene models
¤ Overlapping genes
– overlap of mRNAs on opposite strands: 15% of the genes
¤ Nested genes
– genes included within introns of other genes: 15% of the genes
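Both categories above reduce to interval tests on genome coordinates. The sketch below illustrates those tests under simple assumptions: coordinates are 1-based and inclusive, and the gene records, intron list and function names are hypothetical examples rather than real annotation data.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"


def overlap_on_opposite_strands(a: Gene, b: Gene) -> bool:
    """True if the two transcripts share a position and lie on opposite strands."""
    return a.strand != b.strand and a.start <= b.end and b.start <= a.end


def nested_in_intron(inner: Gene, host_introns) -> bool:
    """True if `inner` lies entirely inside one (start, end) intron of the host."""
    return any(s <= inner.start and inner.end <= e for s, e in host_introns)


host = Gene("host", 100, 5000, "+")
host_introns = [(400, 2500), (3000, 4200)]   # hypothetical intron coordinates
antisense = Gene("antisense", 4800, 6000, "-")
nested = Gene("nested", 900, 1800, "-")

print(overlap_on_opposite_strands(host, antisense))
print(nested_in_intron(nested, host_introns))
print(nested_in_intron(antisense, host_introns))
```

A nested gene also overlaps its host's primary transcript, so the two categories are not mutually exclusive when computed this way.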
Conclusions
¤ The Drosophila genome sequence reveals
– genes and proteins common to all multicellular organisms
• proteins involved in transcription control and metabolism are very
similar to their human counterparts
¤ Drosophila provides an experimental platform for
– the study of human disease genes involved in
• DNA replication and repair
• Metabolism of drugs and toxins
Comparative genome sequencing of Drosophila pseudoobscura:
Chromosomal, gene, and cis-element evolution
Richards et al., Genome Res. 15: 1-18 (2005)
¤ Paper presents
– High quality draft genome sequence of a second Drosophila
species Drosophila pseudoobscura
– Comparison with the genome sequence of D. melanogaster
• Evolutionary distance is well suited to study
– Conserved and diverged genes
– Conserved regulatory elements
– Mechanisms of genome rearrangement
The D. pseudoobscura genome
¤ The euchromatic part is estimated at 131 Mb
– ~17% larger than that of D. melanogaster
– the additional sequence is
• primarily found in the intergenic regions
• only partly caused by expansion of repeated DNA
¤ The two species show very high gene synteny
– Synteny blocks were identified
• on the basis of conservation of protein order
• ~10,500/14,000 genes are true orthologs
– All synteny blocks are short and extremely mixed
• extensive genome rearrangement in the two Drosophila lineages
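Identifying synteny blocks from conserved gene order can be sketched as follows: walk along one genome and group consecutive genes whose orthologs are also adjacent, in a consistent direction, in the other genome. This is a deliberately simplified illustration with an invented input; the function name `synteny_blocks` is hypothetical and the paper's procedure is more elaborate.

```python
def synteny_blocks(b_ranks):
    """Group genes of genome A into blocks of conserved order.

    b_ranks[i] is the rank in genome B of the ortholog of the i-th gene
    in genome A. A block is extended while B ranks change by +1 (same
    orientation) or by -1 (inverted block), consistently within the block.
    """
    if not b_ranks:
        return []
    blocks, current, direction = [], [0], 0
    for i in range(1, len(b_ranks)):
        step = b_ranks[i] - b_ranks[i - 1]
        if step in (1, -1) and direction in (0, step):
            current.append(i)
            direction = step
        else:
            blocks.append(current)
            current, direction = [i], 0
    blocks.append(current)
    return blocks


# Invented example: a collinear block, an inverted block, a short block.
print(synteny_blocks([0, 1, 2, 7, 6, 5, 3, 4]))
```

Short, shuffled blocks in the output correspond to the "short and extremely mixed" pattern described above, whereas a largely unrearranged pair of genomes would give a few long blocks.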
Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
The synteny between D. pseudoobscura and D. melanogaster
¤ The great majority of syntenic blocks are found
– on the same chromosome arms in the two species
– Chromosomal rearrangements in the two species
• Almost exclusively paracentric inversions
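A paracentric inversion reverses a segment that lies entirely within one chromosome arm, so it does not span the centromere. The toy model below shows why such inversions scramble gene order while leaving every gene on its original arm; the gene labels and the function name are illustrative only.

```python
def paracentric_inversion(arm, start, end):
    """Return a copy of one chromosome arm with arm[start:end] reversed."""
    return arm[:start] + arm[start:end][::-1] + arm[end:]


left_arm = ["a", "b", "c", "d", "e"]   # toy gene order on one arm
inverted = paracentric_inversion(left_arm, 1, 4)

print(inverted)
# Gene content of the arm is unchanged; only the order differs.
print(sorted(inverted) == sorted(left_arm))
```

Repeated rounds of such inversions would break long collinear stretches into short blocks without moving genes between arms.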
Intraspecific inversion breakpoints
¤ Repetitive sequences at the inversion breakpoints
– Frequently comprise a breakpoint motif
– Only found in D. pseudoobscura
Conservation of gene segments
¤ Sequence conservation in noncoding regions
– Is insufficient for the identification of regulatory sequences
– Multiple genome sequence alignments will be needed
The Genome Sequence of the Malaria Mosquito Anopheles gambiae
Sequence: Holt et al., Science. 298: 129-149 (2002)
Comparison: Zdobnov et al., Science, 298, 149 (2002)
¤ The papers present
– Draft genome sequence of the PEST strain of A. gambiae
– A comparison of the genomes and proteomes of Anopheles and
Drosophila
• Two very different Diptera that diverged ~250 MY ago
The Mosquito Genome Sequence
¤ The draft genome spans 278 Mb
– Covers the entire genome including the heterochromatic DNA
– Mosquitoes have larger genomes than Drosophila
• estimates from 250 to 500 Mb
• Transposable elements constitute ~16% of the genome
– Drosophila experienced a recent genome size reduction
¤ The predicted number of genes is ~14,000
– Very similar to Drosophila
¤ The comparison of the Anopheles and Drosophila
genomes and proteomes reveals
– considerable similarities and numerous differences
– Reflects selection and adaptation to different ecologies and life
strategies
Reprinted from: Holt et al., Science. 298: 129-149 (2002)
Similarity at the protein level
¤ Identified 4 protein classes
– True orthologs: ~45% (~6,000)
• Exhibit 1:1 relationship
• Genes with conserved function
– Paralogs: ~12%
• Duplicated genes
– Homologs: ~25%
• Unclear relationship
– Orphans: 11% to 18%
• New genes
• Rapidly evolving genes
Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
The core of conserved proteins
¤ Dynamics of Gene Structure in a span of 250MY
– Exon lengths and intron frequencies are similar
– Introns in Drosophila are half as long as those in Anopheles
• systematic reduction of noncoding regions in Drosophila
– Only 50% of the introns are perfectly conserved
• one intron gain or loss per gene per 125 MY
– Intron sequences diverge rapidly
• sequence similarity in <2% of the equivalent introns
Family expansions and reductions
¤ Increases and decreases in protein families
– Related to adaptations to life strategies and environment
¤ Expansions or reductions are
– Uneven: a single gene in one species has many paralogs in the other
– More frequent in Anopheles
– Examples:
• Cuticular proteins
• Innate immunity genes
– FBN-like (fibrinogen) proteins massively expanded in Anopheles
Genome Rearrangements
¤ Microsynteny
– 34% of the orthologs map to ~1000 microsynteny blocks
• 2-3 genes per block (cf. fish-human)
¤ Macrosynteny
– Both species have five major chromosomal arms
– Clear 1:1 homologies between the chromosomal arms
• Inversions much more frequent than translocations
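The microsynteny figures can be checked against the ortholog count given earlier for the Anopheles/Drosophila comparison. A back-of-the-envelope sketch, assuming ~6,000 true orthologs, 34% of them in blocks and ~1000 blocks (all rounded slide values; the constant names are illustrative):

```python
# Rounded values from the slides; names are illustrative assumptions.
ORTHOLOGS = 6_000          # ~6,000 true orthologs
PERCENT_IN_BLOCKS = 34     # share of orthologs in microsynteny blocks
BLOCKS = 1_000             # ~1000 microsynteny blocks

genes_in_blocks = ORTHOLOGS * PERCENT_IN_BLOCKS // 100
genes_per_block = genes_in_blocks / BLOCKS

print(f"~{genes_in_blocks} orthologs in blocks, "
      f"~{genes_per_block:.1f} genes per block")
```

The result sits at the lower end of the quoted 2-3 genes per block, which is what one would expect given how heavily the inputs are rounded.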
The Draft Genome of Ciona intestinalis:
Insights into chordate and vertebrate origins
Dehal et al., Science, 298, 2157-2167 (2002)
¤ Paper presents
– Draft genome sequence of Ciona intestinalis, an ancestral chordate
– Chordates appear in the fossil record at the Cambrian explosion
• ~ 550 million years ago
Ciona intestinalis
¤ Sessile, hermaphroditic marine invertebrates
¤ Adults are simple filter feeders
– Encased in a fibrous tunic
[Figure: adult, and juvenile showing the internal structures: ds, digestive system; es, endostyle; ht, heart; os, neuronal complex; pg, pharyngeal gill]
Reprinted from: Dehal et al., Science, 298, 2157-2167 (2002)
Gene content and global comparisons
¤ Predicted ~16,000 gene models
– 75% of the predicted genes are supported by EST evidence
– Genes are compact and densely packed: one gene per 7.5 kb
¤ Global comparisons
– 60% of the genes have a detectable fly or worm homolog
– 20% of the genes have no clear homolog
• tunicate-specific genes
– 17% of the genes have a vertebrate homolog but no detectable
fly or worm homolog
• Many are single-copy genes for the vertebrate gene families
– signalling and regulatory processes in development
– The gene content is a reasonable approximation of the ancestral
chordate
Future Perspectives
¤ Invertebrate genomes are sequenced at a rapid pace
– Worms: 10 species of medical and agricultural importance
• Schistosoma, Ancylostoma, Ascaris, Globodera, Meloidogyne
– Insects: ~20 species of primarily agricultural importance
• Mosquitoes, honey bee, Lepidoptera and > 10 Drosophila species
– Protozoa: several species of medical importance
• Trypanosoma, Theileria, Plasmodium, Leishmania,…
– Broad range of species
• Sponge, sea urchin, Daphnia, Hydra, snail, lamprey,…
¤ Source: GOLD™ Genomes OnLine Database
– http://www.genomesonline.org/
Recommended reading
¤ The nematode genome sequence
• The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
¤ The Drosophila genome sequence
• Adams et al., Science, 287, 2185 (2000)
Further reading
¤ Nematode genomes
– C. briggsae:
• Stein et al., PLoS Biol 1: 166-192 (2003)
¤ Insect genomes
– Finished Drosophila genome sequence:
• Celniker et al., Genome Biol. 3: research 0079.1–0079.14 (2002)
– Annotation of the Drosophila genome:
• Misra et al., Genome Biology, 3: research 0083.1-0083.22 (2002)
– Draft Drosophila pseudoobscura genome sequence:
• Richards et al., Genome Res. 15: 1-18 (2005)
– Draft mosquito genome sequence:
• Holt et al., Science. 298: 129-149 (2002)
• Zdobnov et al., Science, 298, 149 (2002)
¤ Ciona genome
• Dehal et al., Science, 298, 2157-2167 (2002)