IB404 - 6. Other Fungi – February 6
Comparative genomics. Comparing the DNA sequences from several
species makes it possible to eliminate spurious gene predictions,
sometimes find new ones, and find regulatory regions — short sequences
that turn genes on and off. Red boxes highlight areas of sequence
similarity between at least two species. Functional sequences — genes
and regulatory elements — tend to be conserved across all species. The
figure shows how one true regulatory element and gene might emerge
from a comparison of four yeast species.
Other Saccharomyces species
Eric Lander’s group at the MIT Whitehead Institute sequenced three species;
Bob Waterston’s group at WashU sequenced three as well, plus four more.
Below is a schematic showing the well-conserved order and orientation
of most genes, in this case for a segment containing about 35 genes. The
red ones are shared, but the blue ones are different.
Some observations from these comparisons across four species:
1. ~500 predicted genes encoding >100 amino acids appear not to be
real.
2. Most predicted introns were confirmed, and 60 more found, for 300 total.
3. Intergenic nucleotides change twice as frequently as genic nucleotides.
4. 14% of intergenic sites show indels, but only 1% of genic (and no
frameshifts in the genic regions, of course).
5. Proteins range in conservation from 100% for MATa2 mating type to
just 13% for YBR184W (involved in gamete formation). This gene also
has an elevated Ka/Ks ratio of 0.7, compared to ~0.1 for most genes.
Ks is the rate of synonymous (silent) nucleotide changes per synonymous site.
Ka is the rate of non-synonymous (replacement) changes per
non-synonymous site.
So Ks reflects mostly 3rd codon positions, and Ka mostly 1st and 2nd.
If Ka/Ks = 1, there is no selection (neutral evolution); Ka/Ks > 1
indicates positive selection.
Usually Ka/Ks << 1, indicating stabilizing or negative (purifying) selection.
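The Ka/Ks idea above can be sketched in a few lines of Python. This is a simplified illustration only, loosely following the Nei-Gojobori way of counting synonymous and non-synonymous sites: it reports uncorrected proportions (no multiple-hit correction), skips codons that differ at more than one position, and assumes two ungapped, in-frame, uppercase ACGT sequences. The function names and the toy sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order at each position.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3).

    Each position contributes the fraction of its three possible
    single-base changes that leave the amino acid unchanged.
    """
    total = 0.0
    for pos in range(3):
        same = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:
                same += 1
        total += same / 3
    return total


def ka_ks(seq1, seq2):
    """Return (Ka, Ks, Ka/Ks) as uncorrected proportions of differences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and in frame")
    syn_sites = nonsyn_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        ndiff = sum(a != b for a, b in zip(c1, c2))
        if ndiff > 1:
            continue  # simplification: ignore multi-change codons
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        syn_sites += s
        nonsyn_sites += 3 - s
        if ndiff == 1:
            if CODE[c1] == CODE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    ks = syn_diffs / syn_sites if syn_sites else float("nan")
    ka = nonsyn_diffs / nonsyn_sites if nonsyn_sites else float("nan")
    ratio = ka / ks if ks else float("nan")
    return ka, ks, ratio


# Toy example: one silent change (GCT to GCC) and one replacement (AAA to AGA).
ka, ks, ratio = ka_ks("ATGGCTAAA", "ATGGCCAGA")
print(round(ka, 3), round(ks, 3), round(ratio, 3))
```

Because there are far more non-synonymous than synonymous sites in a typical gene, equal raw counts of the two kinds of change still give a ratio well below 1, which is why the counts are normalized per site.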
Universal genetic code
1. Single codons for M & W, so any change changes the aa.
2. Pairs for 9 amino acids, so even 3rd position changes can change the aa.
3. Only R (Arg) and L (Leu) can change first position without
changing the aa.
4. Serine is really strange: it can only easily change from AGY to
TCN via Threonine (ACN).
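These observations about the code can be checked directly by tabulating the standard codon table. The sketch below is one way to do that; the variable names are hypothetical, and the table is written out in the conventional TCAG ordering.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order at each position.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid they encode ("*" marks stop codons).
codons_for = defaultdict(list)
for codon, aa in CODE.items():
    codons_for[aa].append(codon)

# Amino acids encoded by exactly one codon, and by exactly two codons.
singles = sorted(aa for aa, cs in codons_for.items() if len(cs) == 1)
pairs = sorted(aa for aa, cs in codons_for.items() if len(cs) == 2)

# Amino acids for which some first-position change is synonymous.
first_pos_silent = sorted({
    aa
    for codon, aa in CODE.items()
    if aa != "*"
    for base in BASES
    if base != codon[0] and CODE[base + codon[1:]] == aa
})

print("single codon:", singles)
print("codon pairs:", pairs)
print("silent first-position change possible:", first_pos_silent)
print("serine codons:", sorted(codons_for["S"]))
```

The serine line makes the oddity visible: the AGY and TCN codon sets differ at both the first and second positions, so no single nucleotide change connects them.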
Conservation in the GAL1–GAL10 intergenic region (next slide).
Multiple alignment of the four species shows good overlap between
functional nucleotides and stretches of conservation. Asterisks denote
conserved positions in the multiple alignment. Blue arrows denote the
start and transcriptional orientation of the flanking ORFs. Experimentally
validated transcription-factor-binding footprints are boxed and labeled
according to the bound TF. Stretches of conserved nucleotides are
underlined. Note the TATA boxes where transcription starts for each
ORF. Nucleotides matching the published Gal4 motif are shown in red.
Note that there are four Gal4 binding sites, which is common for
promoters where cooperative binding of the TF is required for activation
of the promoter. Presumably this regulatory region controls both ORFs.
Scer, S. cerevisiae; Spar, S. paradoxus; Smik, S. mikatae; Sbay, S.
bayanus.
Waterston’s analysis extended this to seven species, where one can see
greater divergence of the promoter region in the more divergent species.
(Figure labels: Gal4, Mig.)
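The asterisk line and the underlined conserved stretches described for the GAL1–GAL10 alignment can be produced from any multiple alignment with a simple column scan. The sketch below assumes the alignment is already computed and given as equal-length strings with "-" for gaps; the four short sequences are invented for illustration and are not the real GAL1–GAL10 region.

```python
def conservation_line(aligned):
    """Return a string with '*' under every column where all sequences
    carry the same base (gap columns never count as conserved)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        conserved = column[0] != "-" and all(b == column[0] for b in column)
        marks.append("*" if conserved else " ")
    return "".join(marks)


def conserved_stretches(marks, min_len=4):
    """Return (start, end) half-open intervals of runs of '*' that are
    at least min_len columns long."""
    stretches, start = [], None
    for i, mark in enumerate(marks + " "):  # trailing space closes a final run
        if mark == "*" and start is None:
            start = i
        elif mark != "*" and start is not None:
            if i - start >= min_len:
                stretches.append((start, i))
            start = None
    return stretches


# Hypothetical toy alignment of four species.
alignment = {
    "Scer": "ACGGATTA-CCGA",
    "Spar": "ACGGATTAGCCGT",
    "Smik": "ACGGTTTA-CCGA",
    "Sbay": "TCGGATTAACCGA",
}
marks = conservation_line(list(alignment.values()))
for name, seq in alignment.items():
    print(name, seq)
print("    ", marks)
print(conserved_stretches(marks, min_len=3))
```

Adding more divergent species can only remove asterisks, never add them, which is one way to see why a seven-species comparison narrows the conserved stretches further.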
Phylogenetic footprinting
Both groups undertook extensive analyses to identify additional copies
of already known regulatory motifs (enhancers and silencers), and
attempted to identify new ones.
1. They could identify sequence motifs upstream of genes of similar
function, hence implying common regulation, e.g. ACTCTTTT for
amino acid metabolism, or GTACGGAT for ribosome biogenesis, or
TTGCAA for peroxisome function.
2. They identified motifs of unknown function upstream of genes with
coherent expression patterns from microarray studies, e.g. TGTTCT for
expression in mitochondria, or CAAACAAA, AAGTA and TTTCTAGA
for stress-induced genes, or TAGAAA and TTCTTTC for genes in the
cell cycle.
Note from this that enhancers are relatively short regions of 6-12 bp, which
therefore can probably arise de novo, and just as easily be lost.
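Searching upstream regions for such short motifs is easy to sketch. The code below scans both strands for exact matches and also gives the chance expectation behind the "arise de novo" remark: under the simplifying assumption of equal, independent base frequencies (yeast DNA is actually AT-rich), a given k-mer turns up about once every 4^k bp. The upstream sequence and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return sorted (position, strand) pairs for exact motif matches.

    Positions are 0-based on the given strand; '-' hits are where the
    reverse complement of the motif occurs in seq.
    """
    hits = []
    rc = reverse_complement(motif)
    for strand, pattern in (("+", motif), ("-", rc)):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)
        if rc == motif:
            break  # palindromic motif: do not count each site twice
    return sorted(hits)


def expected_spacing(motif):
    """Average bp between chance occurrences on one strand, assuming
    equal and independent base frequencies."""
    return 4 ** len(motif)


upstream = "GGACTCTTTTCCAAAAGAGTTT"  # hypothetical upstream sequence
print(find_motif(upstream, "ACTCTTTT"))
print(expected_spacing("TTGCAA"), expected_spacing("GTACGGAT"))
```

A 6-mer expected every few kilobases by chance is within easy reach of a handful of point mutations in any intergenic region, which is why conservation across species, rather than mere presence, is the useful signal.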
Genome structure
Comparison of 11 yeast genomes allows reconstruction of the ancestor
of them all, roughly 100 MYA, with 4,700 genes. Getting to S. cerevisiae
from this ancestor involved the whole-genome duplication, loss of
roughly 3,400 genes, plus 73 inversions and 66 reciprocal translocations.
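The gene-count bookkeeping implied by these numbers is worth spelling out. The sketch below uses only the figures quoted above and assumes, as a simplification, that no new genes were gained along the way.

```python
ancestral_genes = 4700   # reconstructed ancestor (figure from the text)
genes_lost = 3400        # losses on the way to S. cerevisiae (from the text)

# A whole-genome duplication doubles the gene count in one step.
after_duplication = 2 * ancestral_genes

# Subsequent loss removes most, but not all, of the extra copies.
remaining = after_duplication - genes_lost
fraction_lost = genes_lost / after_duplication

print(after_duplication, remaining, round(fraction_lost, 2))
```

So roughly a third of the post-duplication genes were lost again, leaving a gene count only modestly above that of the ancestor.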
Schizosaccharomyces pombe - fission yeast
1. Large international consortium, but most famous is Paul Nurse, who
won the 2001 Nobel for discovery of cdc2, encoding the cyclin-dependent
kinase involved in cell-cycle control.
2. 14 Mbp genome encoding 5000 proteins.
3. 43% of genes have introns, total 4,700 introns.
4. Centromeres are relatively long, 35-110 kb, and consist of repeats
ranging up to 1.8 kb long.
5. S. cerevisiae centromeres are only 150-180 bp long with a 120 bp
core region!
Large-scale comparisons
1. About 2/3 of these two yeasts' genes encode proteins with clear
matches in the nematode worm, C. elegans (Ce). So these are
eukaryotic-specific genes.
2. In each case about 150 genes are shared with Ce, but not the other
yeast, so these are ancient genes, yet lost in one or other yeast.
3. About 800 genes or 15% are shared between the two yeasts only,
that is, are yeast-specific.
4. A similar number are unique to each yeast, more for S. cerevisiae (Sc)
due to the genome duplication in Sc.
Neurospora crassa - bread mold
1. Used by Beadle and Tatum for their “one-gene-one-enzyme” hypothesis.
2. Also used for circadian rhythm, genome defense systems, and DNA
methylation studies; the last led to recognition of the importance of
DNA methylation in cancer.
3. 40 Mbp genome done by WGS at MIT-Whitehead.
4. ~10,000 predicted genes, so proteome is twice the size of the yeasts.
5. Roughly 1 gene per 4 kb, compared with 1 per 2 kb in yeasts, and 2
introns per gene on average. So genome is rather more complicated.
6. ~4,000 of these proteins had no matches in databases at the time.
7. Repeat-induced point mutation (RIP) is a novel genome-defense
mechanism whereby one copy of any repeated sequence is methylated at
CpG dinucleotides and then is subject to elevated rates of point
mutations, eventually becoming a pseudogene. Defends against
transposons, but also limits the sizes of gene families.
Encephalitozoon cuniculi - microsporidian
1. Microsporidia are obligate intracellular parasites of animals.
2. E. cuniculi infects mammals, including immunocompromised humans.
3. Microsporidia don’t have mitochondria, so were once thought to be a
basal lineage of protists near the base of the Eukarya, but have some
mitochondrially-derived genes, so perhaps have a mitosome (below).
4. Genome is 3 Mbp with ~2000 genes.
5. Proteins 10-30% shorter than yeast.
6. Lost gene pathways, e.g. Krebs cycle.
Phytophthora infestans - an Oomycete
1. Cause of potato blight and the Irish potato famine.
2. Not really a fungus, but similar in biology.
3. 240 Mbp genome is relatively large, due to an explosion of repetitive
elements, making up 75% of the sequence.
4. Twice the size of the genomes of two other Phytophthora species, which
cause soybean root rot and sudden oak death, due to these repeats.
5. Roughly 18,000 genes predicted, and comparisons allowed
recognition of rapidly evolving “disease effector protein” genes.
6. These mediate pathogenicity, and evasion of resistance in potatoes.
7. They are buried in repeats; do the repeats mediate some of their
rapid evolution?
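The genome sizes and gene counts quoted in these notes can be put on a common footing as gene densities. The sketch below uses only the rounded figures given above, so the results are rough; the last line assumes that all P. infestans genes sit in the 25% of the genome that is not repetitive, which is a simplification.

```python
# (genome size in Mbp, predicted genes), rounded figures from the notes
genomes = {
    "S. pombe": (14, 5000),
    "N. crassa": (40, 10000),
    "E. cuniculi": (3, 2000),
    "P. infestans": (240, 18000),
}


def kb_per_gene(size_mbp, genes):
    """Average genome span per gene, in kb."""
    return size_mbp * 1000 / genes


for name, (size_mbp, genes) in genomes.items():
    print(f"{name:13s} {kb_per_gene(size_mbp, genes):5.1f} kb per gene")

# P. infestans again, counting only the non-repetitive quarter of the genome.
size_mbp, genes = genomes["P. infestans"]
print(f"non-repeat    {kb_per_gene(size_mbp * 0.25, genes):5.1f} kb per gene")
```

On these numbers the microsporidian genome is the most compact, and once repeats are set aside P. infestans is close to Neurospora in gene spacing, consistent with repeats rather than genes accounting for its size.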