Chapter 10
Comparative Genomics
Insights gained through comparison of
genomes from different species
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle River, New Jersey 07458
Conservation and function
Sequence similarity searches
Gene finding
Regulatory sequence identification
Interaction mapping
Genes and evolution
 Human Genome Project decided to use smaller
genomes as warm-up for human genome
 Resulted in sequencing:
 Many bacteria
 Model organism genomes
 Yeast, C. elegans, Arabidopsis, Drosophila
Comparison of these genome sequences
provided the basis for the field of “Comparative
Genomics”
Early comparative genomics
 Comparative genomics prior to obtaining full
genome sequence:
 Genome size
 Compared DNA content among species
 Single copy and repetitive DNA
 Used hybridization kinetics
 Found amount of repetitive DNA differed
greatly among species
 Synteny: genes that are in the same relative
position on two different chromosomes
 Genetic and physical maps compared between species
 Or between chromosomes of the same species
 Closely related species generally have similar
order of genes on chromosomes
 Synteny can be used to identify genes in one
species based on map-position in another
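The idea of conserved gene order can be sketched in a few lines of Python. This is a minimal illustration, assuming each chromosome is given as an ordered list of gene identifiers that have already been matched between the two species. The function name `syntenic_blocks` and the gene names are hypothetical, and only same-orientation runs are detected.

```python
def syntenic_blocks(order_a, order_b, min_len=2):
    """Return runs of genes that are adjacent and in the same order in both lists."""
    # Position of every gene on the second chromosome.
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            gene in pos_b
            and current
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
            continue
        # The run is broken: keep it if it is long enough, then start over.
        if len(current) >= min_len:
            blocks.append(current)
        current = [gene] if gene in pos_b else []
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Hypothetical gene orders on one chromosome of two related species.
species_a = ["g1", "g2", "g3", "x", "g7", "g8"]
species_b = ["g0", "g1", "g2", "g3", "g7", "g8", "g9"]
print(syntenic_blocks(species_a, species_b))
```

A gene of unknown identity that sits inside such a block in one species is a natural candidate for the gene at the corresponding map position in the other.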
Synteny of Grass genomes
 Synteny among crop
genomes: rice, maize
and wheat
 Rice is smallest genome
in center
 Wheat largest - outer
 Genes found in similar
places on chromosomes
are indicated
Synteny of sequenced genomes
 When sequences from mouse and human genomes are compared:
 Find regions of remarkable synteny
 Genes are in almost identical order for long stretches
along the chromosome
[Figure: Mouse/human synteny; both panels labeled Chr 14]
Comparing sequenced genomes
 Comparison of genomic sequences from
different species can help identify:
Gene structure
Gene function
Regulatory sequences
Interactions between gene products
Evolution and sequence
 Genome comparisons based on observation:
conservation = function
 If no constraints on DNA sequence
 Random mutations will occur
 Over tens of millions of years these random
mutations will make two related sequences different
Function and sequence
 However: if there are constraints:
 e.g. DNA codes for protein
 Or transcription factor binds DNA
 Then there will be sequence similarity when
related sequences compared
 Basic rule when comparing two related sequences:
 Sequence conservation = functional
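One simple way to put a number on conservation is percent identity between two sequences that have already been aligned. The sketch below assumes equal-length aligned strings with "-" for gaps; the function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned (non-gap) positions at which two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments from two related species.
constrained = percent_identity("ATGGCCAAG", "ATGGCAAAG")    # coding-like region
unconstrained = percent_identity("TTACGGATC", "TCAAGTATG")  # freely drifting region
print(constrained, unconstrained)
```

Under the slide's reasoning, a region with a much higher identity than its surroundings is a candidate for a functional constraint.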
Orthologs and Paralogs
 When comparing sequences from different species:
 Must distinguish between two types of closely
related sequences:
 Orthologs are genes found in two species that
descend from a single gene in their common ancestor
 Paralogs are genes found in the same species
that were created through gene duplication
Orthologues and Paralogues
Sequence similarity and gene function
 Sequence comparisons that implicate function
are widely used:
 To determine if newly sequenced cDNA or
genomic region encodes a gene of known function
 Search for similar sequence in other species
(or in same species)
Homology searches
 Search databases of DNA sequences
 Use computer algorithms to align sequences
 Don’t require perfect matches between sequences
 Allow for insertions, deletions and base changes
 Most commonly used algorithms:
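To make the idea of aligning sequences while allowing insertions, deletions and base changes concrete, here is a minimal dynamic-programming sketch of a local alignment score in the style of Smith-Waterman. It is an illustration only, not the method used by any particular database search tool, and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1]: a match or a base change.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A gap in one sequence or the other: an insertion or deletion.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


# The second sequence lacks one base; a single gap recovers the full match.
print(local_alignment_score("ACGTACGT", "ACGACGT"))
```

Because gaps and mismatches only cost a penalty rather than ending the alignment, related sequences still score well after they have diverged.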
Homology search example
 The sea squirt, Ciona intestinalis, makes a coat
primarily of cellulose
 A BLAST search was performed on the Ciona
genome using an Arabidopsis endoglucanase
gene involved in cellulose synthesis
 Extensive homology was found with a Ciona
gene flanked by genes found in Drosophila and C. elegans
 It is postulated that the Ciona endoglucanase
gene may have arisen by lateral gene transfer
Discovery of endoglucanase gene
in sea squirt genome
[Figure labels: Arabidopsis Korrigan; Splicing factor; C. intestinalis cDNA; C. elegans and Drosophila]
Homology search for the mouse
 Homology search of all
genes in the mouse
genome:
 27% in other metazoans
 29% in other eukaryotes
 6% in other chordates
 14 % in other mammals
 Less than 1% rodent
Problems of Genome annotation
 Identifying genes and regulatory regions in
sequenced genomes is challenging
 Open reading frames (ORFs) are usually good
indication of genes
 Problem is: difficult to determine which ORFs
belong to a gene
 Many mammalian genes have small exons and
large introns
 Regulatory sequences even more difficult
Computational approaches to
gene identification
 Computer programs analyze genomic sequence
 GRAIL, GeneFinder
 Look for ORFs, splice sites, poly A addition
sites etc.
 Predict gene structure
 Frequently wrong
 Usually miss exons at beginning or end of gene
 Or predict exon when doesn’t really exist
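The simplest piece of what such programs do, scanning for open reading frames, can be sketched as follows. This is a deliberately naive illustration, not how GRAIL or GeneFinder work internally: it scans only the forward strand, ignores introns and splice sites, and uses a hypothetical minimum length. Those omissions are exactly why ORF scanning alone is a weak predictor for mammalian genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` also counts
    the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF in the third reading frame.
dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])
```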
How genome comparisons help
 When comparing genomes of different species
 Genes normally have same exon/intron structure
 Look for conserved ORFs in both genomes
 Frequently permits accurate identification of
 Fugu/human comparison found >1000 genes
 Mouse/human comparison indicates only
30,000 genes in genome
Sequence comparison example
 Comparison of the human and mouse spermidine
synthase genes
 Revealed an additional intron in the human gene that is
not found in the mouse homologue
[Figure: human/mouse spermidine synthase gene comparison; label: 5,500 bp]
Identifying small RNAs
 Growing evidence that
small RNAs can
regulate gene expression
 Small RNAs are 20-25
nucleotides long
 Conservation between
genomes suggests
function
 Example: Small RNAs
conserved in
Arabidopsis and rice
Regulatory sequence
 A large portion of the genome contains
regulatory information
 Regulatory sequence includes:
 Cis-regulatory elements: tell genes when and
where to turn on
 Basal transcription machinery binding sites
 Enhancers
 Can be 5’ of gene, 3’ of gene or in intron
Regulatory sequences
Finding regulatory sequences
 Regulatory sequences are difficult to identify
using computer programs
 Problem is: most enhancer sequences have yet
to be identified
 They are usually short: 6-10 basepairs
 Those that are known are usually degenerate
 They can differ in one or more basepairs
 Still bind the cognate transcription factor
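Degenerate binding sites are commonly written with IUPAC ambiguity codes, where one letter stands for several possible bases. The sketch below, an illustration under that assumption, turns such a consensus into a regular expression and scans a sequence for matches on the forward strand. The function name `find_motif` and the 7-bp consensus are hypothetical examples.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_motif(seq, consensus):
    """Return (position, matched_site) for every forward-strand match."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    # The lookahead lets overlapping sites be reported as well.
    pattern = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


# Illustrative consensus: S means C or G, so two different sites match.
print(find_motif("AATGACTCAGGTGAGTCATT", "TGASTCA"))
```

Sites this short match often by chance in a large genome, which is why a pattern scan alone gives many false positives.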
Comparisons to identify
regulatory elements
 Comparisons of genomes of different species
can identify regulatory elements
 Change in intergenic regions and introns
usually more rapid than in coding regions
 Nevertheless, regulatory elements tend to be
conserved
 Conserved regions called “phylogenetic
footprints”
Phylogenetic footprint
 To identify conserved regulatory regions
usually requires comparing genomes of closely
related species
 If too distantly related, very difficult to find
 Nevertheless, mouse/human sequence
comparison has revealed many conserved cis-regulatory elements
Mouse/human comparison
Using multiple species for
Phylogenetic footprinting
 The location of regulatory sequences can also
be found comparing several related sequences
 Multiple alignments performed
 Better able to home in on important regions
 Conservation alone not enough, need to
validate importance of elements
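Given a multiple alignment of the same non-coding region from several species, candidate footprints can be picked out as runs of columns that are identical in every species. The sketch below assumes the alignment is already computed and supplied as equal-length strings; the function name `conserved_blocks`, the minimum length of 6 and the sequences are hypothetical.

```python
def conserved_blocks(alignment, min_len=6):
    """Return (start, end, sequence) for runs of fully conserved, gap-free columns."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks, start = [], None
    # Go one column past the end so a run reaching the last column is closed.
    for col in range(length + 1):
        conserved = (
            col < length
            and alignment[0][col] != "-"
            and len({row[col] for row in alignment}) == 1
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, col, alignment[0][start:col]))
            start = None
    return blocks


# Hypothetical aligned upstream regions from three related species.
aligned = [
    "ACGTTGACGTCATTAC",
    "TCGATGACGTCAGTAC",
    "GCGCTGACGTCACTAA",
]
print(conserved_blocks(aligned))
```

Adding more species makes chance conservation of a whole block less likely, but, as the slide notes, a conserved block is still only a candidate until it is tested experimentally.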
Interaction mapping
 Protein-protein interactions include:
 The transfer of information in a genetic
 Scaffolding to tether other proteins
 Enzymatic reactions
 Large molecular machines such as motors
Rosetta Stone
 Observation: in some species, interacting
proteins encoded by single gene
 In other species same proteins encoded in two
genes
 Systematic search through sequenced genomes
for these relationships should identify proteins
that interact
 Called “Rosetta Stone” approach
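The systematic search can be sketched as follows, assuming each protein has already been reduced to a set of domain labels by some prior homology analysis. The data structure, the function name `rosetta_stone_pairs` and the domain labels are hypothetical; the topoisomerase II and gyrase names come from the example in these slides.

```python
from itertools import combinations


def rosetta_stone_pairs(genomes):
    """Predict interacting protein pairs from fused proteins in other genomes.

    `genomes` maps species -> {protein name -> set of domain labels}.
    """
    predictions = set()
    for fused_species, fused_proteins in genomes.items():
        for fused_name, fused_domains in fused_proteins.items():
            if len(fused_domains) < 2:
                continue  # a single-domain protein cannot be a fusion
            for species, proteins in genomes.items():
                if species == fused_species:
                    continue
                for p1, p2 in combinations(sorted(proteins), 2):
                    d1, d2 = proteins[p1], proteins[p2]
                    # Two separate proteins whose distinct domains both
                    # occur together in the fused protein.
                    if d1 and d2 and d1.isdisjoint(d2) \
                            and d1 <= fused_domains and d2 <= fused_domains:
                        predictions.add((species, p1, p2, fused_species, fused_name))
    return sorted(predictions)


genomes = {
    "yeast": {"topoisomerase II": {"GyrB-like", "GyrA-like"}},
    "E. coli": {
        "gyrase A": {"GyrA-like"},
        "gyrase B": {"GyrB-like"},
        "other protein": {"unrelated"},
    },
}
print(rosetta_stone_pairs(genomes))
```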
Rosetta Stone example
 Equivalent of yeast
protein topoisomerase II
 In E. coli two proteins:
gyrase A and gyrase B
 Suggests gyrase B and
gyrase A interact
[Figure: topoisomerase II compared with E. coli gyrase B and gyrase A]
Rosetta stone
[Figure labels: Escherichia coli; Haemophilus influenzae; Methanococcus jannaschii]
Higher level comparisons
 Comparisons between genomes not just to
better identify genes and regulatory sequences
 Evolution of adaptive traits occurs through:
 Evolution of new genes
 Changing when and where genes are expressed
 Thus comparisons of genes found in genome
can provide information about mechanisms of
evolution
Genes and genomes
 Comparison of total gene numbers in
sequenced genomes:
 Smaller than originally expected
 Ex: Human genome thought to have 100,000 genes
 Now think closer to 30,000-35,000 genes
 Suggests that many new functions arise in gene
 Use old genes in new ways
Selective expansion of genes
 Although comparisons show not as much
difference in numbers of genes as expected
 Still see striking differences in numbers of
some gene families
 Example:
 Roundworm C. elegans has a large number of
nuclear receptor genes
 Drosophila has large number of zinc-finger
transcription factors
 Plants have no G-protein coupled receptors
What is difference between man
and ape?
 Man and chimpanzee
have a genome wide
similarity of greater than
 What accounts for
differences in species?
 Recent study suggests
due to specific gene
expression differences.
 Striking differences
found only in brain
Human/ape gene expression
 Methods being developed to identify genes
involved in adaptive traits
 Example: “Trait-to-gene”
 Underlying reasoning:
 Organisms that have a particular trait either
share related genes
 Or have developed new genes to perform same function
Relating traits to genes
[Figure: five species (Species 1 to 5); four "Trait A" labels]
 Comparisons made of bacterial genomes
 Need many genomes
 Looked for genes involved in flagellar function
 Identified 43 of 45 known genes
 Found 5 additional genes that program said
should be involved in flagella function
 Knocked out 3 and found that 2 resulted in
bacteria with defective flagella
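The first half of the trait-to-gene reasoning (organisms with a trait share related genes) can be sketched as a set comparison across genomes. This is a strict, simplified illustration and not the program used in the study above: it assumes genes have already been grouped into families shared across species, and it keeps only families present in every genome with the trait and absent from every genome without it. The function name `trait_candidate_genes` and all species and gene-family names are hypothetical.

```python
def trait_candidate_genes(gene_content, has_trait):
    """Gene families found in all species with the trait and in none without it.

    `gene_content` maps species -> set of gene-family names.
    `has_trait` maps species -> True or False.
    """
    with_trait = [s for s in gene_content if has_trait[s]]
    without_trait = [s for s in gene_content if not has_trait[s]]
    if not with_trait:
        return set()
    # Families shared by every species that shows the trait.
    shared = set.intersection(*(gene_content[s] for s in with_trait))
    # Families seen in any species that lacks the trait.
    elsewhere = set().union(*(gene_content[s] for s in without_trait))
    return shared - elsewhere


gene_content = {
    "species1": {"famA", "famB", "famC"},
    "species2": {"famA", "famB", "famD"},
    "species3": {"famA", "famB"},
    "species4": {"famA", "famD"},
}
has_trait = {"species1": True, "species2": True, "species3": True, "species4": False}
print(trait_candidate_genes(gene_content, has_trait))
```

A strict filter like this needs many genomes to be informative, since with only a few species many unrelated families pass by chance.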
[Figure: B. subtilis 168, two panels]
Overnight growth at 37°C. Swim medium (LB + 0.25% agar).
Similar results at 20°C (4 days) and 30°C (2 days).
The goal of comparative genomics
 Synteny = similar relative positions of genes
on chromosomes
 Conservation = function
 Homology searches
 Gene structure prediction
 Regulatory sequence identification
 Interaction mapping
 Genes and evolution
