Putting gene family evolution
in its chromosomal context
Todd Vision
Department of Biology
University of North Carolina at Chapel Hill
Outline
Gene order rearrangement in plants
• Chromosomal perspective
• Gene family perspective
Gene duplication and functional
divergence
• Segmental duplications as a tool
Chromosomal perspective
Biological importance
• Clustering of gene function
• Clustering of transcriptional activity
Applied importance
• Conservation of gene order (synteny)
Devos and Gale 2000 Plant Cell 12, 637
Arabidopsis as a hub for plant
comparative maps
Genome sizes in angiosperms (megabases)
[Figure: bar chart of genome size for 11 species; y-axis 0 to 1000 Mb. Bar values: 145, 262, 367, 367, 372, 415, 439, 473, 560, 622, 907. Species labels not recoverable.]
Arumuganathan and Earle 1991 Plant Mol Biol Rep 9, 208.
Arabidopsis paleopolyploidy
The Arabidopsis Genome Initiative 2000 Nature 408, 796
Non-overlapping syntenies
[Figure: syntenic regions between chromosome 4 (4.6 Mb) and chromosome 2 (5.6 Mb); axis positions and block labels not reproduced]
Blanc et al. 2003 Genome Res. 13, 137.
Blanc and Wolfe 2004 Plant Cell 16, 1667.
Tomato-Arabidopsis synteny
Bancroft 2001 TIG 17, 89 after Ku et al. 2000 PNAS 97, 9121.
Rice-Arabidopsis microsynteny
Mayer et al. 2001 Genome Res. 11, 1167.
Hidden syntenies
Simillion et al. 2002 PNAS 99, 13627.
Interspecies comparison can
reveal hidden syntenies
Vandepoele et al. 2002 TIG 18, 606.
Simillion et al. 2004 Genome Res. 14, 1095
From descriptive to predictive
Can we predict the gene content of
homologous segments when markers
are sparse?
Utility for QTL mapping
• Prioritize candidate genes in a QTL region
from a non-sequenced genome
• Provide markers for fine-mapping
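Hidden Markov Models (HMM)
Hidden states: 1, 2, end
Transition probabilities
• From state 1: t1,1 (stay in 1), t1,2 (move to 2)
• From state 2: t2,2 (stay in 2), t2,end (move to end)
Emission probabilities
• State 1: p1(a), p1(b)
• State 2: p2(a), p2(b)
Example
• Observed states: a->b->a
• Hidden states: 1->1->2->end
• Probability: p1(a) t1,1 p1(b) t1,2 p2(a) t2,end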
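The path probability on this slide can be checked numerically. The following is a minimal sketch: the numeric values in `trans` and `emit` are made up for illustration and the function name is hypothetical; only the structure (states 1, 2, end; symbols a, b) comes from the slide.

```python
# Toy two-state HMM in the slide's notation. All numbers are illustrative
# assumptions, not values from the talk.
trans = {("1", "1"): 0.6, ("1", "2"): 0.4, ("2", "2"): 0.7, ("2", "end"): 0.3}
emit = {("1", "a"): 0.8, ("1", "b"): 0.2, ("2", "a"): 0.5, ("2", "b"): 0.5}


def path_probability(observed, hidden):
    """Joint probability of an observed sequence and one hidden path.

    `hidden` has one more entry than `observed` (the trailing "end" state).
    The chain is assumed to start in hidden[0] with probability 1.
    """
    prob = 1.0
    for i, symbol in enumerate(observed):
        prob *= emit[(hidden[i], symbol)]                    # p_state(symbol)
        prob *= trans.get((hidden[i], hidden[i + 1]), 0.0)   # t_state,next
    return prob


# p1(a) t1,1 p1(b) t1,2 p2(a) t2,end
p = path_probability(["a", "b", "a"], ["1", "1", "2", "end"])
```

Summing this quantity over all hidden paths gives the likelihood of the observations, which is what parameter estimation would maximize.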
A gene content HMM
Observed states
• a homologous gene is either observed or not
Hidden states
• presence or absence of gene within a segment
Emission probabilities
• A gene will be unobserved if it is not present
• A gene may be unobserved even if it is present
• Dependent on the density of the gene map
Transition probabilities
• reflect conservation of gene content along the branches of a phylogeny
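The emission rules above can be written as a small table. This sketch assumes a single hypothetical parameter `detect`, the probability that a gene that is present actually appears on the map (higher for denser maps); the slide does not say how map density enters, so this is one simple reading.

```python
def emission_probs(detect):
    """Emission table emission[hidden][observed].

    detect = P(gene observed | gene present); an assumed stand-in for
    the density of the gene map.
    """
    return {
        "present": {True: detect, False: 1.0 - detect},
        "absent": {True: 0.0, False: 1.0},  # an absent gene is never observed
    }


def posterior_present(prior_present, detect, observed):
    """P(gene present | observation) for one gene in one segment (Bayes' rule)."""
    e = emission_probs(detect)
    num = prior_present * e["present"][observed]
    den = num + (1.0 - prior_present) * e["absent"][observed]
    if den == 0.0:
        raise ValueError("observation has zero probability under this model")
    return num / den
```

With a sparse map (`detect = 0.2`) and a prior of 0.5, an unobserved gene still has posterior probability of about 0.44 of being present, which is why sparse markers leave room for prediction.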
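Transition probabilities and the segment phylogeny
[Figure: transition diagrams for three models, drawn with a segment phylogeny whose tips are labeled A1 and A2; layout not recoverable. P and A are read here as gene present and gene absent.]
Loss (L)
• P to P: 1-a; P to A: a
• A to A: 1
Loss-Gain (LG)
• P to P: 1-a; P to A: a
• A to P: b; A to A: 1-b
Multiple Loss-Gain (MLG)
• P to P: 1-ai; P to A: ai
• A to P: b; A to A: 1-b
• i = branch type: 1 speciation, 2 duplication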
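A minimal sketch of the three transition models as per-branch 2x2 matrices. It assumes the states are gene present and gene absent, and that a and b act as per-branch probabilities; how branch length enters is not shown on the slide, so it is left out. Function names are hypothetical.

```python
def transition_matrix(model, loss, gain=0.0, branch_type=None):
    """Per-branch matrix with rows/columns ordered (present, absent).

    model: "L" (loss only), "LG" (loss and gain), or "MLG" (loss rate
    depends on branch type). For "MLG", `loss` is a dict such as
    {"speciation": a1, "duplication": a2}.
    """
    if model == "L":
        a, b = loss, 0.0
    elif model == "LG":
        a, b = loss, gain
    elif model == "MLG":
        a, b = loss[branch_type], gain
    else:
        raise ValueError(f"unknown model: {model}")
    return [[1.0 - a, a], [b, 1.0 - b]]


def propagate(p_present, matrix):
    """Probability the gene is present at the far end of one branch."""
    return p_present * matrix[0][0] + (1.0 - p_present) * matrix[1][0]


# Example: a duplication branch followed by a speciation branch (MLG).
rates = {"speciation": 0.1, "duplication": 0.3}
p = propagate(1.0, transition_matrix("MLG", rates, 0.1, "duplication"))
p = propagate(p, transition_matrix("MLG", rates, 0.1, "speciation"))
```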
Estimating model parameters
Segment phylogeny
• Each set of homologous genes is missing from some segments
• Estimate an “averaged” distance matrix
• Build tree with neighbor-joining and midpoint rooting
HMM parameter estimation
• Loss rate(s)
• Gain rate
• Number of genes present at the root
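The slide does not say how the "averaged" distance matrix is computed. One plausible reading, sketched here as an assumption, is to average each pairwise segment distance over only those gene sets that contain both segments. The input layout and function name are hypothetical.

```python
from collections import defaultdict


def averaged_distances(per_gene_set):
    """Average pairwise segment distances across gene sets.

    per_gene_set: list of dicts mapping (segment, segment) -> distance,
    one dict per set of homologous genes. A pair is simply missing from
    a dict when either segment lacks a gene from that set.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for distances in per_gene_set:
        for (s, t), d in distances.items():
            key = tuple(sorted((s, t)))  # treat (s, t) and (t, s) alike
            totals[key] += d
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


avg = averaged_distances([
    {("A", "B"): 0.2, ("A", "C"): 0.4},
    {("B", "A"): 0.4},
])
```

Pairs of segments that never share a gene set stay absent from the result; neighbor-joining needs a complete matrix, so such pairs would have to be handled separately.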
Do parameter estimates
converge?
LG model
n=100 genes
no missing data
a1 = 0.1, a2 = 0.3
1000 replicates
Initial a    â1       SE       â2       SE
0.05         0.106    0.006    0.294    0.018
0.3          0.106    0.006    0.294    0.018
Accuracy of hidden state assignments
5-segment phylogeny, a = a1 = 0.1, a2 = 0.3, b = 0.1, 24% gain
[Figure: estimated probability vs. true probability, both axes 0 to 1, for the L, LG, and MLG models]
A large multiplicon
12 segments from rice and Arabidopsis
56 sets of homologous genes
Vandepoele et al 2003 Plant Cell 15, 2192.
Self-validation test
Probability of gene presence (8 longest segments)
Segment    True     Estimate    Diff (True - Estimate)
1          0.251    0.173       +0.078
2          0.225    0.166       +0.059
3          0.262    0.171       +0.091
4          0.149    0.175       -0.026
5          0.268    0.171       +0.097
6          0.233    0.167       +0.066
7          0.226    0.170       +0.056
8          0.148    0.168       -0.020
Branch lengths scaled so that longest branch is 1.0
Estimate of a = 0.7
Summary: gene content HMM
Multispecies comparative maps
• Becoming more common
• Most species only partially characterized
• Usefulness also compromised by sparse synteny
Probabilistic models will allow us to move
• from simple descriptions of the extent of synteny
• to predictive tools that can guide further experiments
Gene family perspective
Modes of
duplication
• Tandem (T)
• Dispersed (D)
• Segmental (S)
T
D
S
A tale of two sisters: the ARF
and the Aux/IAA gene families
Modulate whole plant response to auxin
Interact via dimerization
• ARFs are transcription factors
• Aux/IAAs bind and repress ARFs in the
absence of auxin
Diversification of ARFs
Remington et al. 2004 Plant Physiol. 135, 1738.
The chromosomal context
Remington et al. 2004 Plant Physiol. 135, 1738.
Diversification of the Aux/IAAs
Remington et al. 2004 Plant Physiol. 135, 1738.
Why the different patterns of
diversification?
12% (ARF) vs 40% (Aux/IAA)
segmental duplications
Presumably reflects differential retention
Possible explanations
• Dosage requirements
• Coevolution with other interacting genes
• Regional transcriptional regulation
How typical is the Aux/IAA
family?
Gene family                           Genes    S events
Proteasome alpha & beta subunits      23       9
Ser/Thr phosphatase                   26       10
Ras related GTP-binding               72       19
Auxin-independent growth promoter     33       8
Major intrinsic protein               38       10
Calmodulin                            79       20
Phosphatidylcholine transferase       30       8
Cation/hydrogen exchanger             28       8
Cannon et al. 2004 BMC Plant Biology 4, 10.
Segmental duplication of
pathways?
Blanc and Wolfe 2004 Plant Cell 16, 1679.
Summary: gene family
perspective
Chromosomal context can matter
Gene families differ in their patterns of
duplicate gene proliferation
• Presumably due to differential retention
Polyploidy
• Qualitatively differs from other gene
duplication modes
• Divergence of whole pathways possible
Functional divergence and
chromosomal context
Do patterns of divergence (i.e.,
spatiotemporal expression) differ among
T, D, and S duplicates?
Duplicate pairs in yeast and human
(Gu et al. 2002, Makova and Li 2003)
Approx. 50% of pairs diverge very rapidly
Proportion of divergent pairs increases
with synonymous substitutions (Ks)
Less so with replacement changes (Ka)
• Plateaus at Ka ~0.3 in human
In humans, distantly related pairs with
conserved expression tend to be either
ubiquitous or very tissue specific
Digital expression profiling
Massively Parallel Signature Sequencing (MPSS)
• Count occurrence of 17-20 bp mRNA signatures
• Cloning and sequencing is done on microbeads
• Similar to Serial Analysis of Gene Expression (SAGE)
“Bar-code” counting reduces concerns of
• cross-hybridization
• probe affinity
• background hybridization
Which enables
• Accurate counts of low expression genes
• Distinguishing expression profiles of duplicate genes
MPSS
technology
Clone 3’ ends of
transcripts to
microbeads
Sort by FACS and
deposit in channeled
monolayer
Sequence 17-20 bp
from 5’ end by
hybridization
Brenner et al. 2000 PNAS 97:1665.
MPSS Data
Signature            Frequency
GATCAATCGGACTTGTC    2
GATCGTGCATCAGCAGT    53
GATCCGATACAGCTTTG    212
GATCTATGGGTATAGTC    349
GATCCATCGTTTGGTGC    417
GATCCCAGCAAGATAAC    561
GATCCTCCGTCTTCACA    672
GATCACTTCTCTCATTA    702
GATCTACCAGAACTCGG    814
.
.
GATCGGACCGATCGACT    2,935
Total # of tags: >1,000,000
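A table like this comes from tallying identical signatures and, for comparison across libraries of different depth, scaling counts to parts per million (ppm). The sketch below is illustrative only: the function names are hypothetical, and the tag list is toy data rather than values from the libraries.

```python
from collections import Counter


def signature_counts(tags):
    """Tally how many times each signature sequence was read."""
    return Counter(tags)


def to_ppm(counts):
    """Scale raw counts to parts per million of all tags in the library."""
    total = sum(counts.values())
    return {sig: 1e6 * n / total for sig, n in counts.items()}


# Toy library of 8 reads (not real data).
tags = ["GATCAATCGGACTTGTC"] * 2 + ["GATCGTGCATCAGCAGT"] * 6
counts = signature_counts(tags)
ppm = to_ppm(counts)
```

In a library of over a million tags, a signature seen 5 times sits near 5 ppm, so low-expression genes remain countable.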
Classifying signatures
[Figure: gene model with example signatures, labeled as follows]
• Typical signatures
• Duplicated: expression may be from other site in genome
• Potential alternative splicing or nested gene
• Anti-sense transcript or nested gene?
• Potential alternative termination
• Potential anti-sense transcript
• Potential un-annotated ORF
Signature classes (triangle colors in the figure match those used on our web page)
Class 1 - in an exon, same strand as ORF.
Class 2 - within 500 bp after stop codon, same strand as ORF.
Class 3 - anti-sense of ORF (like Class 1, but on opposite strand).
Class 4 - in genome but NOT class 1, 2, 3, 5 or 6.
Class 5 - entirely within intron, same strand.
Class 6 - entirely within intron, anti-sense.
Class 0 - signatures found in the expression libraries but not the genome.
Grey = potential signature NOT expressed
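The class definitions can be read as a decision rule for one signature against one gene model. This is a rough sketch, not the database's actual procedure: the `Gene` layout is hypothetical, exons are assumed to be coding-only so the outermost exon boundary stands in for the stop codon, and a signature overlapping any exon is treated as "in an exon".

```python
from dataclasses import dataclass


@dataclass
class Gene:
    strand: str   # "+" or "-"
    exons: list   # sorted, non-overlapping (start, end), 1-based inclusive


def introns(gene):
    """Intervals between consecutive exons."""
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(gene.exons, gene.exons[1:])]


def classify(sig_start, sig_end, sig_strand, gene, in_genome=True):
    """Return the signature class (0-6) relative to one gene."""
    if not in_genome:
        return 0
    same = sig_strand == gene.strand
    if any(s <= sig_start and sig_end <= e for s, e in introns(gene)):
        return 5 if same else 6          # entirely within an intron
    if any(sig_start <= e and s <= sig_end for s, e in gene.exons):
        return 1 if same else 3          # overlaps an exon
    if same:                             # within 500 bp after the stop codon
        if gene.strand == "+":
            stop = gene.exons[-1][1]
            if stop < sig_start <= stop + 500:
                return 2
        else:
            stop = gene.exons[0][0]
            if stop - 500 <= sig_end < stop:
                return 2
    return 4                             # in genome, none of the above


gene = Gene("+", [(100, 200), (300, 400)])
```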
Core Arabidopsis MPSS libraries
sequenced by Lynx for Blake Meyers, U. of Delaware
Library    Signatures sequenced    Distinct signatures
Root       3,645,414               48,102
Shoot      2,885,229               53,396
Flower     1,791,460               37,754
Callus     1,963,474               40,903
Silique    2,018,785               38,503
TOTAL      12,304,362              133,377
http://www.dbi.udel.edu/mpss
Query by
• Sequence
• Arabidopsis gene identifier
• chromosomal position
• BAC clone ID
• MPSS signature
• Library comparison
Site includes
• Library and tissue information
• FAQs and help pages
Genome-wide MPSS profile in Arabidopsis
[Figure: MPSS expression profile plotted along Chr. I, II, III, IV, and V]
Of the 29,084 gene models,
17,849 match unambiguous, expressed class 1 and/or 2 signatures
Dataset of duplicate pairs
Arabidopsis gene families of size 2 classified as
• Dispersed (280)
• Segmental (149)
• Tandem (63)
For each pair
• Measured similarity/distance in expression profile
• Estimated silent (Ks) and replacement (Ka) changes
Expression distance
[Figure: expression profiles of a gene pair plotted in a space with one axis per library (library 1, library 2, library 3)]
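The slide does not state which distance measure was used. As an illustration only, the sketch below assumes each gene's counts are converted to relative expression across libraries and that two genes are compared by Euclidean distance; both choices and the function names are assumptions.

```python
import math


def relative_profile(counts):
    """Convert per-library counts to fractions of the gene's total expression."""
    total = sum(counts)
    if total == 0:
        raise ValueError("gene is not expressed in any library")
    return [c / total for c in counts]


def expression_distance(counts_a, counts_b):
    """Euclidean distance between two relative expression profiles."""
    pa, pb = relative_profile(counts_a), relative_profile(counts_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))


same_shape = expression_distance([10, 20, 30], [1, 2, 3])
opposite = expression_distance([5, 0, 0], [0, 7, 0])
```

Under this choice, two copies with the same tissue pattern but different overall levels have distance 0, so the measure captures where a gene is expressed rather than how strongly.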
Major findings
Many pairs are divergent in sequence
but not expression and vice versa
Pairs have atypically high expression
• Especially slowly evolving pairs
Divergence increases with Ka,
• Particularly among S duplicates!
• Divergence tends to be highly asymmetric
Expression level >5 ppm in x libraries
Libraries    Genes in pairs    All genes
0            153 (15.5%)       4160 (23.3%)
1            124 (12.6%)       2643 (14.8%)
2            73 (7.4%)         1727 (9.6%)
3            93 (9.5%)         1777 (10.0%)
4            109 (11.1%)       1930 (10.8%)
5            432 (43.9%)       5612 (31.4%)
dN =0.48+0.37 KA, p<0.0001
Asymmetric divergence
Type of Pair               A             B              C             D
___________________________________________________
Young
  Dispersed (Ks ≤ 0.5)     14 (15.7%)    61 (68.5%)     8 (9.0%)      6 (6.7%)
  Tandem (Ks ≤ 0.5)        8 (14.3%)     29 (51.8%)     10 (17.9%)    9 (16.1%)
Old
  Dispersed (Ks > 0.5)     35 (18.3%)    111 (58.1%)    24 (12.6%)    21 (11.0%)
  Segmental (All)          31 (20.8%)    104 (69.8%)    7 (4.7%)      7 (4.7%)
A: Each copy has higher expression in at least one library
B: One copy has higher expression in all libraries that differ and at least two libraries differ
C: Copies differ in expression in only one library
D: Copies do not differ in expression in any libraries
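The four categories follow mechanically once each library has been scored. In this sketch the input is assumed to be one value per library: +1 if copy 1 is higher, -1 if copy 2 is higher, 0 if the copies do not differ. How a difference is called in each library is not stated on the slide and is taken as given; the function name is hypothetical.

```python
def divergence_category(signs):
    """Assign a duplicate pair to category A, B, C, or D.

    signs: one entry per library; +1 (copy 1 higher), -1 (copy 2 higher),
    or 0 (no difference).
    """
    up = sum(1 for s in signs if s > 0)
    down = sum(1 for s in signs if s < 0)
    if up and down:
        return "A"   # each copy is higher in at least one library
    if up + down >= 2:
        return "B"   # one copy higher everywhere they differ (2+ libraries)
    if up + down == 1:
        return "C"   # differ in exactly one library
    return "D"       # no differences


examples = {
    "A": [1, -1, 0, 0, 0],
    "B": [1, 1, 0, 0, 0],
    "C": [0, 0, -1, 0, 0],
    "D": [0, 0, 0, 0, 0],
}
```

Category B corresponds to the asymmetric case highlighted in the findings: one copy is consistently the more highly expressed.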
Why put gene family evolution
into a chromosomal context?
We can begin to understand and utilize
patterns of evolution in gene order
We can gain insight into the function
and evolution of gene families that are
not apparent from beanbag genomics
Thanks to:
Zongli Xu
David Remington
Jason Reed
Tom Guilfoyle
Blake Meyers
NSF