Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana
Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#
* Ghent University
# VIB (Flanders Institute for Biotechnology)
IAP workshop, Ghent, Sept. 18th, 2008
Overview
• Genetic background
• Objectives
• Data
• Methodology
• Results
• Conclusions
Genetic background
• Regulation of gene expression is affected either in:
- Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
- Trans: affecting the expression of both alleles in a heterozygous individual.
Genetic background
• Why search for cis-regulatory variants?
- "Low-hanging fruit": the search window is a small genomic region.
- Fast screening for markers in LD with the expression trait.
• How to search for cis-regulatory variants?
- Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006).
- Based on a diallel design, which is very popular in plant breeding for estimating GCA (general combining ability) and SCA (specific combining ability).
Genetic background
• What is the GASED approach?
• The expression of a gene in the kth offspring (an F1 hybrid) of the cross i × j can be written as (c: cis-element, t: trans-element):
$$y_{ijk} = \mu + c_i + ct_{ii} + c_j + ct_{jj} + ct_{ij} + ct_{ji} + \varepsilon_{ijk}$$
- From parent i: $c_i + ct_{ii}$
- From parent j: $c_j + ct_{jj}$
- From both (cross-terms): $ct_{ij} + ct_{ji}$
• Genotypic variation:
$$y_{ijk} = \mu + gca_i + gca_j + sca_{ij} + \varepsilon_{ijk}$$
- In the homozygous case: $gca_i$, $gca_j$
- In case there is a cis-effect, a cis-regulatory divergence completely explains the difference between the two parental lines: $gca_i \neq gca_j$
- In case there is no trans-effect: $sca_{ij} = 0$
Objectives of this study
• Use mixed model analysis to discover cis-regulated Arabidopsis genes.
• Based on the GASED approach, partition the between-F1-hybrid genotypic variation for mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.
• Find the associated haplotypes (sets of SNPs) for these selected cis-regulated genes.
• Carry out systematic surveys of cis-regulatory variation to identify "superior alleles".
Flow chart
Data contains all expressed genes (25527 genes)
Step I: Choose genes with significant genotypic variation: $\sigma^2_{genotype} > 0$
Step II: Choose genes from Step I with no trans-regulatory variation: $\sigma^2_{sca_{ij}} = 0$
Step III: Choose genes from Step II displaying significant allelic imbalance due to cis-regulatory variation: $gca_i \neq gca_j$
Step IV: Choose genes from Step III showing significant association with the haplotype blocks found: $\beta_{SNP_i} \neq 0$
Data
Data acquisition:
1) Scan the arrays
2) Quantitate each spot
3) Subtract background noise
4) Normalize
5) Export table
Data for us to analyze
Methodology - Step I
Mixed-model equations
Gene X (expression values):
Full model: $y_{klnm} = \mu + dye_k + replicate_l + genotype_n + array_m + error_{klnm}$
- Fixed effects: $\mu$, $dye_k$, $replicate_l$
- Random effects: $genotype_n$, $array_m$
- Residual: $error_{klnm}$
Reduced model: $y_{klnm} = \mu + dye_k + replicate_l + array_m + error_{klnm}$
error ~ $N(0, \Sigma_e)$, $\Sigma_e = I_{220}\sigma^2_e$; array ~ $N(0, \Sigma_a)$, $\Sigma_a = I_{110}\sigma^2_a$
genotype ~ $N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$
K = 55 × 55 marker-based relatedness matrix, calculated as $1 - d_R$, where $d_R$ is Rogers' distance (Rogers, 1972; Reif et al., 2005)
Methodology - Step I
Mixed-model equations
K = 55 × 55 marker-based relatedness matrix:
$$d_R = \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{1}{2} \sum_{j=1}^{n_i} (p_{ij} - q_{ij})^2}$$
Rogers (1972); Reif et al. (2005)
$d_R \in [0, 1]$
$d_R(F_1, P_1) = d_R(F_1, P_2) = d_R(P_1, P_2) / 2$ (Melchinger et al., 1991)
$p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus
$n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)
m is the number of loci (i.e. m = 210,205)
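The distance and the resulting relatedness entry can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `rogers_distance`, the toy genotypes `P1` and `P2`, and the assumption that the F1 allele frequencies are the average of the two parents are all hypothetical choices made for the example.

```python
import math

def rogers_distance(p, q):
    """Rogers' distance between two genotypes.

    p and q hold one entry per locus; each entry is the list of allele
    frequencies at that locus (two alleles per locus in this example).
    """
    m = len(p)
    total = 0.0
    for p_i, q_i in zip(p, q):
        # Per-locus term: sqrt(1/2 * sum_j (p_ij - q_ij)^2)
        total += math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    return total / m

# Hypothetical toy data: two homozygous parents scored at 4 biallelic loci.
P1 = [[1, 0], [1, 0], [0, 1], [1, 0]]
P2 = [[1, 0], [0, 1], [1, 0], [1, 0]]
# Assumption: the F1 carries one allele from each parent, so its allele
# frequencies are the average of the parental frequencies.
F1 = [[(a + b) / 2 for a, b in zip(x, y)] for x, y in zip(P1, P2)]

d_parents = rogers_distance(P1, P2)
d_f1_p1 = rogers_distance(F1, P1)
d_f1_p2 = rogers_distance(F1, P2)
k_parents = 1 - d_parents  # relatedness entry, K = 1 - d_R
```

With these toy genotypes the two parents differ at half of the loci, and the F1 sits halfway between them, which is the relation attributed to Melchinger et al. (1991) on the slide.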
Methodology - Step I
Multiple testing correction
Gene X: $H_0: \sigma^2_g = 0$ vs $H_a: \sigma^2_g > 0$
Likelihood ratio test (REML): LRT ~ $0.5\chi^2(0) + 0.5\chi^2(1)$
25527 genes → p-value → adjusted q-value (FDR)
FDR: false discovery rate
- How many of the called positives are false?
- 5% FDR means 5% of calls are false positives.
John Storey et al. (2002): q-value to represent FDR
- Estimate the proportion of features that are truly null: $\pi_0$
$$qval = \frac{m \hat{\pi}_0 t}{\#(pval \le t)}$$
We use the adjusted q-value to represent FDR.
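As a rough sketch of how a p-value follows from the mixture reference distribution above: the $\chi^2(0)$ component is a point mass at zero, so for a positive statistic only the $\chi^2(1)$ component contributes. The function names below are hypothetical, and the snippet uses only the standard library (the $\chi^2(1)$ tail is written with the complementary error function).

```python
import math

def chi2_1_sf(x):
    """P(chi-square with 1 df > x), via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))

def mixture_pvalue(lrt):
    """p-value of an LRT statistic under 0.5*chi2(0) + 0.5*chi2(1).

    Negative statistics (numerical noise) are clamped to zero. A statistic
    of exactly zero gets p = 0.5 under this convention.
    """
    lrt = max(lrt, 0.0)
    return 0.5 * chi2_1_sf(lrt)

p_zero = mixture_pvalue(0.0)
p_example = mixture_pvalue(2.705543454095404)  # 90% quantile of chi2(1)
```

Under this convention no p-value exceeds 0.5, which is why the usual uniform-on-(0,1) assumption for null p-values does not hold for this test.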
Methodology - Step I
Multiple testing correction
Storey et al. estimate $\hat{\pi}_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1):
$$qvalue = \frac{m \hat{\pi}_0 t}{\#(pvalue \le t)}, \quad t \in (0, 1)$$
We estimate $\hat{\pi}_{0\_adj} = m_0/m$ under the assumption that 50% of the true null p-values are uniformly distributed on (0, 0.5) and 50% are exactly 0.5:
$$adjusted\_qvalue = \frac{m \hat{\pi}_{0\_adj} t}{\#(pvalue \le t)}, \quad t \in (0, 0.5)$$
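The q-value formula at a fixed threshold t can be sketched as follows. The slides do not say how $\pi_0$ is estimated in practice, so the sketch takes `pi0` as an input; the function name `qvalue_at` and the toy p-values are hypothetical.

```python
def qvalue_at(pvalues, t, pi0):
    """Estimated FDR when every gene with p <= t is called significant.

    Implements m * pi0 * t / #(p <= t), capped at 1. pi0 is supplied by
    the caller because its estimation is not specified here.
    """
    m = len(pvalues)
    called = sum(1 for p in pvalues if p <= t)
    if called == 0:
        return 0.0  # nothing called, so no false discoveries
    return min(1.0, m * pi0 * t / called)

# Hypothetical toy p-values (several null genes sit exactly at 0.5).
toy_p = [0.001, 0.002, 0.004, 0.01, 0.2, 0.3, 0.5, 0.5, 0.5, 0.5]
fdr_est = qvalue_at(toy_p, t=0.01, pi0=0.6)
```

For the adjusted version, the same function would be called with the adjusted estimate of $\pi_0$ and a threshold restricted to (0, 0.5).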
Methodology - Step II
Mixed-model equations
Gene X (expression values):
Full model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + sca_{ij} + array_m + error_{klijm}$
- Fixed effects: $\mu$, $dye_k$, $replicate_l$
- Random effects: $gca_i$, $gca_j$, $sca_{ij}$, $array_m$
- Residual: $error_{klijm}$
$$\Sigma_{genotype} = K\sigma^2_g = K(\sigma^2_{gca_i} + \sigma^2_{gca_j} + \sigma^2_{sca_{ij}})$$
$$= LL^T (I_{1..10}, I_{1..45})(\sigma^2_{gca}, \sigma^2_{sca})$$
$$= L (I_{1..10}, I_{1..45})(\sigma^2_{gca}, \sigma^2_{sca}) L^T$$
L is the Cholesky factor of K ($K = LL^T$)
Reduced model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$
Methodology - Step II
Multiple testing correction
Gene X: $H_0: \sigma^2_{sca} = 0$ vs $H_a: \sigma^2_{sca} > 0$
Likelihood ratio test (REML): LRT ~ $0.5\chi^2(0) + 0.5\chi^2(1)$
20976 genes → p-value → qa-value (FNR)
• FNR: false non-discovery rate (Genovese et al., 2002)
- How many of the called negatives are false?
- 5% FNR means 5% of calls are false negatives.
• Since we are interested in selecting genes that test negative for an $sca_{ij}$ effect, we control FNR instead of FDR.
We use the qa-value to represent FNR.
Methodology - Step II
Multiple testing correction
False non-discovery rate (FNR):
$$FNR = E\left[\frac{T}{m - R} \,\middle|\, (m - R) > 0\right] \Pr\big((m - R) > 0\big)$$
$$qaval = 1 - \frac{m \hat{\pi}_0 (1 - t)}{\#(pval > t)}$$
$\hat{\pi}_0$ is the estimate of the proportion of features that are truly null
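The qa-value at a fixed threshold t mirrors the q-value, but counts the genes that are not called. The sketch below follows the formula as reconstructed from the slide; `qa_value_at`, the toy p-values, and the choice to clip the estimate to [0, 1] are assumptions for illustration, and `pi0` is again taken as an input.

```python
def qa_value_at(pvalues, t, pi0):
    """Estimated FNR when every gene with p > t is called negative.

    Implements 1 - m * pi0 * (1 - t) / #(p > t), clipped to [0, 1].
    """
    m = len(pvalues)
    not_called = sum(1 for p in pvalues if p > t)
    if not_called == 0:
        return 0.0  # no negatives called, so no false negatives
    estimate = 1.0 - m * pi0 * (1.0 - t) / not_called
    return min(1.0, max(0.0, estimate))

# Hypothetical toy p-values.
toy_p = [0.001, 0.002, 0.004, 0.01, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9]
fnr_est = qa_value_at(toy_p, t=0.1, pi0=0.5)
```

Here six genes have p > 0.1, of which an expected 4.5 are truly null, so roughly a quarter of the called negatives are estimated to be false.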
Methodology - Step III
Mixed-model equations
Model: $y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$
Gene X: test 45 pairs of gca effects, $gca_i = gca_j$?
g1 = g2? g1 = g3? g1 = g4? … g1 = g10?
g2 = g3? g2 = g4? g2 = g5? … g2 = g10? …… g9 = g10?
Two-sample dependent t-test:
$$standard\_t = \frac{\hat{g}_1 - \hat{g}_2}{SE(\hat{g}_1 - \hat{g}_2)}$$
$$non\text{-}standard\_t = \frac{(\hat{g}_1 - g_1) - (\hat{g}_2 - g_2)}{SE\big((\hat{g}_1 - g_1) - (\hat{g}_2 - g_2)\big)}$$
$\hat{g}_1$ is the BLUP of $g_1$, $\hat{g}_2$ is the BLUP of $g_2$
Non-standard p-value: the distribution of true null p-values is not uniform on (0, 1).
Methodology - Step III
Multiple testing correction
Gene X: $H_0: gca_i = gca_j$ vs $H_a: gca_i \neq gca_j$
Two-sample t-test on BLUPs
Simulate the $H_0$ distribution from the real data: simulation-based p-value
1380 genes → simulation-based p-value → q-value (FDR)
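The pairwise comparisons and a simulation-based p-value can be sketched as below. The slides do not describe how the null distribution is simulated, so the sketch simply takes a list of simulated null statistics as given; the function name `empirical_pvalue`, the two-sided comparison, and the add-one correction are assumptions, not the authors' stated procedure.

```python
from itertools import combinations

# The 10 parental lines give 10 * 9 / 2 = 45 unordered pairs to compare.
parents = range(1, 11)
pairs = list(combinations(parents, 2))  # (1, 2), (1, 3), ..., (9, 10)

def empirical_pvalue(t_obs, t_null):
    """Two-sided p-value of t_obs against simulated null statistics.

    Uses the common (exceed + 1) / (B + 1) correction so that the
    estimate is never exactly zero.
    """
    exceed = sum(1 for t in t_null if abs(t) >= abs(t_obs))
    return (exceed + 1) / (len(t_null) + 1)

# Hypothetical simulated null statistics for one pair of gca effects.
p_sim = empirical_pvalue(2.5, [0.1, -0.4, 3.0, -1.2])
```

Because the p-value is read off the simulated null distribution itself, it does not rely on the t reference distribution, which the previous slide notes is not appropriate for BLUPs.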
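Methodology - Step IV
Mixed-model equations
Gene X (cis-regulated):
Full model: $y_{klim} = \mu + dye_k + replicate_l + \sum_i \beta_{SNP_i} \cdot SNP_i + genotype_i + array_m + error_{klim}$
- Fixed effects: $\mu$, $dye_k$, $replicate_l$, $\beta_{SNP_i} \cdot SNP_i$
- Random effects: $genotype_i$, $array_m$
- Residual: $error_{klim}$
Tag SNPs along the chromosome around the gene: SNP1, SNP2, SNP3, …, SNPi
genotype ~ $N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$
K = 55 × 55 marker-based relatedness matrix.
array ~ $N(0, \Sigma_a)$, $\Sigma_a = I_{110}\sigma^2_a$; error ~ $N(0, \Sigma_e)$, $\Sigma_e = I_{220}\sigma^2_e$
Reduced model: $y_{klim} = \mu + dye_k + replicate_l + genotype_i + array_m + error_{klim}$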
Methodology - Step IV
Multiple testing correction
Gene X (cis-regulated):
$H_0: \beta_{SNP_1} = \beta_{SNP_2} = \ldots = \beta_{SNP_i} = 0$
$H_a$: at least one $\beta_{SNP_i} \neq 0$
Likelihood ratio test (ML): LRT ~ $\chi^2(2n)$, where n is the number of SNPs
836 genes → p-value → q-value (FDR)
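Taking the $\chi^2(2n)$ reference distribution on the slide at face value, the p-value has a closed form because the degrees of freedom are even: $P(\chi^2_{2n} > x) = e^{-x/2} \sum_{k=0}^{n-1} (x/2)^k / k!$. A minimal standard-library sketch, with the hypothetical function name `chi2_sf_even_df`:

```python
import math

def chi2_sf_even_df(x, n):
    """P(chi-square with 2n degrees of freedom > x), for integer n >= 1."""
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(n):
        total += term            # add (x/2)^k / k!
        term *= half / (k + 1)   # advance to the next term of the series
    return math.exp(-half) * total

# Example: an LRT statistic of 9.487729 for a gene with n = 2 tag SNPs.
p_lrt = chi2_sf_even_df(9.487729, n=2)
```

The example statistic is the 95% quantile of $\chi^2(4)$, so the resulting p-value is approximately 0.05.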
Results
Data contains all expressed genes (25527 genes)
Step I: $\sigma^2_{genotype} > 0$ (adjusted q-value < 0.0005) → 20979 genes
Step II: $\sigma^2_{sca_{ij}} = 0$ (adjusted qa-value < 0.01) → 1328 genes
Step III: $gca_i \neq gca_j$ (q-value < 0.01) → 972 genes
Step IV: $\beta_{SNP_i} \neq 0$ (q-value < 0.01) → 859 genes
Results
• Among all 25527 genes, 20979 genes have significant genotypic variation (q-value < 0.0005). (Step I)
• Among these 20979 genes, 1328 genes show no trans-regulatory effect (qa-value < 0.01). (Step II)
• Among these 1328 genes, 972 genes show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
• We confirm these 972 cis-regulated genes in Step IV:
- an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP) in LD that controls expression;
- we indeed found that 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.
Conclusions
• The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (more reliable p-values).
• The statistical method at each step is controlled more accurately to specify statistical significance (with respect to FDR and FNR).
• Using simulation-based p-values when testing differences between random effects increases the power to detect association.
• A comprehensive analysis of gene expression variation in plant populations has been described.
• Using this mixed-model analysis strategy, a detailed characterization of both the genetic and the positional effects in the genome is provided.
• This detailed statistical analysis provides a robust and useful framework for the future analysis of gene expression variation in large samples.
Advanced statistical methods look promising for making interesting discoveries in genetics.
Many thanks for your attention!