Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana

Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#
* Ghent University
# VIB (Flanders Institute for Biotechnology)

IAP workshop, Ghent, Sept. 18th, 2008

Overview

- Genetic background
- Objectives
- Data
- Methodology
- Results
- Conclusions

Genetic background

Regulation of gene expression is affected either in:
- Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
- Trans: affecting the expression of both alleles in a heterozygous individual.

Genetic background

Why search for cis-regulatory variants?
- "Low-hanging fruit": the search window is a small genomic region.
- Fast screening for markers in linkage disequilibrium (LD) with the expression trait.

How to search for cis-regulatory variants?
- Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006).
- GASED is based on a diallel design, which is very popular in plant breeding for estimating GCA (general combining ability) and SCA (specific combining ability).

Genetic background

What is the GASED approach?

The expression of a gene in an F1 hybrid, for the kth offspring of the cross i × j, can be written as (c: cis-element, t: trans-element):

$$y_{ijk} = c_i + ct_{ii} + c_j + ct_{jj} + ct_{ij} + ct_{ji} + \varepsilon_{ijk}$$

- $c_i + ct_{ii}$: from parent i
- $c_j + ct_{jj}$: from parent j
- $ct_{ij} + ct_{ji}$: from both parents (cross-terms)

Together these terms make up the genotypic variation. In case the parents are homozygous:

$$y_{ijk} = gca_i + gca_j + sca_{ij} + \varepsilon_{ijk}$$

- In case there is no trans-effect: $sca_{ij} = 0$.
- In case there is a cis-effect: $gca_i \neq gca_j$. A cis-regulatory divergence then completely explains the difference between the two parental lines.

Objectives of this study

- Use mixed-model analysis to discover cis-regulated Arabidopsis genes.
- Based on the GASED approach, partition the F1 hybrid genotypic variation in mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.
- Find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.
- Survey cis-regulatory variation systematically to identify "superior alleles".

Flow chart

Data contains all expressed genes (25527 genes).
- Step I: Choose genes with significant genotypic variation: $\sigma^2_{genotype} > 0$
- Step II: Choose genes from Step I with no trans-regulatory variation: $\sigma^2_{sca_{ij}} = 0$
- Step III: Choose genes from Step II displaying significant allelic imbalance due to cis-regulatory variation: $gca_i \neq gca_j$
- Step IV: Choose genes from Step III showing significant association with the haplotype blocks found: $\beta_{SNP_i} \neq 0$

Data

Data acquisition:
1) Scan the arrays
2) Quantitate each spot
3) Subtract background noise
4) Normalize
5) Export table

The exported table is the data we analyze.

Methodology - Step I
Mixed-Model Equations

Gene X: expression values

Full model:
$$y_{klnm} = \mu + dye_k + replicate_l + genotype_n + array_m + error_{klnm}$$

Reduced model:
$$y_{klnm} = \mu + dye_k + replicate_l + array_m + error_{klnm}$$

- Fixed effects: $dye_k$, $replicate_l$
- Random effects: $genotype_n$, $array_m$
- Residual: $error_{klnm}$

Distributional assumptions:
- $error \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$
- $array \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$
- $genotype \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$

K is the 55 × 55 marker-based relatedness matrix, calculated as $1 - d_R$, where $d_R$ is Rogers' distance (Rogers, 1972; Reif et al., 2005).

Methodology - Step I
Mixed-Model Equations

K = 55 × 55 marker-based relatedness matrix, with Rogers' distance (Rogers, 1972; Reif et al., 2005):

$$d_R = \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{1}{2} \sum_{j=1}^{n_i} (p_{ij} - q_{ij})^2}, \qquad d_R \in [0, 1]$$

$$d_R(F_1, P_1) = d_R(F_1, P_2) = d_R(P_1, P_2)/2 \quad \text{(Melchinger et al., 1991)}$$

- $p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus
- $n_i$ is the number of alleles at the ith locus (here $n_i = 2$)
- $m$ is the number of loci (here $m = 210{,}205$)

Methodology - Step I
Multiple testing correction

Gene X: $H_0: \sigma^2_g = 0$ vs $H_a: \sigma^2_g > 0$

Likelihood ratio test (REML): $LRT \sim 0.5\,\chi^2(0) + 0.5\,\chi^2(1)$

25527 genes: p-value, then adjusted q-value (FDR)

FDR: false discovery rate
- How many of the called positives are false?
- 5% FDR means 5% of the calls are false positives.

Storey et al. (2002) use the q-value to represent FDR. Estimate the proportion of features that are truly null, $\pi_0$:

$$\hat{q}val(t) = \frac{\hat{\pi}_0 \, m \, t}{\#\{pval \le t\}}$$

We use the adjusted q-value to represent FDR.

Methodology - Step I
Multiple testing correction

Storey et al. estimate $\hat{\pi}_0 = \hat{m}_0 / m$ under the assumption that the true null p-values are uniformly distributed on (0, 1):

$$\hat{q}value(t) = \frac{\hat{m}_0 \, t}{\#\{pvalue \le t\}}, \qquad t \in (0, 1)$$

We estimate $\hat{\pi}_{0,adj} = \hat{m}_{0,adj} / m$ under the assumption that 50% of the true null p-values are uniformly distributed on (0, 0.5) and the other 50% are exactly 0.5:

$$adjusted\ \hat{q}value(t) = \frac{\hat{m}_{0,adj} \, t}{\#\{pvalue \le t\}}, \qquad t \in (0, 0.5)$$

Methodology - Step II
Mixed-Model Equations

Gene X: expression values

Full model:
$$y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + sca_{ij} + array_m + error_{klijm}$$

Reduced model:
$$y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$$

- Fixed effects: $dye_k$, $replicate_l$
- Random effects: $gca_i$, $gca_j$, $sca_{ij}$, $array_m$
- Residual: $error_{klijm}$

Genotypic covariance:

$$\Sigma_{genotype} = K\sigma^2_g = K(\sigma^2_{gca_i} + \sigma^2_{gca_j} + \sigma^2_{sca_{ij}}) = LL^T (I_{10}, I_{45})(\sigma^2_{gca}, \sigma^2_{sca}) = L\,(I_{10}, I_{45})(\sigma^2_{gca}, \sigma^2_{sca})\,L^T$$

where L is the Cholesky factor of K ($K = LL^T$).
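Two Step I ingredients, the relatedness entries 1 - d_R and the p-value under the 50:50 mixture null for the likelihood ratio test, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the nested-list layout for allele frequencies are assumptions made here, and fitting the REML models that produce the LRT statistic is not shown. The mixture p-value is taken as half the chi-square(1) tail probability, which equals 0.5 when the statistic is zero, in line with the stated null distribution of the p-values.

```python
import math


def rogers_distance(p, q):
    """Rogers' distance between two genotypes.

    p and q are lists with one entry per locus; each entry is the list of
    allele frequencies at that locus (hypothetical layout for this sketch).
    """
    if len(p) != len(q) or not p:
        raise ValueError("p and q must cover the same non-empty set of loci")
    total = 0.0
    for p_i, q_i in zip(p, q):
        # Per-locus term: sqrt(1/2 * sum_j (p_ij - q_ij)^2)
        total += math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    # Average over the m loci
    return total / len(p)


def relatedness(p, q):
    """Entry of the relatedness matrix K, computed as 1 - d_R."""
    return 1.0 - rogers_distance(p, q)


def lrt_mixture_pvalue(lrt):
    """P-value of an LRT statistic under 0.5*chi2(0) + 0.5*chi2(1).

    Uses P(chi2_1 > x) = erfc(sqrt(x / 2)); negative statistics that arise
    from numerical noise are treated as zero.
    """
    lrt = max(lrt, 0.0)
    return 0.5 * math.erfc(math.sqrt(lrt / 2.0))


if __name__ == "__main__":
    # Two homozygous parents differing at the first of two biallelic loci,
    # and their F1 hybrid (allele frequency 0.5 at the differing locus).
    parent1 = [[1.0, 0.0], [1.0, 0.0]]
    parent2 = [[0.0, 1.0], [1.0, 0.0]]
    f1 = [[0.5, 0.5], [1.0, 0.0]]
    print(rogers_distance(parent1, parent2), rogers_distance(f1, parent1))
    print(lrt_mixture_pvalue(0.0), lrt_mixture_pvalue(3.84))
```

In the toy example the hybrid's distance to either parent is half the distance between the parents, which is the relation attributed to Melchinger et al. (1991). Only the standard library is used; with SciPy available, `scipy.stats.chi2.sf(lrt, 1)` could replace the `erfc` expression.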
Methodology - Step II
Multiple testing correction

Gene X: $H_0: \sigma^2_{sca} = 0$ vs $H_a: \sigma^2_{sca} > 0$

Likelihood ratio test (REML): $LRT \sim 0.5\,\chi^2(0) + 0.5\,\chi^2(1)$

20976 genes: p-value, then qa-value (FNR)

FNR: false non-discovery rate (Genovese et al., 2002)
- How many of the called negatives are false?
- 5% FNR means 5% of the calls are false negatives.

Since we are interested in selecting the genes called negative, that is, genes with no $sca_{ij}$ effect, we control FNR instead of FDR. We use the qa-value to represent FNR.

Methodology - Step II
Multiple testing correction

False non-discovery rate (FNR):

$$FNR = E\left[\frac{T}{m - R} \,\middle|\, (m - R) > 0\right] \Pr\big((m - R) > 0\big)$$

$$\hat{q}a val(t) = 1 - \frac{m \, \hat{\pi}_0 \, (1 - t)}{\#\{pval > t\}}$$

where $\hat{\pi}_0$ is the estimate of the proportion of features that are truly null.

Methodology - Step III
Mixed-Model Equations

Gene X model:
$$y_{klijm} = \mu + dye_k + replicate_l + gca_i + gca_j + array_m + error_{klijm}$$

Test all 45 pairs for $gca_i = gca_j$: $g_1 = g_2$? $g_1 = g_3$? $g_1 = g_4$? ... $g_1 = g_{10}$? $g_2 = g_3$? $g_2 = g_4$? $g_2 = g_5$? ... $g_2 = g_{10}$? ... $g_9 = g_{10}$?

Two-sample dependent t-test:

$$standard\ t = \frac{\hat{g}_1 - \hat{g}_2}{SE(\hat{g}_1 - \hat{g}_2)}$$

$$non\text{-}standard\ t = \frac{(\hat{g}_1 - g_1) - (\hat{g}_2 - g_2)}{SE\big((\hat{g}_1 - g_1) - (\hat{g}_2 - g_2)\big)}$$

where $\hat{g}_1$ is the BLUP of $g_1$ and $\hat{g}_2$ is the BLUP of $g_2$.

Non-standard p-value: the distribution of the true null p-values is not uniform on (0, 1).

Methodology - Step III
Multiple testing correction

Gene X: $H_0: gca_i = gca_j$ vs $H_a: gca_i \neq gca_j$

Two-sample t-test on the BLUPs. The $H_0$ distribution is simulated from the real data, giving a simulation-based p-value.

1380 genes: simulation-based p-value, then q-value (FDR)

Methodology - Step IV
Mixed-Model Equations

Gene X (cis-regulated)

Full model:
$$y_{klim} = \mu + dye_k + replicate_l + \sum_i \beta_{SNP_i} \cdot SNP_i + genotype_i + array_m + error_{klim}$$

Reduced model:
$$y_{klim} = \mu + dye_k + replicate_l + genotype_i + array_m + error_{klim}$$

- Fixed effects: $dye_k$, $replicate_l$, $\beta_{SNP_i} \cdot SNP_i$
- Random effects: $genotype_i$, $array_m$
- Residual: $error_{klim}$

[Diagram: tag SNPs (SNP1, SNP2, SNP3, ..., SNPi) positioned along the chromosome near the gene.]

Distributional assumptions:
- $genotype \sim N(0, \Sigma_{genotype})$, $\Sigma_{genotype} = G = K\sigma^2_g$; K is the 55 × 55 marker-based relatedness matrix
- $array \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$
- $error \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$

Methodology - Step IV
Multiple testing correction

Gene X (cis-regulated):
$H_0: \beta_{SNP_1} = \beta_{SNP_2} = \dots = \beta_{SNP_i} = 0$
$H_a$: at least one $\beta_{SNP_i} \neq 0$

Likelihood ratio test (ML): $LRT \sim \chi^2(2n)$, where n is the number of SNPs

836 genes: p-value, then q-value (FDR)

Results

Data contains all expressed genes (25527 genes).
- Step I: $\sigma^2_{genotype} > 0$, adjusted q-value < 0.0005: 20979 genes
- Step II: $\sigma^2_{sca_{ij}} = 0$, adjusted qa-value < 0.01: 1328 genes
- Step III: $gca_i \neq gca_j$, q-value < 0.01: 972 genes
- Step IV: $\beta_{SNP_i} \neq 0$, q-value < 0.01: 859 genes

Results

- Among all 25527 genes, 20979 genes have significant genotypic variation (q-value < 0.0005). (Step I)
- Among these 20979 genes, 1328 genes show no trans-regulatory effect (qa-value < 0.01). (Step II)
- Among these 1328 genes, 972 genes show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
- Step IV confirms the discovery for these 972 cis-regulated genes: an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP), in LD, that controls expression. We indeed found that 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.

Conclusions

- The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (the p-values are more reliable).
- At each step, statistical significance is controlled with an error measure suited to that step (FDR or FNR).
- Using simulation-based p-values when testing the difference between random effects increases the power to detect association.
- A comprehensive analysis of gene expression variation in plant populations has been described.
- Using this mixed-model analysis strategy, a detailed characterization of both the genetic and the positional effects in the genome is provided.
- This detailed statistical analysis provides a robust and useful framework for the future analysis of gene expression variation in large sample sizes.
- Advanced statistical methods look promising for identifying interesting discoveries in genetics.

Many thanks for your attention!