
Modeling Splice Site and Transcription Factor Binding Site Variation by Information Theory

Peter K. Rogan, Ph.D.
St. Jude’s Children’s Research Hospital, Memphis, TN
May 15, 2003

Background

• Information theory provides general solutions to the problem of how to recognize members of a group of related nucleic acid (or protein) sequences.
• The average information of a related set of sequences, Rsequence, represents the total sequence conservation:

$$R_{sequence} = \sum_{l} \left\{ 2 - \left[ -\sum_{b=a}^{t} f(b,l)\,\log_2 f(b,l) + e(n(l)) \right] \right\}$$

where f(b,l) is the frequency of each base b at position l, and e(n(l)) is a correction for the small sample size n at position l.

Schneider et al. J. Mol. Biol. 1984

Sequence Logo

Conservation and diversity among related binding sites can be visualized using a sequence logo. The area under the logo is Rsequence, the average information of the binding site.

Definition of Individual Information

• The individual information, Ri, of a single member of a sequence family is the dot product of that sequence vector and a weight matrix, Riw(b,l) (also written Ri(b,l)), based on the base frequencies at each position of the sequence:

$$R_i(j) = \sum_{l} \sum_{b=a}^{t} s(b,l,j)\, R_{iw}(b,l) \quad \text{(bits per site } j\text{)}$$

where s(b,l,j) is 1 if base b occurs at position l of sequence j and 0 otherwise.

Distribution of Individual Information for related binding sites

The average of the set of Ri values for a family of sequences is Rsequence.

Second law of thermodynamics

$$-k_B T \ln 2 \ge \frac{q}{R}$$

q: heat dissipated; T: temperature; R: information

• q > 0 => R < 0: HLH protein with a DNA mutation or an unrelated sequence
• q < 0 => R > 0: HLH protein bound to WT DNA

Sequence Walker Definition

Among related sequences having a common function, functional sites can be distinguished from non-sites with the sequence walker.

[Figure: sequence walker for the E. coli Fis protein; vertical scale in bits, from -4 to 2]

Sequence Walker Application I

The matrix can be scanned along a “test sequence” until...

[Figure: sequence walker; vertical scale in bits, from -4 to 2]

Ri = -6.7 bits at position 179 of the sequence. The Z score is -5.4.
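The scan described in the sequence walker slides can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Delila implementation: the weight-matrix form Riw(b,l) = 2 - (-log2 f(b,l) + e(n(l))), the approximation e(n) = 3 / (2 ln 2 · n), the pseudocount used to avoid log2(0), the toy aligned sites, and every function name are hypothetical choices made for this example and do not come from the slides.

```python
import math

BASES = "acgt"

def small_sample_correction(n):
    # Assumed approximation of e(n) for a 4-letter alphabet: (4 - 1) / (2 ln 2 * n).
    return 3.0 / (2.0 * math.log(2) * n)

def weight_matrix(sites, pseudocount=0.5):
    """Build an assumed Riw(b,l) matrix from aligned, equal-length sites."""
    n = len(sites)
    e_n = small_sample_correction(n)
    matrix = []
    for l in range(len(sites[0])):
        column = [site[l] for site in sites]
        weights = {}
        for b in BASES:
            # The pseudocount keeps f(b,l) > 0 (an assumption of this sketch).
            f = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            weights[b] = 2.0 - (-math.log2(f) + e_n)
        matrix.append(weights)
    return matrix

def individual_information(seq, matrix):
    """Ri of one sequence: sum of the matrix weights selected by its bases."""
    return sum(matrix[l][b] for l, b in enumerate(seq))

def scan(sequence, matrix):
    """Slide the matrix along a test sequence; return (position, Ri) pairs."""
    width = len(matrix)
    return [(j, individual_information(sequence[j:j + width], matrix))
            for j in range(len(sequence) - width + 1)]

# Toy aligned sites (hypothetical data, for illustration only).
toy_sites = ["gtaag", "gtgag", "gtaag", "gtatg"]
toy_matrix = weight_matrix(toy_sites)
for position, ri in scan("ccgtaagcc", toy_matrix):
    print(position, round(ri, 2))
```

Windows with Ri > 0 are candidate sites in this sketch, while windows with Ri < 0 are treated as non-sites, mirroring the Ri > 0 criterion used in the slides.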
Sequence Walker Application II

… a green bar indicates a potential binding site.

[Figure: sequence walker; vertical scale in bits, from -4 to 2]

Ri = 9.2 bits at position 180 of the sequence. The Z score is 0.3.

mRNA splicing

[Diagram: gene DNA (5’ to 3’) with exons 1, 2 and 3 separated by introns IVS 1 and IVS 2, each bounded by a donor and an acceptor site. Transcription yields hnRNA; splicing yields either the mature mRNA (exons 1-2-3) or an alternative mRNA (exons 1-3).]

Splice Site Model Building

• We extracted coordinates of unique donor and acceptor splice sites of known genes from the given strand of the 10/7/00 Human Genome Working Draft.
• Valid splice junctions were evaluated by information theory (Ri > 0) and the Ri(b,l) matrix was computed.
• This process was iterated (~10 cycles) until all sites evaluated with the matrix had Ri > 0.

Semi-automated Splice Site Model Refinement

Parameters                 Acc (+ strand)  Acc (- strand)  Acc_total  Acc (1992)
Starting set (n)           86,068          84,076          170,144    1,744
Refined model (n)          53,985          54,101          108,079    1,744
Site coordinates           [-25, 2]        [-25, 2]        [-25, 2]   [-25, 2]
Rsequence                  7.45            7.41            7.42       8.87
Standard deviation         3.47            3.47            3.47       4.58
Ri of consensus sequence   22.93           22.78           22.88      21.68

Parameters                 Don (+ strand)  Don (- strand)  Don_total  Don (1992)
Starting set (n)           86,221          84,229          170,450    1,799
Refined model (n)          56,286          55,491          111,772    1,799
Site coordinates           [-3, 6]         [-3, 6]         [-3, 6]    [-10, 10]
Rsequence                  6.73            6.74            6.74       8.01
Standard deviation         2.36            2.33            2.34       3.29
Ri of consensus sequence   11.80           11.80           11.79      15.18

• ~1/3 of exon-intron junctions are misaligned in the draft, owing to the rapid alignment procedures used (i.e., BLAT).

Splice junction logos: (+) strand

[Figure: splice junction sequence logos]

Ri analysis of sequence variation at binding sites

• Effects of mutations
• Effects of polymorphisms
• Detection of cryptic sites
• Relationship between information content and phenotype

Comparison of the binding energies of normal and variant splice junctions:

$$\frac{G_{wt}}{G_v} = 2^{\Delta R_i}$$

where ΔRi = the difference between the respective Ri values, Gwt = Free energy of the natural binding site, Gv = Free energy of the variant binding site.
The fold difference in binding the normal vs. the variant site is Gwt/Gv.

mRNA splicing mutations (*, ^)

[Diagram: splicing mutations marked on the gene DNA and hnRNA (exons 1, 2 and 3; introns IVS 1 and IVS 2; donor and acceptor sites). Mutations marked * lead to exon skipping (mRNA with exons 1-3); mutations marked ^ lead to cryptic splicing. The outcome is leaky or no wild-type mRNA (exons 1-2-3) alongside the mutant forms.]

The minimum information required for donor site recognition

Temperature-sensitive mutation in COL3A1 results in 50% exon skipping and Ehlers-Danlos syndrome, Type VII. Splicing is impaired at 39 deg. C and restored at 30 deg. C, which is consistent with weak binding by the U1 spliceosome.

Cryptic splicing mutations

A C->T mutation in exon 3 of the iduronidate synthetase gene activates a cryptic donor site upstream of the natural donor site.

Mechanism of exon recognition

[Diagram: an exon in the mRNA (5’ to 3’) with binding sites at the acceptor (U2 spliceosome + U2AF) and at the donor (U1 spliceosome).]

Mechanism of exon recognition: cryptic splicing mutation (2a)

[Diagram: the same exon with its natural acceptor (U2 spliceosome + U2AF). The activated cryptic donor is recognized by the U1 spliceosome; the natural donor is either not recognized or recognized to a lesser degree.]

Mild (or leaky) splicing mutation

CFTR polymorphism (5T, 7T, 9T); Pop. Freq.: 60%, 35%, 5%

Splicing among 3 common alleles that differ in length in the polymorphic polythymidine tract of the IVS 8 acceptor of the CFTR gene. The shortest allele (top walker) shows 90% skipping of exon 9 and is associated with congenital absence of the vas deferens. Individuals with the two longer alleles have a normal phenotype, although the 7T allele produces less mRNA than the 9T allele.

Prediction of clinical phenotypes

• Hereditary non-polyposis colon cancer
• Hemophilia A and B
• Atherosclerosis

Predicting Phenotype of HNPCC Splicing Mutations by Information Analysis

The Lynch I form of HNPCC is confined to the colon, but the more severe Lynch II type shows multi-organ involvement.
The HNPCC phenotype is hypothesized to be related to the amounts of normal and abnormal MLH1 and MSH2 mRNA present, as predicted from the individual information in mutant splice sites.

[Figure panels: Lynch II mutations; Lynch I mutations]

Mutant splice sites (n=31) in these genes contained significantly less information than the cognate natural sites. Each of the Lynch I mutations had Ri values >2.4 bits, which is consistent with reduction (not abolition) of mRNA. Lynch I and II phenotypes were distinguishable by their Ri values for all but 3 Lynch II mutations (with 2.4 to 4.8 bits).

Statistical analysis: HNPCC

Hypothesis: Ri values will be highest for normal splice sites, intermediate for Lynch I, and lowest for Lynch II syndrome.

The medians for these three groups are different and in the expected order, and there are some outliers in the two Lynch mutation groups. The three groups have significantly different Ri values {Kruskal-Wallis χ² (df=2) = 17.9833, P = 0.0001}.

Each of the groups is different from the others based on pairwise comparisons with the Wilcoxon rank-sum test:

Group comparison               Corrected rank-sum normal (Z) statistic   P
Lynch I vs. Normal variants    2.68                                      0.0072
Lynch II vs. Normal variants   3.73                                      0.0002
Lynch I vs. Lynch II           2.17                                      0.03

Results are consistent with MSH2 -/- and MSH2 +/- transgenic mouse phenotypes. Increased proliferation induces widespread DNA replication errors, which are repaired normally until DNA repair systems are saturated (Cancer Res. 62:2092, 2002). Mismatch repair machinery is activated by DNA damaging agents (Nature 399:806, 1999; PNAS 96:10704, 1999).
Relating Information Content of F8C and F9 Splicing Mutations and Bleeding Phenotypes

Ri value        Reduction in protein level   Reduction in protein activity   Bleeding symptoms
cutoff (bits)   Mild        Severe           Mild          Severe            Mild        Severe
< 2.4           0/13        13/13 (100)      5/37 (14)     32/37 (86)        0/9         9/9 (100)
> 2.4           5/7 (71)    2/7 (29)         23/36 (64)    13/36 (36)        8/21 (38)   13/21 (62)

Values in parentheses are percentages.

To predict severity of hemophilia, mutations in the factor VIII (F8C) or factor IX (F9) genes were analyzed for changes in Ri:

• The receiver operating curve discriminated mildly or moderately reduced from severely reduced protein activity at an Ri cutoff of 2.4 bits or Ri < 7 bits (P = .001).
• Using these thresholds:
  - 91% of mutations with severely reduced protein expression were correctly identified (n = 45; P < 0.001).
  - 86% of mutations associated with severe bleeding and all mutations with moderate bleeding symptoms were correctly identified (n = 22; P < .0009).

Information Content of Splicing Mutations in Lipid Metabolizing Genes vs. Phenotype

Phenotype*
Ri value        Dyslipidemia                 Reduction in protein level or activity
cutoff (bits)   Mild    Average   Severe     Mild    Average   Severe
< 2.4           0/15    10/15     5/15       1/9     7/9       1/9
> 2.4           2/5     3/5       0/5        2/3     1/3       0/3

*Fraction is the number of mutations in category / total number above or below 2.4 bits. Mutant genes included APOAII, APOB, APOCII, APOE, CBS, CETP, LCAT, LIPA, LDLR, and LPL.

Generating information models of eukaryotic transcription factor cis-regulatory binding sites

Unique challenges:

• Variant sequences are not obvious
• Requires experimental determination and validation
• Effect of ascertainment bias in published sites and in SELEX-generated sites
• Binding of a protein does not necessarily signify that it activates (or represses) transcription

Greek Hereditary Persistence of Fetal Hemoglobin (HBGA, -119G>A)

[Figure: sequence walkers of the CCAAT box site, 6.8 bits (wild type) and 7.3 bits (variant)]

(A) Mutation in the CCAAT box of the A-gamma globin gene results in 1.4 fold increased expression of fetal globin mRNA into adulthood.
The CCAAT box protein binding site is strengthened by 0.5 bits (or 1.41 fold) over wild type. (B) The binding site logo and distribution of Ri values of 171 binding sites in the Transfac Database (www.biobase.de) are indicated. Models of NF-E2, GATA1, and GATA2 protein binding sites were also constructed, but sites were not found in this interval (not shown).

The Transcription Factor Binding Site Problem: Bias in Models Derived from TRANSFAC Data towards Consensus Sequences*

*Consensus sequences have the strongest binding, but are often not representative of the majority of sites.

Model development strategy

[Figure: model development strategy]

Refinement of the Pregnane X Receptor (PXR/RXRα) binding site model

Initial PXR/RXR Model. Published PXR/RXR binding sites (n=15; and flanking sequences) were multiply aligned by minimization of uncertainty. The -2 to +20 interval contained most of the information, was consistent with published binding studies, and was therefore used to define the site.

[Figure: sequence logo of the initial model; vertical scale in bits]

Competition Curves for Novel PXREs Identified by Model 1

To quantify the relative affinity of PXR/RXR, band density was plotted versus pmol competitor to determine the concentration of competitor required to deplete PXR/RXRα binding to the CYP3A4 proximal PXRE by 50%. Relative binding was normalized to the band intensity of the reactions with no added competitor as 100%.

Comparison of predicted and measured binding affinities for novel PXR/RXRα sites identified with the initial model

        Position            Ri (bits)          Minimum theoretical change in affinity   Observed change in
Gene    (relative to ATG)   Model 1  Model 2   Model 1        Model 2                   affinity (EMSA)
CYP3A4  -270                17.3     18.0      -              -                         -
CYP2B6  -8572               15.0     17.9      4.92           1.07                      4.4
UGT1A3  -6930               10.9     17.2      84.4           1.74                      4.4
UGT1A3  -8040               10.7     16.5      97.0           2.83                      3.7
UGT1A6  -9216               9.9      14.3      168.9          13.0                      29.6

The slide also showed the PXRE sequence of each site as a Model 2 derived walker.

Predicted fold differences in binding were closer to densitometrically determined differences when these weaker sites were added in Model 2.
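The "minimum theoretical change in affinity" values in the comparison above are consistent with raising 2 to the difference in Ri between the CYP3A4 proximal site and each novel site, and the 0.5-bit CCAAT box change likewise gives about 1.41 fold. The short Python sketch below reproduces that arithmetic from the Ri values quoted on the slide; the function name and the dictionary layout are hypothetical, chosen only for this illustration.

```python
def min_fold_change(ri_reference, ri_site):
    """Minimum fold difference in binding implied by a difference in Ri (bits)."""
    return 2.0 ** (ri_reference - ri_site)

# Ri values (bits) as quoted in the comparison table: (Model 1, Model 2).
REFERENCE = ("CYP3A4 -270", (17.3, 18.0))
SITES = {
    "CYP2B6 -8572": (15.0, 17.9),
    "UGT1A3 -6930": (10.9, 17.2),
    "UGT1A3 -8040": (10.7, 16.5),
    "UGT1A6 -9216": (9.9, 14.3),
}

ref_model1, ref_model2 = REFERENCE[1]
for name, (ri_model1, ri_model2) in SITES.items():
    fold_1 = min_fold_change(ref_model1, ri_model1)
    fold_2 = min_fold_change(ref_model2, ri_model2)
    print(f"{name}: Model 1 {fold_1:.2f} fold, Model 2 {fold_2:.2f} fold")

# The CCAAT box site strengthened by 0.5 bits (7.3 vs. 6.8 bits).
print(f"CCAAT box: {min_fold_change(7.3, 6.8):.2f} fold")
```

Because each bit doubles the predicted fold difference, the smaller Ri gaps under Model 2 translate into much smaller predicted changes in affinity than under Model 1.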
Model 2 Characteristics

[Figure: (A) Alignment of published + validated PXREs; (B) Histogram; (C) Sequence logo]

Scans of CYP3A4 and CYP2B6 promoters

Each promoter was scanned with PXR/RXR model 2. Ri values are plotted versus the position of the PXRE in the CYP3A4 gene or the CYP2B6 gene. Ri values of sites on the antisense strand are shown upside down. Previously characterized PXR binding sites identified by the model are indicated in color.

Activation of the CYP2B6 Distal PXRE

Transient transfections with CYP2B6 and control CYP3A4 PXRE fusion constructs. Rifampin induced luciferase activity 4- to 5-fold in cells cotransfected with an expression plasmid for human PXR and CYP2B6-dPXRE(2X)-luc, and 2- to 3-fold in cells cotransfected with CYP3A4-pPXRE(2X)-luc. Rifampin had no effect on luciferase activity in cells transfected with the enhancerless reporter. Average luciferase activity ± SD of three replicates from 3 independent transfections is shown.

PXR/RXR Model 3

Weaker binding sites from well established PXR/RXRα target gene promoters (Ri < Rsequence) were validated and introduced into Model 3.

Novel validated binding sites in Model 4

These 14 binding sites are not present in the Nov 02 human genome draft!
Site name              Site name - Ri(b,l) matrix                 Ri
CYP3A4-pPXRE(0/10G)    NG_000004.a148729g.a148739g                15.1
CYP3A-dNR1(0/10G)      NG_000004.t141178c.t141168c                16.8
CYP3A7-dNR2(0/10G)     NG_000004.a190205g.a190215g                17.6
CYP2B6-dPXRE(10G)      CYP2B6.a1446g                              16.2
UGT1A3b(0/10G)         AF297093.t137695c.t137685c                 18.3
UGT1A3a(0/10G)         AF297093.a138805g.a138815g                 14.9
GSTM1(0/10G)           AC000031.6.a1959g.a1969g                   12.0
UGT1A1gtNR1(0/10G)     AF297093.1.t171676c.t171666c               7.1
UGT1A1b(0/10G)         AF297093.1.t165761c.t165751c               14.0
FMO4b(10G)             AL031274.1.a57947g                         11.0
catalase(0/10G)        AL035079.14.t43503g.a43513g                14.6
NOS2A(1A)              chr17_27002541-27012540.c8336t             12.9
NOS2A(11A)             chr17_27002541-27012540.c8326t             10.5
MAOBd(0/10G)           Z95125.t36576c.t36566c                     11.1

Possible significance of novel sites

• Not present in reference sequence, but they are polymorphisms or mild mutations
  – Advantage is that binding is not abrogated, but reduced, i.e., gene is less PXR/RXR responsive.
  – Possible “wobble” code for regulatory elements
• Ancestral binding sequence present in primate lineage
  – PXR/RXR mutation rate is slower than that of the cis-regulatory element; protein retains ability to recognize sequences that are no longer present
  – This could explain why heterologous cross-species transfections are faithfully regulated.

Development of a Xenobiotic biosensor based on the information theory-derived optimal site

HepG2 cells were transiently transfected with 100 ng luciferase reporter, 5 ng pRL-CMV and 25 ng pSG5-hPXRΔATG with Lipofectamine Plus. After treatment for 24 hours with 10 µM rifampin or 0.1% DMSO (solvent), cells were harvested and dual-luciferase assays were performed. Results (Firefly RLU/Renilla RLU) are the average of three separate wells transfected and treated in parallel.
[Bar chart: DMSO versus 10 µM rifampin for PXREv2-OPT(2X)-luc and CYP3A4-pPXRE(2X)-luc; vertical scale 0 to 14]

Architecture of the Delila Genome System

[Figure: system architecture]

Performance metrics

[Figure: histogram of binding site strengths for sites in genome scan >10 bits]

Delila-Genome Visualization Tools

[Figure: visualization of successive genome scans of PXR/RXRα binding site models]

Monitoring PXR/RXR refinement through complete genome promoter scans

Table 2: Differences in total binding site counts based on genome scans of promoters with successive PXR/RXR information weight matrices

Number of sites in each category. S and I are counts within (A ∩ B).

                 Unique sites       Z scores                   Confidence intervals+          Ri
Models (A, B)    A-B*     B-A^      Thr. (Z)  S~      I@       Thr. (±S.D.)  S       I         Thr. (Ri, bits)  S       I
1, 2             11758    45219     1.0       589     71658    3             2293    69954     3                23625   48622
2, 3             17065    157922    1.0       48657   51744    3             11044   89357     3                37822   62579
3, 4             61906    148894    1.0       5044    191373   3             11069   185348    3                68846   127571

+Standard error computation for individual Ri values is based on derivation given in reference 18; *sites found with model A but not with model B; ^sites found with model B, but not with model A; ~number of sites with differences in Ri values exceeding threshold Z scores; @number of sites with differences in Ri values less than the threshold.

Development and Experimental Refinement of NFkB p65/p50 Binding Site Model

Panel 1. Logos for NFkB p50/p65 binding sites. (A) Model 2, based on 55 published and 8 experimentally determined binding sites. (B) Model 3, based on 55 published and 20 experimentally determined binding sites. Insets are histograms of the Ri values of the sites comprising each model.

CYP2D6 Promoter Mutation Analysis of NFkB p65/p50 Binding Site

CYP2D6: “C allele” 3.3 bits; “G allele” -0.8 bits

The -1496C allele contains a weak p50/p65 site (–1495 to –1508; Ri = 3.3 bits) that is abolished (Ri < 0) in the G variant.
These alleles each also contain p50 homodimer binding sites on opposite strands; however, the C allele is predicted to bind with greater affinity (3.5 vs. 2.7 bits; 1.6 fold difference). The higher CYP2D6 activity observed for the –1496G allele may be due to reduced binding and repression of CYP2D6 expression by NF-kB p50 homodimers.

Future efforts

• Automate binding site validation
• Genomic signature of PXR/RXRα target genes
• (Hypothesis-based microarray studies of ligand-induced gene expression)

Automated binding site validation: microtiter plate immunoassay

• Covalently link reference oligo to plate
• Bind synthetic PXR/RXRα ± competitor oligo*
• Bind 1° RXRα (or PXR) antibody
• Detect with 2° antibody/HRP
• (Automated with Biomek 2000 workstation)

*Competitor oligos are detected in PXR/RXRα target genes and exhibit Ri values that are within ±2 bits of the reference oligo.

Genomic analysis to identify genes regulated by transcription factors:

• Requires robust binding site model
• Genomic signature should delineate differences between regulated and constitutively expressed genes:
  • Define promoter interval
  • Binding site strength
  • Densities of sites
  • Organization of sites

Genes regulated by NF-kB + unregulated “NF-kB binding sites” in gene promoters

[Scatter plot: Ri (0 to 16) versus position (-10000 to 0); legend: Ri-reg (n=8), Ri-unreg (n=3)]

NF-kB Binding Sites in Upregulated Genes

-400 bp binding sites for promoters of upregulated genes scanned by model 3

[Scatter plot: Ri (0 to 16) versus position (-400 to 0); legend: INF-beta, LCAM, E-Selectin, Lymphotoxin, TNF-alpha, IL-2, GM-CSF, Urokinase; annotation: Ri = 4.0]

“NF-kB binding sites” in genes not known to be regulated by NF-kB

[Scatter plot: Ri (0 to 16) versus position (-400 to 0); legend: GAPDS, GAPD, VEGF; annotation: Ri = 1.3]

Criteria for scanning chromosomes 21/22 with NF-kB Model 3:

• Average information
threshold of >4 bits. Of 548 promoter intervals (400 bp each), the mean Ri values for sites in 138 promoters on the transcribed strand and 137 on the antisense strand exceeded the threshold. 37% of the genes on chromosome 21 would be NF-kB targets!! Also, multiple weak binding sites with low Ri values can falsely exclude genes containing strong binding sites. This genomic signature has very LOW specificity.
• Eliminate promoters with only weak binding sites (Ri < Rsequence). This signature identifies a smaller set of genes: 11 and 19, respectively, on chromosomes 21 and 22. Several expected cytokine genes are not identified with this genomic signature. These criteria introduce bias towards the consensus sequence (or an incomplete model). This approach appears to lack adequate sensitivity.

Genomic signature determination for PXR/RXR with machine learning approach

[Flowchart, refinement of genomic signature: true positives, true negatives and unknowns from the genome scan feed training/validation and prediction, with newly confirmed positives and negatives added back. Promoter region input is converted to a frequency distribution of binding strengths and, through the Markov Cluster Algorithm, to clusters of sites and distances from the TSS. A hybrid neural network produces a positive/negative prediction, followed by experimental confirmation.]

Predictions of Binding Strength Network

• Network input: frequency distributions of binding sites based on 5 bit-wide bins
• Trained with 15 PXR/RXR responsive and 15 non-responsive promoter regions
• Results of testing 9 positive and 22 negative promoter regions:
  – <TP,FP,TN,FN> = <7,4,18,2>
  – Sensitivity = 77.8%
  – Specificity = 81.8%

In conclusion...

• Genetic variation in binding sites can be comprehensively modeled by information theory.
• Information is related to binding energy and can be used to rank order binding strengths.
• Beware of experimental bias towards strong binding sites. Information theory can be used to develop and refine binding site models that are representative of the range of binding strengths found in the genome.
• Robust binding site models are a prerequisite for accurate mutation/polymorphism analysis and for comprehensive identification of binding sites in the genome.

Contributors

Children’s Mercy Hospital and Clinics:
• Sashidar Gadiraju, Stan Svojanovsky
• J. Steven Leeder, Carrie Vyhlidal, Ivy Hurwitz

SICE, University of Missouri-Kansas City:
• Deendayal Dinakarpandian, Saumil Mehta

St. Jude’s Children’s Research Hospital:
• Erin Schuetz

University of Hamburg:
• Yskert von Kodolitsch

NCI:
• Tom Schneider

Support

Merck Genome Research Foundation
PHS ES10855-02
