Localising regulatory elements using statistical analysis and shortest unique substrings of DNA

Nora Pierstorff1, Rodrigo Nunes de Fonseca2, Thomas Wiehe1
1 - Institute for Genetics, University of Cologne, Germany, Email: [email protected]
2 - Institute for Developmental Biology, University of Cologne

INTRODUCTION: To localize regulatory regions, three basic computational approaches have been followed:
1. Search for binding sites of known transcription factors using Position Weight Matrices (PWMs). [1]
2. Search for conserved motifs in upstream regions of homologous or coregulated genes. [2]
3. Search for statistically overrepresented motifs. [3]

Our program SHUREG follows the third approach, which is supported by two hypotheses:
1. Degenerate binding sites lead the transcription factor to the binding site.
2. New binding sites can be created easily from degenerate binding sites through a few mutations, adapting the organism to environmental changes.

SHUREG - ALGORITHM:
1. Calculation of shustrings (shortest unique substrings) at every position, relative to a surrounding window, on the forward and backward strand.
2. Counting of neighbours (exact repeats in the surrounding sequence).
3. Calculation of P-values for each shustring.
4. Smoothing of P-values.

Figure 1a: SHUREG prediction in the giant region (Drosophila melanogaster).
Figure 2a: SHUREG prediction in the hairy region (Drosophila melanogaster). x-axis: nucleotide position; annotated features: translation start site (10213) and stripes 1, 2, 5, 6 and 7.
Figure 2b: Ahab prediction in the hairy region (Drosophila melanogaster). x-axis: nucleotide position; annotated features: translation start site (10213) and stripes 1, 2, 5, 6 and 7.

We applied our program to several well-explored regions of the Drosophila melanogaster genome. Our dataset includes segmentation and dorsal-ventral genes.
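Step 1 of the SHUREG algorithm can be illustrated with a minimal sketch. This is not the SHUREG implementation: it is a naive search that considers the forward strand only, and the function names and the `half_window` parameter are hypothetical, chosen for illustration. For each position it returns the length of the shortest substring starting there that occurs exactly once within the surrounding window.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def shustring_lengths(seq, half_window):
    """Shortest unique substring length at every position of seq.

    Assumption: the window around position i spans half_window bases on
    each side (clipped at the sequence ends), and only the forward strand
    is considered. Returns None for a position where no substring
    starting there is unique within its window.
    """
    n = len(seq)
    lengths = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = seq[lo:hi]
        found = None
        # Grow the substring until it occurs exactly once in the window.
        for length in range(1, hi - i + 1):
            if count_occurrences(window, seq[i:i + length]) == 1:
                found = length
                break
        lengths.append(found)
    return lengths


print(shustring_lengths("AAAC", half_window=10))  # [3, 3, 2, 1]
```

Short shustrings indicate locally rare words, while long ones indicate that the sequence starting at that position is repeated nearby. A production implementation would typically use a suffix tree or suffix array instead of repeated string searches.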
RESULTS: We compare our predictions to the results of Ahab [1], a program that uses PWMs.

Figure 1 shows two predictions for the giant region. Figure 1a is computed using SHUREG; Figure 1b is the result of the Ahab program applied to the same sequence.

Figure 1b: Ahab prediction in the giant region. y-axis: Ahab value.

Figure 2a shows the SHUREG prediction for the regulatory regions of the hairy gene. Figure 2b shows the corresponding Ahab prediction.

Figure 3 is partitioned into three predictions. Figure 3a is the SHUREG prediction for the Dorsal-regulated enhancer of the sog gene. Figure 3b shows the Ahab prediction using only the PWM of the Dorsal binding site. Figure 3c shows the Ahab prediction using all known PWMs, for the hypothetical case in which the factors actually responsible for regulating this gene are unknown.

Figure 3a: SHUREG prediction in the sog region (Drosophila melanogaster). y-axis: SHUREG value.
Figure 3b: Ahab prediction in the sog region using the Dorsal PWM. y-axis: Ahab value.
Figure 3c: Ahab prediction in the sog region using all known PWMs. y-axis: Ahab value.
In all three panels the x-axis is the nucleotide position; annotated features: exons and CRM.

ABSTRACT: Several computational methods for predicting regulatory regions have been developed in the last few years. Most of the available methods require transcription factor binding site matrices to achieve reasonable results. To avoid the need for such biological information, we developed a program named SHUREG that predicts regulatory regions without any extrinsic information, using only the sequence itself. By calculating shustrings (shortest unique substrings) we find statistically overrepresented motifs, which are assumed to be indicators of regulatory elements. [3]

WHY SHORTEST UNIQUE SUBSTRINGS? Analyzing the human (mouse) genome, we found 255 (293) global shustrings of length 11 bp. [4] Of these, 29 (22) are located in 1000 bp upstream regions. The probability of this distribution is 3.3 x 10^-24 (5.0 x 10^-18).
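The poster does not state how the probability of finding 29 of 255 global shustrings in upstream regions was computed. One plausible way to assess such an enrichment is a binomial tail probability, sketched below. The binomial model and the value of `upstream_fraction` (the share of the genome covered by 1000 bp upstream regions) are assumptions made for illustration, so the printed number should not be expected to match the figure quoted in the text.

```python
import math


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return math.fsum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
        for i in range(k, n + 1)
    )


# Counts taken from the text (human genome).
total_shustrings = 255
upstream_shustrings = 29

# HYPOTHETICAL value: fraction of the genome lying in 1000 bp upstream
# regions. The poster does not give this number.
upstream_fraction = 0.008

p_value = binomial_tail(total_shustrings, upstream_shustrings, upstream_fraction)
print(f"P(X >= {upstream_shustrings}) = {p_value:.3g}")
```

Under this model a small tail probability means that shustrings fall in upstream regions far more often than their share of the genome would suggest.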
DISCUSSION: Localizing regulatory regions without any extrinsic information is a hard problem. Using the number of overrepresented patterns in a region as an indicator of regulatory regions is a reasonable measure and can lead to reasonable results. However, it also leads to many false positive predictions, because we find additional overrepresented patterns that cannot be correlated with binding sites. To improve the predictions of our method, we need to find more features that distinguish true positive from false positive predictions. We are currently investigating the conservation of overrepresented motifs between species.

References:
[1] N. Rajewsky, M. Vergassola, U. Gaul, and E. D. Siggia (2002). Computational detection of genomic cis-regulatory modules, applied to body patterning in the early Drosophila embryo. BMC Bioinformatics, 3:30.
[2] H. Bussemaker, H. Li, and E. Siggia (2000). Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. PNAS, Aug 2000; 97.
[3] A. Nazina and D. Papatsenko (2003). Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinformatics, 4:65.
[4] B. Haubold, N. Pierstorff, F. Moeller, and T. Wiehe (2005). Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics, 6:123.