Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Prediction of Regulatory Elements for Non-Model Organisms Rachita Sharma, Patricia Evans, Virendra Bhavsar Faculty of Computer Science, University of New Brunswick, Fredericton, NB, Canada E3B 5A3 Analysis Introduction Determination of regulatory networks from available data is one of the major challenges in bioinformatics research. A regulatory network of an organism is represented by a set of genes and their regulatory relationships, which indicate how a gene or a group of genes affect (inhibit or activate) production of other gene products as shown in Figure 1. Some organisms such as yeast, Arabidopsis thaliana (thale cress, a plant) and fruit fly are being investigated very thoroughly by biologists as model organisms. We are developing a system to predict the regulatory relationships of a non-model organism (target genome), about which less information is known, using information about the regulatory relationships of a related model organism (source genome). If the organisms are closely related then the regulatory relationships are likely to be similar. Differences in the regulatory relationships between organisms can be determined by using data from both the model and non-model organisms. This research started as a part of the bioinformatics research component of the Canadian Potato Genome Project. This methodology has been implemented for mapping regulatory elements and their regulatory network. The first step of mapping regulatory elements has been tested on Yeast (Saccharomyces cerevisiae) and Arabidopsis thaliana as the source and target genomes, respectively, which diverged approximately 1.6 Giga-years ago. For any pair of genomes, only some of the transcription factors from one genome can be mapped to another genome, since the evolutionary distance between them leads to many false negatives. In addition, the number of confirmed mappings between any two genomes is unknown as it depends on the definition of a confirmed mapping used in the experiment. The predicted transcription factors are compared on the basis of how likely a sequence predicted as a transcription factor is to be a transcription factor of the target genome how likely the predicted transcription factor is to correspond to the correct type of transcription factor from the source genome Therefore, the predicted transcription factors are compared to a set of 1922 available transcription factors of the Arabidopsis thaliana genome to determine the actual number of transcription factors predicted. | Inhibition Determine associations between the genes that act as regulatory elements (transcription factors and target genes) in model and non-model organisms Predict the regulatory relationships in a non-model organism Methodology Find transcription factors of the target genome using the available regulatory element information of the source organism based on Similar sequences (TF-Seq) Same protein domain family (TF-Fam) Same protein domain sub-family (TF-SubFam) Map target genes from the source genome to the target genome based on finding transcription factor binding site motifs (TFBS) in Nucleotide data of the target genome (BS-Seq) Promoter data of the target genome (BS-Prom) Similar target gene sequences of source genome in the target genome (BSBlast) Nucleotide data of the target genome discarding binding sites located in the predicted regions of nucleosome occupancy (BS-Nuc) Gene expression data will be used to further refine the regulatory network to understand how the predicted regulatory relationships correspond to the expression levels of the genes in the data FN 1200 TP 1000 800 600 400 200 0 TF-Seq TF-Fam TF-SubFam Method Figure 2: Number of true positives, false positives, false negatives and true negatives for transcription factors identified using TF-Seq, TF-Fam, and TF-SubFam Conf irmed Similar Other TF Not TF Types Figure 3: Number of hit sequences divided into four types (Confirmed, Similar, Other TF and Not TF) using TF-Seq for BLAST e-value cut-off parameter of 0.1 Transcription factor mapping based on having the TN FP FN TP same protein domain family has better 40000 performance than the other two methods based 35000 on sequence similarity and having the same 30000 protein domain sub-family as shown in Figure 2. Also, the transcription factors predicted are of 25000 the correct type as illustrated in Figure 3 and the 20000 sequences with similar annotation may be part 15000 of the false positives. Figure 4 shows that target 10000 gene mapping by finding TFBS motifs in promoters has better performance than the other 5000 methods. The sequence similarity in BS-Blast is 0 not useful for mapping target genes, showing BS-Seq BS-Prom BS-Blast BS-Nuc that target genes with similar binding sites do not Method need to have high sequence similarity. Also, Figure 4: Number of true positives, false positives, using BS-Nuc to refine the results of BS-Prom false negatives and true negatives for target genes using the Nucleosomes Position Prediction tool identified using BS-Seq, BS-Prom, BS-Blast and BSdoes not improve the performance of the results, Nuc showing the effects of the variable position of the transcription-suppressing nucleosomes. Number of sequences Objectives FP 36000 35500 35000 34500 34000 33500 33000 32500 32000 31500 Activation Figure 1: Example of gene regulatory network TN Total number of hits Number of sequences Results Conclusion These results in this work show that TF-Fam and BS-Prom are promising methods for predicting regulatory elements for a non-model organism based on a model organism. These regulatory elements can be used further to predict the regulatory network of the non-model organism.