Promoter Analysis for Intestinally-Expressed C. elegans genes

1. Objectives
   a. Find conserved sites in the upstream regions of 74 intestinally-expressed genes
   b. Also analyze the orthologues of these genes in C. briggsae and C. remanei
   c. Provide evidence, if possible, for the ELT-2 theory of intestinal gene regulation
2. Summary
   a. Motif discovery is complete for all 74 C. elegans genes, 57 C. briggsae orthologues and 38 C. remanei orthologues
   b. Hit sequences have been extracted and aligned
   c. We need to discuss how to generate our final set of hits from this data

Completed to date

1. Motif Discovery in 74 C. elegans genes
   a. 2 motif discovery algorithms used
      i. MotifSampler
         1. Settings: 100 iterations, up to 5 motifs reported per sequence
         2. Background: 150 kb of randomly chosen concatenated upstream sequences
         3. Motif lengths: 6, 8, 10, 12
         4. Filtering step:
            a. Only kept those motifs that were found in the exact same place at least 7 times
            b. Overlapping hits that met this criterion were merged into one long hit
         5. Found motifs that met this criterion on 58 of the 74 sequences
      ii. RSAT
         1. Word counter
         2. Background: the “words” in *all* upstream sequences from C. elegans were counted
         3. Default settings used
         4. Motif lengths: 6, 7, 8 (those are the only possibilities)
         5. Found significant hits on all 74 sequences
   b. Results:
      i. Image of all MotifSampler results: Cele_all_motifsampler.GIF
      ii. Image of filtered MotifSampler results: Cele_filtered_motifsampler.GIF
      iii. Image of RSAT results: Cele_RSAT.GIF
      iv. Image of filtered MotifSampler results plus RSAT results: Cele_RSAT_filt_motifsampler.GIF
   c. Observations:
      i. Lots of overlap between MotifSampler and RSAT predictions
      ii. RSAT finds all occurrences of a given sequence, while MotifSampler finds only some of them and ignores others
      iii. However, in general RSAT returns too many results to be useful by itself, especially at length 6 bp
2. Analysis of Orthologous Sequences
   a. Origin of orthologues:
      i. C. briggsae:
         1. Wormbase cb25 release
         2. 57 orthologues found
      ii. C. remanei:
         1. C. elegans Wormpep sequence aligned against remanei supercontigs using WABA
         2. Only non-ambiguous results that match right from the Wormpep ATG were used
         3. 38 sequences found
   b. Analysis method same as for C. elegans
   c. Motif discovery results:
      i. C. briggsae: image of filtered MotifSampler results and RSAT: Cbri_RSAT_filt_motifsampler.GIF
      ii. C. remanei: image of filtered MotifSampler results and RSAT: Crem_RSAT_filt_motifsampler.GIF
3. Motif Sequences
   a. The sequences of all hits were extracted and flipped to the strand that maximized As and Gs
   b. The sequences were then run through ClustalW. Alignments can be seen in the following files:
      i. C. elegans: Cele_all_hits_aligned.txt
      ii. C. briggsae: Cbri_all_hits_aligned.txt
      iii. C. remanei: Crem_all_hits_aligned.txt
   c. Observations:
      i. Most of the hits in all 3 species are TGATAA sites or some variation, but a few aren’t related to TGATAA at all
      ii. Hits vary hugely in length (due to the merging of overlapping MotifSampler hits of the same length)
      iii. Each result set was extracted independently, so these hits overlap with each other in the original sequence and appear varying numbers of times
      iv. It is not at all clear which of these hits we should use to form a Position Frequency Matrix and which ones we should discard

Future Work

1. Position Frequency Matrices
   a. Need to determine possible true motif length and consensus sequence
   b. The consensus sequence can then be used to scan the original upstream regions to generate full-length motifs
   c. A final set of PFMs and logos can be generated
2. Negative Controls (will be done last, after the procedure for the test sets is finalized)
   a. Mirror image genes
   b. Set of 74 randomly chosen C. elegans genes
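
Three of the steps above lend themselves to a short illustration: the MotifSampler filtering step (keep hits seen in the exact same place at least 7 times, then merge overlaps), the flip to the strand that maximizes As and Gs, and the Position Frequency Matrix planned under Future Work. The Python sketch below is only one possible reading of those steps, not the scripts actually used. All function names are hypothetical, hits are assumed to be 0-based half-open (start, end) coordinates collected for one motif length on one upstream sequence, and the PFM assumes sites that have already been trimmed to a common length.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def filter_and_merge(hits, min_count=7):
    """Keep (start, end) hits reported at least min_count times, then merge overlaps.

    Assumption: hits are 0-based, half-open intervals from repeated runs
    at a single motif length on a single sequence.
    """
    counts = Counter(hits)
    kept = sorted(hit for hit, n in counts.items() if n >= min_count)
    merged = []
    for start, end in kept:
        if merged and start < merged[-1][1]:
            # Overlaps the previous kept hit: extend it into one longer hit.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def purine_count(seq):
    """Count As and Gs in a sequence."""
    return sum(1 for base in seq.upper() if base in "AG")


def orient_to_purine_rich_strand(seq):
    """Return whichever strand has more As and Gs (ties keep the given strand)."""
    forward = seq.upper()
    reverse = reverse_complement(forward)
    return forward if purine_count(forward) >= purine_count(reverse) else reverse


def position_frequency_matrix(sites):
    """Count each base at each position across equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site.upper()):
            if base in pfm:  # ambiguous bases such as N are skipped
                pfm[base][i] += 1
    return pfm


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    example_hits = [(10, 16)] * 7 + [(13, 19)] * 8 + [(40, 46)] * 3
    print(filter_and_merge(example_hits))
    print(orient_to_purine_rich_strand("TTATCA"))
    print(position_frequency_matrix(["TGATAA", "TGATAG"]))
```

Because merged hits vary in length, they would need to be cut back to a shared core (for example around the TGATAA-like consensus) before being passed to `position_frequency_matrix`. How to choose that core is the open question raised under Observations.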