SPoRE

Here we provide the R scripts that allow you to reproduce the computation of the models proposed in the article, reproduce the benchmarks of the article, predict new hot and cold spots, and use SPoRE on another genome.

Requirements:
- R 2.14 or later (http://www.r-project.org/); R 3.0 has also been tested
- the R package seqinr
- OS:
    o Linux has been tested
    o Windows works (tested with Windows XP), but the directory containing the R binaries must be added to the PATH so that the "Rscript" command, which SPoRE needs, can be found

How to run the complete analysis:

You can regenerate all the models by running this command:

    Rscript scripts/SPoRE_predict_all.R

You can then compare them to S. cerevisiae experimental data, to reproduce the benchmarks of the article:

    Rscript scripts/SPoRE_benchmark_all.R

Model curves:

The curves produced by the models are stored in the WIG directory, in WIG format, under the names axis_model3-1500.wig (Red1 model 3) and DSB_model6-250.wig (DSB model 6). You can load them in a program such as IGV or in an online service such as the UCSC Genome Browser.

Hotspot prediction:

To predict hot and cold spots, use the following command:

    Rscript scripts/SPoRE_predict_hotspots.R spo11-spots DSB_model6-250

This classifies as hot or cold the spots listed in spots-input/spo11-spots.txt, using DSB model 6 (i.e. the curve in DSB_model6-250.wig). Note that you need to generate the curves as explained above before using hotspot prediction.

The example file provided, spo11-spots.txt, contains the hot and cold spots on chromosome IV that we used for the benchmark presented in the article, so you can reproduce our prediction, which gives an accuracy of 84%.

The "hot" column in the input file is optional. If it is present, it is interpreted as experimental data telling whether each spot is actually a hotspot; a benchmark is then run, and the numbers of true/false positives/negatives are computed, as well as the accuracy.
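The benchmark described above can be illustrated with a small Python sketch. It is not SPoRE code: the function name `benchmark` is hypothetical, and it assumes the usual definition of accuracy, (TP + TN) / total, which this README does not spell out. It takes the predicted labels and the experimental "hot" labels as two lists of booleans.

```python
from collections import Counter


def benchmark(predicted, actual):
    """Count true/false positives/negatives and compute the accuracy.

    `predicted` and `actual` are equal-length sequences of booleans
    (True = hot, False = cold). Assumption: accuracy is (TP + TN) / total.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for p, a in zip(predicted, actual):
        if p and a:
            counts["TP"] += 1      # predicted hot, actually hot
        elif p and not a:
            counts["FP"] += 1      # predicted hot, actually cold
        elif not p and a:
            counts["FN"] += 1      # predicted cold, actually hot
        else:
            counts["TN"] += 1      # predicted cold, actually cold

    total = len(predicted)
    accuracy = (counts["TP"] + counts["TN"]) / total if total else float("nan")
    return {**counts, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the article.
    predicted = [True, True, False, False, True]
    actual = [True, False, False, True, True]
    print(benchmark(predicted, actual))
```

With real data, the two lists would come from the predicted and experimental columns of the spot table.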
The output is stored in spots-output/spo11-spots.txt (i.e. the same name as the input file, but in the spots-output directory). Compared to the input table, the output table has two new columns:
- predictedDensity, the predicted density of DSBs in the spot
- predictedAsHot, which tells whether SPoRE predicts the spot as hot (TRUE) or cold (FALSE)

Axis site prediction:

Axis site prediction works exactly like DSB hotspot prediction, except that you should use "axis_model3-1500" as the second parameter instead of "DSB_model6-250".

To adapt the analysis for another species, you have to change:
- genomes/yeast_genome.fasta, which contains the yeast genome in FASTA format. Each sequence is a chromosome, numbered from 1 to N (N=16 for S. cerevisiae).
- genome_info_matrix/yeast_genes_for_model.txt, which is a matrix with the genes to consider.
- data/TF.txt, which contains the TFBS positions for the genes, if and only if you want to use model 7. Model 7 takes into account the Transcription Factor Binding Sites (TFBS) to define the promoter positions, instead of the automatic approximation used in models 3 to 6.

How to format this gene matrix:

Don't change the names of the columns; they are referenced by our program. They are:
- id: unique id for the gene (can be whatever you want, it just has to be unique)
- chromosomeNumber: chromosome number from 1 to N (integer)
- strand: "FORWARD" or "REVERSE"
- positionMin: first position of the gene (included)
- positionMax: last position of the gene (included)
The positions are relative to the chromosome, with the first base numbered as 1.

How to format the TF.txt matrix (only necessary for DSB model 7):

This matrix contains the transcription factor binding sites for each gene.
- chr: chromosome number from 1 to N (optional - unused by SPoRE)
- position: position on the chromosome where the transcription factor binds to regulate the target gene
- value: any number (optional - unused by SPoRE)
- target: the id of the target gene (an id appearing in the id column of the gene matrix)
- TF: a name for the transcription factor (optional - unused by SPoRE)

As you can see, only the "position" and "target" columns are actually used by SPoRE. The chromosome number is not used because SPoRE assumes that a TFBS of a gene lies on the same chromosome as the gene (which should be the case unless there is an error in the data).

Note that the values in the target and TF columns are not unique, since a TF may regulate several genes and several TFs may regulate a single gene.

If a gene has no TFBS at all (it never appears in the "target" column), the promoter position approximation of models 3-6 is used, so incomplete information is not a problem. In the extreme case where TF.txt is empty, model 7 is identical to model 6.
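Before running SPoRE on another genome, it may help to sanity-check the two tables against the rules above. The Python sketch below is not part of SPoRE. The helper names `check_gene_rows` and `check_tf_rows` are hypothetical, rows are assumed to be already parsed into dictionaries (the file delimiter is not specified here, so parsing is left out), and the check positionMin <= positionMax is an assumption rather than a documented rule.

```python
GENE_COLUMNS = ("id", "chromosomeNumber", "strand", "positionMin", "positionMax")


def check_gene_rows(rows, n_chromosomes):
    """Return a list of problems found in gene matrix rows (list of dicts)."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows, start=1):
        missing = [c for c in GENE_COLUMNS if c not in row]
        if missing:
            problems.append(f"gene row {i}: missing columns {missing}")
            continue
        if row["id"] in seen_ids:
            problems.append(f"gene row {i}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        chrom = row["chromosomeNumber"]
        if not isinstance(chrom, int) or not 1 <= chrom <= n_chromosomes:
            problems.append(f"gene row {i}: chromosomeNumber {chrom!r} not in 1..{n_chromosomes}")
        if row["strand"] not in ("FORWARD", "REVERSE"):
            problems.append(f"gene row {i}: strand {row['strand']!r} is not FORWARD or REVERSE")
        lo, hi = row["positionMin"], row["positionMax"]
        # Positions are 1-based; min <= max is an assumption of this sketch.
        if lo < 1 or hi < lo:
            problems.append(f"gene row {i}: invalid positions {lo}..{hi}")
    return problems


def check_tf_rows(tf_rows, gene_ids):
    """Return a list of problems found in TF.txt rows (list of dicts).

    Only "position" and "target" are checked, since the other columns
    are described as unused.
    """
    problems = []
    for i, row in enumerate(tf_rows, start=1):
        if "position" not in row or "target" not in row:
            problems.append(f"TF row {i}: needs both 'position' and 'target'")
        elif row["target"] not in gene_ids:
            problems.append(f"TF row {i}: target {row['target']!r} is not a known gene id")
    return problems


if __name__ == "__main__":
    # Made-up rows for illustration; "geneA" and "geneB" are not real ids.
    genes = [
        {"id": "geneA", "chromosomeNumber": 1, "strand": "FORWARD",
         "positionMin": 100, "positionMax": 900},
        {"id": "geneB", "chromosomeNumber": 17, "strand": "reverse",
         "positionMin": 500, "positionMax": 200},
    ]
    tfs = [{"position": 50, "target": "geneA"}, {"position": 70, "target": "geneC"}]
    for problem in check_gene_rows(genes, 16) + check_tf_rows(tfs, {g["id"] for g in genes}):
        print(problem)
```

An empty TF list passes the check, which matches the case where model 7 falls back to the model 6 behaviour.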