Download SPoRE - LCQB

SPoRE
Here we provide the R scripts that allow you to reproduce the computation of the models
proposed in the article, reproduce the benchmarks of the article, predict new hot and cold
spots, and use SPoRE on another genome.
Requirements:
- R 2.14 or later (http://www.r-project.org/); R 3.0 has also been tested
- The R package seqinr
- OS:
  - Linux has been tested
  - Windows works (tested with Windows XP), but the directory containing the R binaries
    must be added to the PATH so that the "Rscript" command, which SPoRE needs, can be
    found.
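Before launching SPoRE, it can help to confirm that the "Rscript" command is reachable. The following is a minimal Python sketch, not part of SPoRE; the function name find_rscript is hypothetical, and it relies only on shutil.which from the Python standard library.

```python
import shutil


def find_rscript(search_path=None):
    """Return the full path of the Rscript executable, or None if it is not found.

    search_path is an optional PATH-style string; when it is None, the PATH
    environment variable of the current process is searched.
    """
    return shutil.which("Rscript", path=search_path)
```

If find_rscript() returns None, the directory with the R binaries is probably missing from the PATH.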
How to run the complete analysis:
You can regenerate all the models by running this command:
Rscript scripts/SPoRE_predict_all.R
You can then compare them to S. cerevisiae experimental data like this (to reproduce the
benchmarks of the article):
Rscript scripts/SPoRE_benchmark_all.R
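The two commands above have to run in this order, since the benchmark compares the regenerated models. A possible way to chain them from Python is sketched below; it is not part of SPoRE, the names PIPELINE and run_pipeline are hypothetical, and it assumes the current directory is the SPoRE root so that the relative script paths resolve.

```python
import subprocess

# Script paths as given in this README, relative to the SPoRE root directory.
PIPELINE = [
    ["Rscript", "scripts/SPoRE_predict_all.R"],
    ["Rscript", "scripts/SPoRE_benchmark_all.R"],
]


def run_pipeline(commands=PIPELINE, run=subprocess.run):
    """Run each command in order and stop at the first failure.

    `run` is injectable so the sequencing can be checked without R installed.
    Returns the list of commands that completed successfully.
    """
    completed = []
    for argv in commands:
        result = run(argv)
        if result.returncode != 0:
            raise RuntimeError("command failed: " + " ".join(argv))
        completed.append(argv)
    return completed
```

Calling run_pipeline() with no arguments executes both scripts and raises RuntimeError if the model generation fails, so the benchmark never runs against stale or missing curves.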
Model curves:
The files with the curves produced with the models are in the WIG directory (and are in WIG
format), and are named axis_model3-1500.wig (Red1 model 3) and DSB_model6-250.wig
(DSB model 6). You can load them in a program like IGV or on an online service like UCSC
Genome Browser.
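If you prefer to inspect a curve programmatically rather than in a genome browser, the sketch below reads WIG data in Python. It is not part of SPoRE and assumes the files follow the standard WIG layout (variableStep or fixedStep declarations followed by data lines); the function name parse_wig is hypothetical, and which declaration type the SPoRE files actually use is not stated here, so both are handled.

```python
def parse_wig(lines):
    """Yield (chrom, position, value) tuples from WIG-formatted lines.

    Handles the two standard declaration types, variableStep and fixedStep.
    The optional span parameter is ignored: only the start position of each
    data point is reported.
    """
    chrom = None
    mode = None
    position = None
    step = 1
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            mode = fields[0]
            params = dict(item.split("=", 1) for item in fields[1:])
            chrom = params["chrom"]
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
        elif mode == "variableStep":
            yield chrom, int(fields[0]), float(fields[1])
        elif mode == "fixedStep":
            yield chrom, position, float(fields[0])
            position += step
        else:
            raise ValueError("data line before any declaration: " + line)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly.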
Hotspot prediction:
To predict hot and cold spots, use the following command:
Rscript scripts/SPoRE_predict_hotspots.R spo11-spots DSB_model6-250
This will predict as hot or cold the spots listed in spots-input/spo11-spots.txt, using DSB
model 6 (i.e. the curve in DSB_model6-250.wig). Note that you need to generate the curves
as explained above before using hotspot prediction.
The example file provided, spo11-spots.txt, contains the hot and cold spots on chromosome
IV that we used for the benchmark we present in the article. Hence you can reproduce our
prediction, which gives an accuracy of 84%.
The "hot" column in the input file is optional. If present, it is interpreted as experimental data
telling whether each spot is actually a hotspot. In that case a benchmark is run: the numbers
of true/false positives/negatives are computed, as well as the accuracy.
The output is stored in spots-output/spo11-spots.txt (i.e. the same name as the input file, but
in the spots-output directory). Compared to the input table, the output table has two new
columns:
- predictedDensity, which predicts the density of DSBs in the spot
- predictedAsHot, which tells whether SPoRE predicts the spot as hot (TRUE) or cold (FALSE)
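To make the benchmark figures concrete, the sketch below recomputes them in Python from the "hot" and "predictedAsHot" columns once they have been loaded as booleans. It is not SPoRE's own code: the function name benchmark is hypothetical, and accuracy is assumed to follow the usual definition, (true positives + true negatives) / total.

```python
def benchmark(actual_hot, predicted_hot):
    """Count true/false positives/negatives and compute the accuracy.

    Both arguments are equally long sequences of booleans: the experimental
    "hot" column and the "predictedAsHot" column.
    """
    if len(actual_hot) != len(predicted_hot):
        raise ValueError("both columns must have the same number of rows")
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(actual_hot, predicted_hot):
        if predicted:
            counts["TP" if actual else "FP"] += 1
        else:
            counts["FN" if actual else "TN"] += 1
    total = len(actual_hot)
    # Assumed definition: fraction of spots whose prediction matches the experiment.
    accuracy = (counts["TP"] + counts["TN"]) / total if total else float("nan")
    return counts, accuracy
```

For instance, five spots of which three are predicted correctly give an accuracy of 0.6.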
Axis site prediction:
Axis site prediction works exactly like DSB hotspot prediction, except that you should use
“axis_model3-1500” as the second parameter instead of “DSB_model6-250”.
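Since the two kinds of prediction differ only in the model name, the command line and the file locations can be derived from two values. The Python sketch below illustrates that mapping using the names given in this README; the helper prediction_job and the "dsb"/"axis" keys are hypothetical and not part of SPoRE.

```python
# Model names as given in this README; the dictionary keys are illustrative.
MODELS = {"dsb": "DSB_model6-250", "axis": "axis_model3-1500"}


def prediction_job(spots_name, kind):
    """Return (argv, input_path, output_path) for one prediction run.

    spots_name is the input file name without ".txt"; kind is "dsb" or "axis".
    """
    model = MODELS[kind]
    argv = ["Rscript", "scripts/SPoRE_predict_hotspots.R", spots_name, model]
    input_path = "spots-input/" + spots_name + ".txt"
    output_path = "spots-output/" + spots_name + ".txt"
    return argv, input_path, output_path
```

The returned argv list can be handed to subprocess.run from the SPoRE root directory, after the curves have been generated.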
To adapt the analysis for another species, you have to change:
- genomes/yeast_genome.fasta, which contains the yeast genome in FASTA format. Each
  sequence is a chromosome, numbered from 1 to N (N=16 for S. cerevisiae).
- genome_info_matrix/yeast_genes_for_model.txt, which is a matrix with the genes to
  consider.
- data/TF.txt, which contains the TFBS positions for the genes. Change it if and only if you
  want to use model 7, which takes into account the Transcription Factor Binding Sites
  (TFBS) to define the promoter positions instead of an automatic approximation (as in
  models 3 to 6).
How to format this gene matrix:
Do not change the column names; our program refers to them. They are:
- id: unique id for the gene (it can be anything, as long as it is unique)
- chromosomeNumber: chromosome number from 1 to N (integer)
- strand: "FORWARD" or "REVERSE"
- positionMin: first position of the gene (included)
- positionMax: last position of the gene (included)
The positions are relative to the chromosome, with the first base numbered as 1.
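When preparing a gene matrix for a new genome, a quick sanity check of these columns can catch formatting mistakes early. The Python sketch below is not part of SPoRE. It assumes a tab-delimited file with a header row (the delimiter is not specified here, so adjust it to match the provided yeast file) and that positionMin is never greater than positionMax; the names read_table and validate_genes are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ["id", "chromosomeNumber", "strand", "positionMin", "positionMax"]


def read_table(text, delimiter="\t"):
    """Parse a delimited table with a header row. The tab delimiter is an assumption."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def validate_genes(rows, n_chromosomes):
    """Return a list of problems found in gene matrix rows (dicts keyed by column name)."""
    problems = []
    seen_ids = set()
    for number, row in enumerate(rows, start=1):
        missing = [name for name in REQUIRED_COLUMNS if row.get(name) is None]
        if missing:
            problems.append("row %d: missing %s" % (number, ", ".join(missing)))
            continue
        if row["id"] in seen_ids:
            problems.append("row %d: duplicate id %s" % (number, row["id"]))
        seen_ids.add(row["id"])
        if row["strand"] not in ("FORWARD", "REVERSE"):
            problems.append("row %d: strand must be FORWARD or REVERSE" % number)
        try:
            chromosome = int(row["chromosomeNumber"])
            first = int(row["positionMin"])
            last = int(row["positionMax"])
        except ValueError:
            problems.append("row %d: chromosome and positions must be integers" % number)
            continue
        if not 1 <= chromosome <= n_chromosomes:
            problems.append("row %d: chromosome outside 1..%d" % (number, n_chromosomes))
        # Positions are 1-based; positionMin <= positionMax is an assumption.
        if first < 1 or last < first:
            problems.append("row %d: invalid position range" % number)
    return problems
```

An empty list means that no problem was detected by these checks; it does not guarantee that SPoRE will accept the file.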
How to format the TF.txt matrix (only necessary for DSB model 7):
This matrix contains the transcription factor binding sites for each gene.
- chr: chromosome number from 1 to N (optional - unused by SPoRE)
- position: position on the chromosome where the transcription factor binds to regulate
  the target gene
- value: any number (optional - unused by SPoRE)
- target: the id of the target gene (an id appearing in the id column of the gene matrix)
- TF: a name for the transcription factor (optional - unused by SPoRE)
As you can see, only the “position” and “target” columns are actually used by SPoRE. The
chromosome number is not used because SPoRE assumes that the position of a TFBS of a
gene is on the same chromosome as the gene (which should be the case unless there is an
error in the data). Note that the values in the target and TF columns are not unique, since a
TF may regulate several genes, and several TFs may regulate a single gene. If a gene has no
TFBS at all (it never appears in the “target” column), then the promoter position
approximation of models 3-6 is used, so it is not a problem if the information is incomplete. In
the extreme case, if TF.txt is empty, model 7 will be identical to model 6.
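The fallback rule described above can be illustrated with a short Python sketch. It only shows which genes would rely on TFBS data and which on the approximation; how model 7 turns TFBS positions into a promoter position is not covered. The function names tfbs_by_gene and promoter_source, and the gene ids in the usage note, are hypothetical, and this is not SPoRE's own code.

```python
from collections import defaultdict


def tfbs_by_gene(tf_rows):
    """Group TFBS positions by target gene id.

    Only the "position" and "target" columns are read, mirroring what this
    README says SPoRE uses; other columns are ignored.
    """
    sites = defaultdict(list)
    for row in tf_rows:
        sites[row["target"]].append(int(row["position"]))
    return dict(sites)


def promoter_source(gene_ids, sites):
    """For each gene, report whether TFBS data or the approximation would apply."""
    return {
        gene: ("TFBS" if sites.get(gene) else "approximation")
        for gene in gene_ids
    }
```

With TFBS rows targeting only gene "g1", promoter_source(["g1", "g2"], sites) marks "g1" as "TFBS" and "g2" as "approximation"; with no rows at all, every gene falls back to the approximation, which matches the statement that an empty TF.txt makes model 7 identical to model 6.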