Determining presence/absence threshold for your dataset
In PanCGHweb there are two ways to determine the presence/absence calling threshold. One is based
on Receiver Operating Characteristic (ROC) curves generated for microarray data of reference strains;
the other is based on histograms of presence scores for ortholog groups (OGs) of reference strains.
Note: this guide uses publicly available data that were also used in the manuscript describing the
PanCGH algorithm (PMID: 19129208). However, the steps can be applied to any dataset. For the signal
distributions to be comparable across arrays, make sure that the arrays have been normalized both
within and between arrays (see Fig. 6).
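As a rough numeric counterpart to inspecting the box-and-whisker plot in Fig. 6, the sketch below computes a five-number summary per array and checks that the medians agree. This is only an illustration: the function names, the example signals, and the tolerance of 0.25 are hypothetical and are not part of PanCGHweb.

```python
from statistics import quantiles

def five_number_summary(signals):
    """Return (min, Q1, median, Q3, max) of one array's probe signals."""
    q1, median, q3 = quantiles(signals, n=4, method="inclusive")
    return (min(signals), q1, median, q3, max(signals))

def medians_agree(arrays, tolerance=0.25):
    """Return True if the medians of all arrays lie within `tolerance`.

    `tolerance` is an assumed value chosen for illustration only.
    """
    medians = [five_number_summary(s)[2] for s in arrays.values()]
    return max(medians) - min(medians) <= tolerance

# Hypothetical log-scale probe signals for two arrays.
arrays = {
    "IL1403": [4.0, 5.0, 6.0, 7.0, 8.0],
    "KF147": [4.1, 5.1, 6.1, 7.1, 8.1],
}

for name, signals in arrays.items():
    print(name, five_number_summary(signals))
print("medians agree:", medians_agree(arrays))
```

If the medians or quartiles differ clearly between arrays, the arrays are probably not yet normalized against each other, and a single threshold would not mean the same thing on each array.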
Finding optimal presence/absence calling threshold using ROC curves
ROCs enable a user to define an optimal presence / absence threshold taking into account the tradeoff
between false-positive rates and true-positive rates.
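To make the trade-off concrete, the following minimal sketch computes the points of a ROC curve from presence scores and known presence/absence labels. It assumes that an ortholog group is called present when its score is greater than or equal to the threshold; the function name and the example values are hypothetical and do not reflect PanCGHweb's internal implementation.

```python
def roc_points(scores, present, thresholds):
    """Return a list of (threshold, false_positive_rate, true_positive_rate).

    scores  -- presence score per ortholog group
    present -- True if the reference genome annotation contains the group
    An ortholog group is called present when score >= threshold (assumption).
    """
    positives = sum(present)
    negatives = len(present) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, p in zip(scores, present) if p and s >= t)
        fp = sum(1 for s, p in zip(scores, present) if not p and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points

# Hypothetical presence scores; the last three groups are truly present.
scores = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
present = [False, False, False, True, True, True]

for t, fpr, tpr in roc_points(scores, present, [4.5, 5.5, 6.5]):
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Lowering the threshold raises the true-positive rate but also the false-positive rate; raising it does the opposite.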
Generating ROCs and determining an optimal PanCGH presence/absence threshold
ROCs can be generated for your dataset following the steps below.
1. Select only the reference strains used for the array probe design from the NCBI GenBank drop-down
list (see Fig. 1.A). In demo run mode you can select the indicated 4 Lactococcus lactis strains.
2. Upload the microarray probe sequences as a FASTA file (see Fig. 1.B). If you are running the
program in demo mode, skip this step.
3. Click the ‘Upload File(s)’ button to proceed to upload array files.
4. Upload array files one by one. These files should contain probe signals for strains selected in
step 1 (see Fig. 2). Click ‘Proceed’ to go to the parameter settings page.
5. In the parameter settings page the option “Presence/absence calling threshold determination”
has a default value of “Predefined”. Change it to “Optimal” (see Fig. 3.A). The genotype calling
process will be initiated once you click “Proceed”.
6. In the run phase the genotype calling progress is shown (see Fig. 4). ROCs are generated
along with other plots after the genotype calling method has finished. Click the “ROC curves”
link (Fig. 5.B) to open a page that shows figures with ROCs (Fig. 7).
7. Each plot shows the ROCs of all reference strains based on the data of one uploaded array. For
example, Fig. 8 shows the ROCs of the 4 selected reference strains based on an array hybridized
with IL1403. The figure legend (bottom right corner in Fig. 8) shows the GenBank accession id of
each reference strain; NC_002662 is the GenBank accession id of L. lactis IL1403. A threshold
around 5.5 gives a favorable trade-off: a low false-positive rate at a high true-positive rate.
Figures 9 to 11 show the ROC curves of the reference strains for the three other arrays.
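One common way to turn a ROC curve into a single threshold is to maximize Youden's J statistic (true-positive rate minus false-positive rate). The sketch below illustrates that criterion; whether the “Optimal” setting in PanCGHweb uses this or another criterion is not stated here, and the function name and example scores are hypothetical.

```python
def best_threshold(scores, present, thresholds):
    """Return (threshold, J) with the highest Youden's J = TPR - FPR.

    An ortholog group is called present when score >= threshold (assumption).
    Ties are resolved in favor of the first threshold in `thresholds`.
    """
    positives = sum(present)
    negatives = len(present) - positives
    best_t, best_j = None, -1.0
    for t in thresholds:
        tpr = sum(1 for s, p in zip(scores, present) if p and s >= t) / positives
        fpr = sum(1 for s, p in zip(scores, present) if not p and s >= t) / negatives
        j = tpr - fpr
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Hypothetical presence scores for one array; the last three groups are present.
scores = [4.8, 5.0, 5.3, 5.6, 6.0, 6.4]
present = [False, False, False, True, True, True]

print(best_threshold(scores, present, [5.0, 5.5, 6.0]))
```

In practice the candidate thresholds would be a fine grid over the score range, and the result should be compared across all arrays before a value is fixed.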
Finding the optimal presence/absence calling threshold using histograms
The presence / absence calling threshold can also be determined using histograms that are created by
following the steps described below.
1. Select only reference strains from the GenBank list shown in Fig. 1.A. Do not upload any
sequence data other than the probe sequences (see Fig. 1.B).
2. Upload only array files on which the selected strains (Fig. 1.A) were hybridized, and set the
number of array files accordingly (see Fig. 1.C; in this example 4). Click the ‘Upload File(s)’
button to start uploading sequence files.
3. Upload the array files on which the selected strains were hybridized (see Fig. 2). Click ‘Proceed’
to go to the parameter settings page.
4. In the first run use default settings (see Fig. 3). After inspection of the histograms (see below),
an optimal presence / absence calling threshold can be determined and this value should be
used in the next run.
5. Click the “Histograms” link (see Fig. 5.C) to open a histogram for each reference strain. For each
strain, select the plot that is based on the array on which that strain was hybridized (see Fig. 12).
For instance, NC_002662 is the GenBank accession id of Lactococcus lactis IL1403, and the
corresponding array name used for this strain was IL1403. Opening that figure shows a
histogram as in Fig. 13, which displays the distribution of presence scores of OGs. From the genome
annotation, the presence/absence of genes in L. lactis IL1403 is known. Therefore, OGs are
divided into 2 groups: OGs containing at least one gene from IL1403 (black) and OGs with no
gene from IL1403 (grey). From this plot it can be concluded that an optimal
presence/absence threshold should be between 5.2 and 5.8. However, a pangenome array does not
target the genes of a single strain only, so the optimal presence/absence threshold should be
determined by considering the values for the other reference strains as well (see Figures 14 to
16).
6. Choose the presence/absence threshold that is optimal for all reference strains. Based on Figs.
13-16 a threshold of 5.5 is optimal for the 4 reference strains.
7. Restart the program (see Fig. 1.D).
8. Repeat steps 1 to 3, but in step 4 use the threshold value you determined in step 6 (see Fig. 3.B).
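The visual inspection in steps 5 and 6 can be mirrored numerically: for each candidate threshold, count the OGs whose call disagrees with the genome annotation, summed over all reference strains, and keep the candidate with the fewest disagreements. The sketch below is only an illustration under the assumption that an OG is called present when its score is greater than or equal to the threshold; all names and values are hypothetical and it is not PanCGHweb's own procedure.

```python
def misclassified(scores, present, threshold):
    """Count OGs whose call (score >= threshold) disagrees with the annotation."""
    return sum(1 for s, p in zip(scores, present) if (s >= threshold) != p)

def shared_threshold(strains, candidates):
    """Return the candidate with the fewest misclassified OGs over all strains.

    strains -- maps strain name to (scores, present) for its own array
    Ties are resolved in favor of the first candidate.
    """
    def total_errors(t):
        return sum(misclassified(sc, pr, t) for sc, pr in strains.values())
    return min(candidates, key=total_errors)

# Hypothetical presence scores and annotation-based labels for two strains.
strains = {
    "IL1403": ([4.9, 5.1, 5.9, 6.3], [False, False, True, True]),
    "KF147": ([4.7, 5.3, 5.6, 6.1], [False, False, True, True]),
}

print(shared_threshold(strains, [5.0, 5.5, 6.0]))
```

A threshold chosen this way is a compromise across strains, which matches the reasoning in step 6 that the value should suit all reference strains rather than a single one.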
Fig. 1. Start page of PanCGHweb (panels A to E).
Fig. 2. Upload microarray data.
Fig. 3. Parameter settings page (panels A and B).
Fig. 4. Run phase of PanCGHweb.
Fig. 5. Results page (panels A to C).
Fig. 6. Box and whisker plot of all probe signals.
Fig. 7. Page showing figures with ROC curves of reference strains.
Fig. 8. ROC curves of reference strains for array where strain IL1403 was hybridized.
Fig. 9. ROC curves of reference strains for array where strain KF147 was hybridized.
Fig. 10. ROC curves of reference strains for array where strain MG1363 was hybridized.
Fig. 11. ROC curves of reference strains for array where strain SK11 was hybridized.
Fig. 12. Page showing histograms of reference strains.
Fig. 13. Distribution of OGs containing at least one gene from IL1403 (black) and OGs containing no
gene from IL1403 (grey).
Fig. 14. Distribution of OGs containing at least one gene from KF147 (black) and OGs containing no
gene from KF147 (grey).
Fig. 15. Distribution of OGs containing at least one gene from MG1363 (black) and OGs containing no
gene from MG1363 (grey).
Fig. 16. Distribution of OGs containing at least one gene from SK11 (black) and OGs containing no
gene from SK11 (grey).