IBDWS 3.0 Appendix

Comments
Comments may be inserted on their own line anywhere in the input file. A comment
line must begin with a “;” (semicolon), which makes the entire line a comment. Comments
may not be placed at the end of a line, after input data.
Example of appropriate comment format:
; This is a comment you can write whatever you choose.
Example of inappropriate comment format:
Hemoglobin 1 3 ; A comment should never be at the end of an input line
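The comment rule above can be sketched in Python as follows. This is only an illustration, not the IBDWS implementation: the function name strip_comments is hypothetical, and raising an error for a trailing comment is an assumption, since the text only says such comments are not allowed.

```python
def strip_comments(lines):
    """Return the lines that are not comments.

    Assumption: a ';' anywhere after input data is reported as an error.
    """
    kept = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(";"):
            continue  # whole-line comment, skipped
        if ";" in line:
            raise ValueError(f"line {number}: comment after input data")
        kept.append(line)
    return kept
```

Applied to the two examples above, the first line would be dropped silently and the second would be reported as an error.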
Fasta block
The input file starts with the desired DNA sequence data in FASTA [1] format.
The first word immediately following the ‘>’ symbol (with no space in between), up to the
first white space, is taken as the name of the sequence. Keep note of the sequence name,
because later input parameters refer to it. The name is case sensitive when it is matched
in the population information block.
Example:
>Brown This is the description after the first space of this line
Following the description line (the line starting with “>”) is the sequence. If an
RNA sequence is submitted, ‘U’ (Uracil) will be replaced by ‘T’ (Thymine). Protein
sequences cannot be processed at this time. All sequences submitted must be of
the same length. Valid characters are ‘A’, ‘C’, ‘G’, ‘T’, ‘-’ (for gap), ‘:’ (for gap), ‘?’ (for
unknown), and ‘N’ (for unknown). Characters are case insensitive. All other characters
will be converted to ‘N’ by default in the Python check. A sequence need not be
on a single line. The maximum description length is 50 characters and the
maximum sequence length is 10000 characters. Each character in a sequence is
treated as a separate site. Consecutive gaps, such as ‘--’, can be treated either as one
character or as two, depending on the settings. See examples below.
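As a rough illustration of the rules in this block, the following Python sketch reads FASTA lines, takes the first word after ‘>’ as the name, and normalizes each sequence. The names VALID, normalize_sequence, and parse_fasta are hypothetical, and the sketch is not the actual Python check used by IBDWS; the length limits are not enforced here.

```python
VALID = set("ACGT-:?N")  # valid characters listed in the text above

def normalize_sequence(raw):
    """Uppercase, map U to T, and turn any other invalid character into N."""
    cleaned = []
    for ch in raw.upper():
        if ch == "U":
            ch = "T"
        cleaned.append(ch if ch in VALID else "N")
    return "".join(cleaned)

def parse_fasta(lines):
    """Return {name: sequence}; a sequence may span several lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("missing sequence name after '>'")
            name = words[0]  # first word after '>', case preserved
            records[name] = ""
        elif name is not None:
            records[name] += normalize_sequence("".join(line.split()))
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("all sequences must have the same length")
    return records
```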
Treating Consecutive Gaps as One
When this option is chosen, every position that immediately follows a gap
and is itself a gap is ignored. If gap runs in different sequences overlap,
positions stop being skipped as soon as the gap run that was encountered
first ends.
Given the following sequence alignment:
AGCCTAGACT
AGC---AACT
If we treat consecutive gaps as one, the alignment becomes:
AGCCGACT
AGC-AACT
[1] P. Hardy, “FASTA Format,” Computational Molecular Biology and Bioinformatics, University of Southern California, 1996,
http://www-hto.usc.edu/software/seqaln/doc/fasta-format.html
For an alignment with overlapping gaps, such as:
TGAC---AGACT
TGACG---AACT
The alignment after treating consecutive gaps as one will be:
TGAC-AGACT
TGACG-AACT
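One way to reproduce both examples above is the column scan sketched below in Python. It is an interpretation of the rule, not the IBDWS source: the function name collapse_consecutive_gaps is hypothetical, and when several sequences have a gap in the same starting column, following the first such sequence is an assumption.

```python
GAP_CHARS = {"-", ":"}  # both gap symbols accepted in the Fasta block

def collapse_consecutive_gaps(seqs):
    """Drop alignment columns that merely continue a gap run."""
    length = len(seqs[0])
    keep = []
    col = 0
    while col < length:
        keep.append(col)  # the first column of any gap run is always kept
        # Assumption: the first sequence with a gap here defines the run.
        owner = next((s for s in seqs if s[col] in GAP_CHARS), None)
        col += 1
        if owner is not None:
            # Skip columns while that same sequence still shows a gap.
            while col < length and owner[col] in GAP_CHARS:
                col += 1
    return ["".join(s[i] for i in keep) for s in seqs]
```

With the overlapping example, the run that starts in the first sequence ends before the last gap of the second sequence, so that last gap starts a new run and is kept.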
Ploidy Level
Haploid Data (e.g., mtDNA sequences)
Following the FASTA entries, the marker “HAPLOID” or “DIPLOID” is
required to indicate whether the individual organisms to be analyzed are haploid
or diploid, respectively. For haploid data, enter the marker “HAPLOID”.
Diploid Data (two sequences per individual)
If diploid data are entered, the marker “DIPLOID” must be used.
Diploid data have two sequences per individual, so each individual has its
individual number and population number listed twice in the population
information block. The next section gives more detail.
Population Information Block
Next, the marker “POPULATION_INFO” is required. The information provided
below this marker associates each DNA allele sequence with an individual and a
population. Each line will be of the form:
<sequence name> <space or tab> <individual number> <space or tab> <population number>
Example:
gene1 1 1
gene2 2 1
gene3 3 2
gene4 4 2
gene5 5 2
As stated earlier, keep note of the sequence names, because the name is
the first parameter of each line above. The <sequence name> is the first word following
the ‘>’ symbol and must match one of the submitted FASTA entries; otherwise an error
will be thrown. The case of the letters in the sequence name must also match that given
in the Fasta block. A space or tab follows the name; IBDWS filters out any additional
spaces or tabs. The <individual number> and <population number> are integer
values separated by a space or tab.
Individual numbers must be unique, except that each diploid individual is
listed twice, once per allele copy. Once an individual is associated with a population,
the second allele copy must have the same individual-population association.
Example:
gene1 1 1
gene2 1 1
gene3 2 1
gene5 2 1
gene4 3 2
gene1 3 2
Individual numbers need not be listed in any particular order, but together they
must cover every integer from 1 to the number of individuals. For example, if there
are 3 individuals, individuals ‘1’, ‘2’, and ‘3’ must all exist. If the individuals are
numbered ‘2’, ‘3’, and ‘6’, an error will be thrown. The same applies to population numbers.
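The checks described in this block can be sketched in Python as below. The function name parse_population_info and its parameters are hypothetical, and the exact errors IBDWS reports are not specified in the text, so a plain ValueError stands in for them.

```python
def parse_population_info(lines, known_names, diploid=False):
    """Return (name, individual, population) tuples after basic checks."""
    rows = []
    population_of = {}  # individual number -> population number
    copies = {}         # individual number -> number of lines seen
    for line in lines:
        fields = line.split()  # collapses repeated spaces or tabs
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields: {line!r}")
        name, individual, population = fields[0], int(fields[1]), int(fields[2])
        if name not in known_names:  # case-sensitive match
            raise ValueError(f"unknown sequence name: {name}")
        if population_of.setdefault(individual, population) != population:
            raise ValueError(f"individual {individual} is in two populations")
        copies[individual] = copies.get(individual, 0) + 1
        rows.append((name, individual, population))
    expected = 2 if diploid else 1
    if any(count != expected for count in copies.values()):
        raise ValueError(f"each individual must appear {expected} time(s)")
    checks = (("individual", set(copies)),
              ("population", set(population_of.values())))
    for label, used in checks:
        # Numbers must cover 1..N with no holes.
        if used != set(range(1, len(used) + 1)):
            raise ValueError(f"{label} numbers must run from 1 to {len(used)}")
    return rows
```

Here known_names would be the set of sequence names read from the Fasta block.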
Geographic Distance Block
Next, the marker “GEOGRAPHIC_DISTANCE” is required. Below this marker
is a list of the distances between all population pairs. Each line will be of the form:
<population number> <space or tab> <population number> <space or tab> <geographic distance>
<population number> is an integer value. Each <geographic distance> is a double value
greater than 0. The population pairs must be listed in consecutive and nested order: all
pairs beginning with population 1 first, then the remaining pairs beginning with
population 2, and so on. The following example lists geographic distances for four
populations. For DNA data, the matrix format for geographic distances is currently not enabled.
Example:
1 2 5.983
1 3 2.843
1 4 .343
2 3 10.938
2 4 1.293
3 4 8.332
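The ordering shown in the example matches what itertools.combinations produces for the population numbers, which suggests the following Python sketch for checking this block. The function name parse_pairwise_block and the require_positive parameter are hypothetical, and reading "consecutive and nested order" as this exact pair sequence is an interpretation based on the example.

```python
from itertools import combinations

def parse_pairwise_block(lines, n_populations, require_positive=True):
    """Return {(pop_a, pop_b): value} for pairs given in nested order."""
    # For 4 populations: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
    expected_pairs = list(combinations(range(1, n_populations + 1), 2))
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) != len(expected_pairs):
        raise ValueError(f"expected {len(expected_pairs)} pairs, got {len(rows)}")
    values = {}
    for (a, b), fields in zip(expected_pairs, rows):
        if len(fields) != 3 or (int(fields[0]), int(fields[1])) != (a, b):
            raise ValueError(f"expected pair {a} {b}, got {fields}")
        value = float(fields[2])  # accepts forms such as ".343"
        if require_positive and value <= 0:
            raise ValueError(f"distance for pair {a} {b} must be greater than 0")
        values[(a, b)] = value
    return values
```

Because indicator lines follow the same pair ordering but allow any sign, the same sketch could be reused for them with require_positive=False.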
Table A.1. Input Fields and Associated Data Types
Input Field                                    Data Type      Maximum Length/Value
Fasta header (includes name and description)   string         50
Sequence                                       string         10000
sequence name                                  string         50
individual number                              unsigned int   10000
population number                              unsigned int   500
geographic distance                            double         -
indicator value                                double         -
Indicator Block
Finally, an optional “INDICATORS” marker can be provided. Indicators may
reflect, for example, the presence of barriers to dispersal between some population pairs
or an alternative geographic distance calculation. The significance of the indicators is
assessed with a partial Mantel test. Each indicator line will be of the form:
<population number> <space or tab> <population number> <space or tab> <indicator value>
The indicators field has the same restrictions as the geographic distances field:
population numbers must be integers, and the population pairs must be in consecutive
and nested order. <indicator value> is a double value and can be positive, negative, or zero.
For DNA data, the matrix format for indicators is currently not enabled. More
information about the indicator format can be found at
http://www.bio.sdsu.edu/pub/andy/IBDManual.pdf.
Additionally, comments may be included in the input data. Each comment must be on
a line of its own and begin with a ‘;’ at the start of the line. Comments may be placed at
any point in the input data file.
Example:
;this is a comment
>name1 description
ACCTCTCCGCTACCTC
;follow up comment
>name2 description
CCCGCTCCGCTACCTC
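Putting the markers described in this appendix together, an input file could be split into its blocks roughly as in the Python sketch below. The function name split_blocks is hypothetical, and the sketch assumes that each marker sits alone on its own line, which the text does not state explicitly.

```python
PLOIDY_MARKERS = ("HAPLOID", "DIPLOID")
BLOCK_MARKERS = ("POPULATION_INFO", "GEOGRAPHIC_DISTANCE", "INDICATORS")

def split_blocks(lines):
    """Return (ploidy, blocks), where blocks maps a block name to its lines."""
    blocks = {"FASTA": []}  # the file starts with the Fasta block
    current = "FASTA"
    ploidy = None
    for line in lines:
        text = line.strip()
        if not text or text.startswith(";"):
            continue  # blank line or comment line
        if text in PLOIDY_MARKERS:
            ploidy = text
        elif text in BLOCK_MARKERS:
            current = text
            blocks[current] = []
        else:
            blocks[current].append(text)
    return ploidy, blocks
```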