Comments

Comments may be inserted on their own line anywhere in the input file. A comment line must begin with a “;” (semicolon), which makes the entire line a comment. Comments may not be placed at the end of a line, after input data.

Example of appropriate comment format:

    ; This is a comment you can write whatever you choose.

Example of inappropriate comment format:

    Hemoglobin 1 3 ; A comment should never be at the end of an input line

Fasta block

The input file starts with the desired DNA sequence data in FASTA [1] format. The first word immediately following the ‘>’ symbol (with no space in between) is taken as the name of the sequence; the name ends at the first white space. Keep note of the sequence name, because later input parameters refer to it. The name is case sensitive when it is matched in the population information block.

Example:

    >Brown This is the description after the first space of this line

The sequence follows the description line (the line starting with ‘>’). If an RNA sequence is submitted, ‘U’ (uracil) is replaced by ‘T’ (thymine). Protein sequences cannot be processed at this time. All submitted sequences must be of the same length. Valid characters are ‘A’, ‘C’, ‘G’, ‘T’, ‘-’ (gap), ‘:’ (gap), ‘?’ (unknown), and ‘N’ (unknown). Characters are case insensitive. All other characters are converted to ‘N’ by default in the Python check. A sequence need not be on a single line. The maximum description length is 50 characters and the maximum sequence length is 10000 characters.

Each character in a sequence is treated as a separate site. Consecutive gaps, such as ‘--’, can be treated as either one character or two, depending on the settings. See the examples below.

Treating Consecutive Gaps as One

When this option is chosen, each position that follows a gap and is itself a gap is ignored.
If runs of consecutive gaps in different sequences overlap, positions stop being skipped as soon as the gap that was encountered first, together with its successive gaps, ends. Given the following sequence alignment:

    AGCCTAGACT
    AGC---AACT

treating consecutive gaps as one gives:

    AGCCGACT
    AGC-AACT

For an overlapping alignment such as:

    TGAC---AGACT
    TGACG---AACT

the alignment after treating consecutive gaps as one is:

    TGAC-AGACT
    TGACG-AACT

[1] Computational Molecular Biology and Bioinformatics, University of Southern California, P. Hardy, “FASTA Format,” 1996, http://www-hto.usc.edu/software/seqaln/doc/fasta-format.html

Ploidy Level

Haploid Data (e.g., mtDNA sequences)

Following the FASTA entries, the marker “HAPLOID” or “DIPLOID” is required to indicate whether the individual organisms to be analyzed are haploid or diploid, respectively. For haploid data, enter the marker “HAPLOID”.

Diploid Data (two sequences per individual)

For diploid data, enter the marker “DIPLOID”. Diploid data have two sequences per individual, so each individual has its individual number and population number listed twice in the population information block. The next section gives more details.

Population Information Block

Next, the marker “POPULATION_INFO” is required. The lines below this marker associate each DNA allele sequence with an individual and a population. Each line has the form:

    <sequence name> <space or tab> <individual number> <space or tab> <population number>

Example:

    gene1 1 1
    gene2 2 1
    gene3 3 2
    gene4 4 2
    gene5 5 2

As stated earlier, keep note of the sequence names, because the name is the first field of each line. The <sequence name> is the first word following the ‘>’ symbol and must match one of the submitted FASTA entries; otherwise an error is thrown.
Again, the case of the letters in the sequence name must match that given in the Fasta block. The name is followed by a space or tab; IBDWS ignores any additional spaces or tabs. The <individual number> and <population number> are integer values separated by a space or tab. Individual numbers must be unique, except that each diploid individual is listed twice. Once an individual is associated with a population, the line for its second allele copy must give the same individual-population association.

Example:

    gene1 1 1
    gene2 1 1
    gene3 2 1
    gene5 2 1
    gene4 3 2
    gene1 3 2

Although individual numbers need not be listed in any particular order, every integer from 1 to the number of individuals must be used. For example, if there are 3 individuals, individuals ‘1’, ‘2’, and ‘3’ must exist. If the individuals are numbered ‘2’, ‘3’, and ‘6’, an error is thrown. The same applies to population numbers.

Geographic Distance Block

Next, the marker “GEOGRAPHIC_DISTANCE” is required. Below this marker is a list of the distances between all population pairs. Each line has the form:

    <population number> <space or tab> <population number> <space or tab> <geographic distance>

Each <population number> is an integer value. Each <geographic distance> is a double value greater than 0. The paired population identifiers must be in consecutive and nested order, as in the example below. For DNA data, the matrix format for geographic distances is currently not enabled.

The following example lists geographic distances for four populations.

Example:

    1 2 5.983
    1 3 2.843
    1 4 .343
    2 3 10.938
    2 4 1.293
    3 4 8.332

Table A.1. Input Fields and Associated Data Types

    Input Field                                     Data Type       Maximum Length/Value
    Fasta header (includes name and description)    string          50
    Sequence                                        string          10000
    sequence name                                   string          50
    individual number                               unsigned int    10000
    population number                               unsigned int    500
    geographic distance                             double          -
    indicator value                                 double          -

Indicator Block

Finally, an optional “INDICATORS” marker can be provided. These indicators may reflect the presence of barriers to dispersal between some population pairs, an alternative geographic distance calculation, etc. The significance of the indicators is assessed with a partial Mantel test. Each indicator line has the form:

    <population number> <space or tab> <population number> <space or tab> <indicator value>

The indicators field has the same restrictions as the geographic distances field: population numbers must be integers and must be in consecutive and nested order. The <indicator value> is a double value and can be positive, negative, or zero. For DNA data, the matrix format for indicators is currently not enabled. More information about the indicator format can be found at http://www.bio.sdsu.edu/pub/andy/IBDManual.pdf.

Additionally, comments may be included in the input data. Each comment must be on a line of its own and begin with a ‘;’ at the front of the line. Comments may be put at any point in the input data file.

Example:

    ;this is a comment
    >name1 description
    ACCTCTCCGCTACCTC
    ;follow up comment
    >name2 description
    CCCGCTCCGCTACCTC
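The rule described under "Treating Consecutive Gaps as One" can be illustrated with a short Python sketch. This is only one reading of the rule that reproduces the two alignments shown in that section, not the IBDWS implementation. The names GAP_CHARS and collapse_consecutive_gaps are hypothetical. The text does not say what happens when several sequences start a gap in the same column, so the sketch assumes that positions keep being skipped while any of those sequences is still gapped.

```python
# Hypothetical helper, not part of IBDWS: both gap symbols accepted by the Fasta block.
GAP_CHARS = {"-", ":"}


def collapse_consecutive_gaps(alignment):
    """Return the alignment with the columns that only continue a gap run removed.

    alignment is a list of equal-length strings. The first column of a gap
    run is kept; later columns are skipped while the run that started in the
    kept column is still going (assumed reading of the rule).
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all sequences must be of the same length")
    length = len(alignment[0]) if alignment else 0

    kept_columns = []
    active = set()  # indices of sequences whose gap run began at the last kept column
    for col in range(length):
        # Keep only the sequences whose run is still a gap at this column.
        active = {i for i in active if alignment[i][col] in GAP_CHARS}
        if active:
            continue  # still inside the initial gap run: bypass this position
        kept_columns.append(col)
        # A kept column may itself start a new gap run.
        active = {i for i, seq in enumerate(alignment) if seq[col] in GAP_CHARS}

    return ["".join(seq[col] for col in kept_columns) for seq in alignment]


# The two alignments from the section on treating consecutive gaps as one.
print(collapse_consecutive_gaps(["AGCCTAGACT", "AGC---AACT"]))
# ['AGCCGACT', 'AGC-AACT']
print(collapse_consecutive_gaps(["TGAC---AGACT", "TGACG---AACT"]))
# ['TGAC-AGACT', 'TGACG-AACT']
```

In the second alignment, the gap in the first sequence is encountered first, so the two positions that continue it are skipped. The last gap position of the second sequence is kept because the first run has already ended there.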