Download GBS Pipeline Documentation. - WSU Plant Pathology

GBS Pipeline (GIO) Documentation

The contents of the “Results” folder are divided into three subfolders: “Genotypes”, “LinkageGroups”, and “TagsAnalyzed”. Each of these folders contains different information about the results of your GBS analysis. The contents of each subfolder are described below.

Genotypes: This folder contains the 12 files listed below. Some of them contain genotype calls for the parents against the population. These calls include information about each SNP, such as its position and chromosome assignments when compared to the Chinese Spring draft assembly and the Washington Wheat Transcriptome. Several columns appear only in files 2, 3, 4, and 5 and provide more information on the genotypes called. These columns are titled:

a) SNP_tag: The common sequence between parent A and parent B, with the SNP represented using the IUPAC nomenclature.
b) Type: Indicates whether the call is an indel or a SNP.
c) SNP_position(s): Indicates the nucleotide position at which the SNP(s) occurred. Add 1 to the given value to find the actual nucleotide location.
d) Gene?: Shows whether the GBS tag was homologous with the Washington Wheat Transcriptome. Note: NonGene indicates only that the tag was not found in the transcriptome database; it could still be a gene.
e) Chromo_count: The number of chromosome arms with which the GBS tag was homologous.
f) Chromos_TagA: A list of the chromosome arms whose sequences had homology with the GBS tag in parent A.
g) Chromos_TagB: A list of the chromosome arms whose sequences had homology with the GBS tag in parent B.
h) TagA: The actual sequence/GBS tag in parent A that had SNP(s) or indel(s) relative to TagB.
i) TagB: The actual sequence/GBS tag in parent B that had SNP(s) or indel(s) relative to TagA.

1. Neighbors.txt/NeighborCosegregation.txt: Contains the set of dominant and co-dominant tags that co-segregate in 65% of the progeny.

2. CalledGenotypesWithFiltersMMDDYYYY.txt: This file has the information on all possible SNPs between the parents across the population, provided the SNP is present in 20% to 80% of the population (the default range, unless the user specifies otherwise). Genotypes are represented by the four characters A, B, H, and U.
   A: The individual has parent A’s sequence/GBS tag.
   B: The individual has parent B’s sequence/GBS tag.
   H: The individual is heterozygous, having both parents’ sequences.
   U: The data for this individual is missing for this particular SNP.
   Note: Missing data does not necessarily mean the sequence is truly absent. Due to the nature of GBS technology, a GBS tag that is present in a parent and in a particular progeny may still not have been sequenced in that progeny.
   MarkerA and MarkerB are identifiers for the corresponding GBS tags in parent A and parent B, respectively.

3. CalledGenotypesWithFilters.MMDDYYYY.txt.Nonrepetitive: This file has all the nonhomeologous SNPs for the parents against the population, provided the SNP is present in 20% to 80% of the population (the default range, unless the user specifies otherwise). MarkerA and MarkerB are identifiers for the corresponding GBS tags in parent A and parent B, respectively.

4. CalledGenotypesWithFilters.MMDDYYYY_IMPUTED.txt.Nonrepetitive: This file has the same information as file 3, except that each missing value (represented by a ‘U’) has been imputed using Random Forests. This machine learning technique builds a large number of prediction trees, then takes the mean value across these trees to impute the missing value in the column containing the genotypes for that progeny. More information on this subject can be found on Wikipedia. The R software package used for imputation is “missForest”.
   Citation: Stekhoven, D., Buhlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. 2011. Bioinformatics, Oxford Journals. Vol 28, Issue 1. Pg. 112-118. [http://bioinformatics.oxfordjournals.org/content/28/1/112]
   NOTE: Some columns may be missing from this file, because missForest automatically removes columns in which every element is missing. (Missing values cannot be imputed when there is no neighbor information.)

5. CalledGenotypesWithFilters.MMDDYYYY.txt.Repetitive: This file has all the homeologues present in the CalledGenotypesWithFiltersMMDDYYYY.txt file. It will generally be much larger than the nonrepetitive version. MarkerA and MarkerB are identifiers for the corresponding GBS tags in parent A and parent B, respectively.

6. ClusteredNonrepetitiveMarkersForMapping.txt: This file gives essentially the same information as CalledGenotypesWithFilters.MMDDYYYY.txt.Nonrepetitive, except that a column inserted at the beginning of the spreadsheet indicates which cluster each SNP belongs to. MarkerA and MarkerB are identifiers for the corresponding GBS tags in parent A and parent B, respectively.

7. Labeled.ParentAdominantSNPS.txt: This file contains the dominant sequences found in parent A and their presence or absence in the population resulting from the cross between parent A and parent B. If a progeny has a ‘C’ in its cell, that sequence is present in that individual. For information on the columns “Gene?”, “ChromosCount”, and “Chromos”, please refer to items d), e), and f) above. “Marker” is the unique identifier given to this dominant parent A GBS tag. “Tag” is the actual sequence or GBS tag unique to parent A.

8. Labeled.ParentBdominantSNPS.txt: This file contains the dominant sequences found in parent B and their presence or absence in the population resulting from the cross between parent A and parent B. If a progeny has a ‘D’ in its cell, that sequence is present in that individual. For information on the columns “Gene?”, “ChromosCount”, and “Chromos”, please refer to items d), e), and g) above.
“Marker” is the unique identifier given to this dominant parent B GBS tag. “Tag” is the actual sequence or GBS tag unique to parent B.

9. List_of_Names.txt: In the header row of files such as CalledGenotypesWithFiltersMMDDYYYY.txt, numbers represent specific members of the population. The numbers in List_of_Names.txt correspond to these header numbers, and the fastq file listed after each number is the member of the population it corresponds to. Use this file to identify which individual each column belongs to.

10. MarkerNeighbors.txt: The genotypes for the population corresponding to all the markers (Codominant, Parent A Dominant, Parent B Dominant) listed in Neighbors.txt/NeighborCosegregation.txt.

11. PresenceAbsenceCodominant.txt: This file describes the presence of codominant SNPs in the progeny of the cross between parent A and parent B. A 1 in a cell means the SNP is present in that individual. Any individual carrying both parents’ SNPs of a particular SNP pair is considered heterozygous and is represented by an ‘H’ in the CalledGenotypesWithFilters.MMDDYYYY.txt.Nonrepetitive file.

12. ReadDepth.Nonrepetitive: This Excel file gives the read depth of each codominant marker for every member of the population that was analyzed. The read depth is the number of times each marker was detected in each individual.

LinkageGroups: This folder contains files related to the positions of many of the SNPs identified. The files were generated with MSTmap and give the position of each sequence relative to the other markers supplied to the program. /LinkageGroups/ may include two subdirectories, Chromosomes/ and CosegregationClusters/. Chromosomes/ contains genetic map data for each chromosome arm, based on markers homologous to the draft assembly of the Chinese Spring genome.
CosegregationClusters/ contains genetic map data for each chromosome arm, constructed from clusters of markers grouped by cosegregation, regardless of chromosome assignment.

TagsAnalyzed: This folder has the unique set of sequences within each parent and each member of the population. The sequences are 75 base pairs long, or a user-specified length.

Association Panel Analysis: Results are similar to those described above, but because there is no segregation data, linkage groups cannot be discovered. The tag frequency data can be used for association studies such as genome-wide association studies.

Reference: GIO - Genes In Order - Linux scripts to convert raw NGS sequencing data to genetic maps, Daniel Z. Skinner*, Vandhana Krishnan, Deven See (unpublished). USDA-ARS and Washington State University, Pullman, WA

Imputation Method: https://en.wikipedia.org/wiki/Random_forest
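
Example: The Genotypes section says to add 1 to each SNP_position value and describes the genotype codes A, B, H, and U. The Python sketch below illustrates both points. It is not part of the GIO pipeline. The function names actual_positions and summarize_calls are hypothetical, and the sketch assumes the positions and genotype calls have already been read into Python lists, because the column delimiter and file layout are not described in this document.

```python
from collections import Counter

# Genotype codes as described for the CalledGenotypesWithFilters files.
GENOTYPE_CODES = ("A", "B", "H", "U")


def actual_positions(reported_positions):
    """Add 1 to each reported SNP_position value to get the actual location.

    Assumes the values were already parsed into integers (hypothetical input).
    """
    return [position + 1 for position in reported_positions]


def summarize_calls(calls):
    """Count A/B/H/U calls for one marker and report the non-missing fraction.

    'calls' is assumed to be a list of single-character strings, one per
    member of the population.
    """
    unexpected = sorted(set(calls) - set(GENOTYPE_CODES))
    if unexpected:
        raise ValueError(f"unexpected genotype code(s): {unexpected}")
    counts = Counter(calls)
    summary = {code: counts[code] for code in GENOTYPE_CODES}
    called = len(calls) - summary["U"]
    # Guard against an empty marker row to avoid dividing by zero.
    summary["called_fraction"] = called / len(calls) if calls else 0.0
    return summary


if __name__ == "__main__":
    print(actual_positions([4, 17]))
    print(summarize_calls(["A", "B", "H", "U", "A"]))
```

The called fraction computed here is only a convenience for inspecting how much data is missing for a marker. It is not claimed to be the calculation the pipeline uses for its 20% to 80% filter.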
