GBS Pipeline (GIO) Documentation
The contents of the “Results” folder are divided into three subfolders: “Genotypes”, “LinkageGroups”, and
“TagsAnalyzed”. Each of these three folders contains unique information about the results of your GBS
analysis. The contents of each subfolder are described below.
Genotypes: This folder contains 12 files, numbered below. Some of these contain genotype calls for the
parents against the population. These calls include information about each SNP, such as its
position and chromosome assignment, as determined by comparison with the Chinese Spring draft
assembly and the Washington Wheat Transcriptome. Several columns are found only in files 2, 3, 4,
and 5 and provide more information on the genotypes called. These columns are titled:
a) SNP_tag: The common sequence between parent A and parent B, with the SNP represented
using the IUPAC nomenclature.
b) Type: Indicates whether the call is an indel or a SNP.
c) SNP_position(s): Indicates the nucleotide position(s) at which the SNP(s) occurred. Add 1 to
the given value to find the actual nucleotide location.
d) Gene?: Shows whether the GBS tag was homologous with the Washington Wheat
Transcriptome. Note: NonGene indicates only that the tag was not found in the transcriptome
database; it could still be a gene.
e) Chromo_count: The number of chromosome arms the GBS tag was homologous with.
f) Chromos_TagA: A list of the chromosome arms whose sequences had homology with the GBS
tag in parent A
g) Chromos_TagB: A list of the chromosome arms whose sequences had homology with the GBS
tag in parent B
h) TagA: The actual sequence/GBS tag in parent A that had SNP(s) or indel(s) with TagB.
i) TagB: The actual sequence/GBS tag in parent B that had SNP(s) or indel(s) with TagA.
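As an illustration of columns a) and c), the sketch below looks up the IUPAC code at each reported position of an SNP_tag and converts the position to the actual nucleotide location by adding 1. It assumes the reported value can be used directly as a zero-based offset into the SNP_tag string, which the text implies but does not state. The function name and the example tag are hypothetical and are not part of the pipeline.

```python
# Standard IUPAC ambiguity codes for two-allele sites.
IUPAC_PAIRS = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
}


def describe_snps(snp_tag, positions):
    """Return (actual_location, iupac_code, alleles) for each reported position.

    Assumption: each value in `positions` is a zero-based offset into
    `snp_tag`, so adding 1 gives the actual nucleotide location.
    """
    sites = []
    for pos in positions:
        code = snp_tag[pos].upper()
        alleles = IUPAC_PAIRS.get(code)  # None if not a two-allele code
        sites.append((pos + 1, code, alleles))
    return sites


# Made-up 12-base tag with an R (A or G) at offset 4.
print(describe_snps("ACGTRACGTACG", [4]))  # [(5, 'R', ('A', 'G'))]
```

How the pipeline writes multiple positions in one cell is not described here, so the sketch takes an already parsed list of integers.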
1. Neighbors.txt/NeighborCosegregation.txt: Contains the set of dominant and co-dominant
tags that co-segregate in 65% of the progeny.
2. CalledGenotypesWithFiltersMMDDYYY.txt: This file has the information on all possible
SNPs between the parents across the population, for SNPs present in 20% to 80% of the
population (the default range, unless the user specifies otherwise). Some of this information
is represented by the four characters A, B, H, and U.
A: The individual has parent A’s sequence/GBS tag
B: The individual has parent B’s sequence/GBS tag
H: The individual is heterozygous, having both parents’ sequences
U: The data for this individual is missing for this particular SNP
Note: Missing data does not necessarily represent a truly missing sequence in all cases. Due
to the nature of the GBS technology, a GBS tag that is present in a parent may not have
been sequenced even if it is present in a particular progeny.
MarkerA and MarkerB are identifiers for the corresponding GBS tags in parent A and parent
B respectively.
3. CalledGenotypesWithFilters.MMDDYYYY.txt.Nonrepetitive: This file has all the
non-homeologous SNPs for the parents against the population, for SNPs present in 20% to 80%
of the population (the default range, unless the user specifies otherwise). MarkerA and
MarkerB are identifiers for the corresponding GBS tags in parent A and parent B respectively.
4. CalledGenotypesWithFilters.MMDDYYYY_IMPUTED.txt.Nonrepetitive: This file has the
same information as 3, except that each missing value (represented with a ‘U’) has been
imputed using Random Forests. This machine learning technique builds a large number of
prediction trees and then combines their predictions (the mean for numeric values, the
majority vote for categorical values) to impute the missing value in the column containing
the genotypes for that progeny. More information on this subject can be found on Wikipedia.
The ‘R’ software package used for imputation is “missForest”.
Citation: Stekhoven, D., Buhlmann, P. MissForest—non-parametric missing value
imputation for mixed-type data. 2011. Bioinformatics, Oxford Journals. Vol 28, Issue 1. Pg.
112-118. [http://bioinformatics.oxfordjournals.org/content/28/1/112]
NOTE: Some columns may be missing from this file, as missForest automatically removes
columns that have missing values for every element in that column. (You cannot impute
missing values with no neighbor information!)
5. CalledGenotypesWithFilters.MMDDYYYY.txt.Repetitive: This file has all the homeologues
present in the CalledGenotypesWithFiltersMMDDYYY.txt file. It will generally be much
larger than the nonrepetitive version. MarkerA and MarkerB are identifiers for the
corresponding GBS tags in parent A and parent B respectively.
6. ClusteredNonrepetitiveMarkersForMapping.txt: This file gives essentially the same
information as CalledGenotypesWithFilters.MMDDYYYY.txt.Nonrepetitive, except that
a column inserted at the beginning of the spreadsheet indicates which cluster each
SNP belongs to. MarkerA and MarkerB are identifiers for the corresponding GBS tags in
parent A and parent B respectively.
7. Labeled.ParentAdominantSNPS.txt: This file contains the dominant sequences found in
parent A and their presence or absence in the population resulting from the cross between
parent A and parent B. If the progeny has a ‘C’ in its respective cell, that particular
sequence is present in that individual. For information on the columns “Gene?”,
“ChromosCount”, and “Chromos”, please refer to items d), e), and f) above. “Marker” is the
unique identifier given to this dominant parent A GBS tag. “Tag” is the actual sequence or
GBS tag unique to parent A.
8. Labeled.ParentBdominantSNPS.txt: This file contains the dominant sequences found in
parent B and their presence or absence in the population resulting from the cross between
parent A and parent B. If the progeny has a ‘D’ in its respective cell, that particular
sequence is present in that individual. For information on the columns “Gene?”,
“ChromosCount”, and “Chromos”, please refer to items d), e), and g) above. “Marker” is the
unique identifier given to this dominant parent B GBS tag. “Tag” is the actual sequence or
GBS tag unique to parent B.
9. List_of_Names.txt: The header row of files such as CalledGenotypesWithFiltersMMDDYYY.txt
contains numbers representing specific members of the population. The numbers in
List_of_Names.txt correspond to these numbers, and the fastq file listed after each
number is the member of the population it represents. Use this file to identify which
individual each column belongs to.
10. MarkerNeighbors.txt: Contains the genotypes for the population corresponding to all the
markers (Codominant, Parent A Dominant, Parent B Dominant) listed in
Neighbors.txt/NeighborCosegregation.txt.
11. PresenceAbsenceCodominant.txt: This file describes the presence of codominant SNPs in
the progeny of the cross between parent A and parent B. If a particular cell has a 1, the
SNP is present in that particular individual. Any individual carrying both parents’ SNPs of
a particular SNP pair is considered to be heterozygous, and is represented by an ‘H’ in the
CalledGenotypesWithFilters.MMDDYYYY.txt.Nonrepetitive file.
12. ReadDepth.Nonrepetitive: This Excel file gives the read depth of each codominant marker
for every member of the population that was analyzed. The read depth represents the
number of times each marker was detected in each individual.
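The relationship between the presence/absence values in PresenceAbsenceCodominant.txt and the A, B, H, and U calls can be sketched as follows. This is a minimal illustration, not the pipeline's own code. It assumes that an individual with only one parental tag is called A or B, and that an individual with neither tag is treated as missing (U). Only the H case is stated explicitly in the text. The function names and the example values are hypothetical.

```python
from collections import Counter


def call_genotype(has_tag_a, has_tag_b):
    """Combine presence (1) / absence (0) of the two parental tags into one call."""
    if has_tag_a and has_tag_b:
        return "H"  # both parental tags detected: heterozygous
    if has_tag_a:
        return "A"  # assumption: only parent A's tag detected
    if has_tag_b:
        return "B"  # assumption: only parent B's tag detected
    return "U"      # assumption: neither tag detected is treated as missing


def summarize_calls(calls):
    """Count A/B/H/U calls for one marker and return the missing fraction."""
    counts = Counter(calls)
    missing = counts["U"] / len(calls) if calls else 0.0
    return {code: counts[code] for code in "ABHU"}, missing


# Made-up presence/absence values for one SNP pair across five progeny.
tag_a = [1, 0, 1, 0, 1]
tag_b = [0, 1, 1, 0, 0]
calls = [call_genotype(a, b) for a, b in zip(tag_a, tag_b)]
print(calls)                   # ['A', 'B', 'H', 'U', 'A']
print(summarize_calls(calls))  # ({'A': 2, 'B': 1, 'H': 1, 'U': 1}, 0.2)
```

A per-marker summary like this is one way to see how much of a marker's data would have to be imputed.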
LinkageGroups: This folder contains files related to the positions of many of the SNPs identified.
These files were generated using MSTmap and contain the position of the given sequences relative
to the many markers given to the program. /LinkageGroups/ may include two subdirectories,
Chromosomes/ and CosegregationClusters/. Chromosomes/ contains genetic map data for each
chromosome arm, based on markers homologous to the draft assembly of the Chinese Spring genome.
CosegregationClusters/ contains genetic map data for each chromosome arm, constructed from
clusters of markers based on co-segregation, regardless of chromosome assignment.
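One simple way to picture co-segregation between two markers is as the fraction of progeny in which both markers carry the same call. The sketch below uses that definition and compares the result with the 65% figure quoted for Neighbors.txt. The pipeline's actual calculation is not described in this document. Skipping progeny with a missing call (‘U’) is an assumption, and the function name and example calls are hypothetical.

```python
def cosegregation_fraction(calls_x, calls_y, missing="U"):
    """Fraction of progeny with matching calls, skipping pairs with missing data."""
    informative = [
        (x, y) for x, y in zip(calls_x, calls_y)
        if x != missing and y != missing
    ]
    if not informative:
        return 0.0
    matches = sum(1 for x, y in informative if x == y)
    return matches / len(informative)


# Made-up calls for two markers across eight progeny.
marker_1 = list("AABBHAUB")
marker_2 = list("AABAHAAB")
fraction = cosegregation_fraction(marker_1, marker_2)

# Six of the seven informative progeny agree, which is above 0.65.
print(round(fraction, 3), fraction >= 0.65)
```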
TagsAnalyzed: This folder has the unique set of sequences within each parent and each member of the
population. Each sequence is 75 base pairs long, or a user-specified length.
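The idea of a unique set of fixed-length tags can be sketched by truncating each read to the tag length and counting identical tags. This is an illustration under stated assumptions, not the pipeline's implementation. Skipping reads shorter than the tag length is an assumption, barcode and adapter handling is left out, and the function name is hypothetical. The example uses a tag length of 8 rather than 75 to keep the data short.

```python
from collections import Counter


def tag_counts(reads, tag_length=75):
    """Truncate each read to tag_length bases and count identical tags."""
    counts = Counter()
    for read in reads:
        if len(read) >= tag_length:  # assumption: shorter reads are skipped
            counts[read[:tag_length].upper()] += 1
    return counts


# Made-up reads for one individual.
reads = ["ACGTACGTAA", "ACGTACGTCC", "ACGTTTTTAA", "ACG"]
counts = tag_counts(reads, tag_length=8)
print(sorted(counts.items()))  # [('ACGTACGT', 2), ('ACGTTTTT', 1)]
```

The keys of the counter form the unique tag set for an individual, and the values are the tag frequencies.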
Association panel analysis results are similar to those described above. However, without
segregation data, linkage groups cannot be discovered. The tag frequency data can be used for
association studies such as genome-wide association studies.
Reference: GIO - Genes In Order - Linux scripts to convert raw NGS sequencing
data to genetic maps, Daniel Z. Skinner*, Vandhana Krishnan, Deven See
(unpublished).
USDA-ARS and Washington State University, Pullman, WA
Imputation Method: https://en.wikipedia.org/wiki/Random_forest