SNP gff3:
Col 1: chromosome ID
Col 2: source of the result (for SNP gff3, always "SoapSNP")
Col 3: type of item (for SNP gff3, always "SNP")
Col 4: start (SNP position)
Col 5: end (for a SNP, always the same as start)
Col 6: quality score, Phred-scaled
Col 7: strand (always "+" because of the calling method)
Col 8: phase of the SNP; only available if the SNP is in a coding region
Col 9: this field contains several sub-fields separated by spaces
ID: the unique ID of a SNP. "rs***" for SNPs in dbSNP, "NOM1_" for novel SNPs.
status: whether the SNP is known or novel. "dbSNP" for SNPs in the NCBI dbSNP dataset
and "novel" for those not found in dbSNP
ref: base of the NCBI reference at the site
allele: diploid alleles at this position
support1: number of reads supporting the first allele
support2: number of reads supporting the second allele
location: annotated region where the SNP is located
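As a rough illustration of the layout above, the sketch below splits one tab-delimited SNP line into its nine columns and converts the Phred-scaled quality into an error probability (P = 10^(-Q/10)). It is a minimal sketch under assumptions: the description does not pin down how the sub-fields of column 9 are written, so the code assumes key=value tokens separated by spaces and/or semicolons. The function names and the example line are made up for illustration and do not come from a real file.

```python
import re


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score Q to an error probability."""
    return 10 ** (-q / 10)


def parse_snp_line(line):
    """Split one tab-delimited SNP gff3 line into a dict.

    ASSUMPTION: column 9 holds key=value sub-fields separated by
    spaces and/or semicolons; adjust if the real files differ.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")

    attrs = {}
    for token in re.split(r"[;\s]+", cols[8].strip()):
        if "=" in token:
            key, value = token.split("=", 1)
            attrs[key] = value

    return {
        "chrom": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "quality": float(cols[5]),
        "strand": cols[6],
        "phase": cols[7],
        "attrs": attrs,
    }


# Invented example line (hypothetical values, for illustration only).
example = (
    "chr1\tSoapSNP\tSNP\t12345\t12345\t30\t+\t.\t"
    "ID=NOM1_1 status=novel ref=A support1=12 support2=9"
)
rec = parse_snp_line(example)
error_prob = phred_to_error_prob(rec["quality"])
```

For a SNP record, `rec["start"]` and `rec["end"]` should be equal, which makes a cheap sanity check when reading a file.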
Indel gff3:
Col 1: chromosome ID
Col 2: source of the result (soap)
Col 3: type of item (indel)
Col 4: start position of indel
Col 5: end position of indel.
For deletions, the start position is the position of the first base lost and the
end position is the first base AFTER the deletion. Therefore, end minus start gives
the length of the deletion. For insertions, start and end are the same: the first
base on the reference after the insertion event.
Col 6: number of reads supporting the indel
Col 7: strand information (“+”)
Col 8: phase of indels, only available if the indel is in the coding region
Col 9: this field contains several sub-fields separated by spaces
ID: indel ID. “rs*” for those found in dbSNP and “NOM1_*” for novel ones.
Status: if the status is a number, the indel is found in dbSNP. The number gives
the offset of the indel's coordinate relative to the dbSNP record. For example,
status = 0 means the indel is found at the exact coordinate reported in dbSNP;
status = -2 means the indel is found 2 bp upstream of a dbSNP entry. The coordinate
may deviate from dbSNP because the method for determining the position is ambiguous.
Type: negative numbers for deletions and positive numbers for insertions. The absolute
value is the length of the indel.
location: annotated region where the indel is located
base: nucleotides of the insertion/deletion
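The coordinate, Type and Status conventions above can be cross-checked with a few lines of code. The following is a minimal sketch, not part of the original pipeline: the function and argument names are hypothetical, and it assumes the Type and Status values have already been extracted from column 9. What a non-numeric status looks like is not specified above, so the sketch simply reports no dbSNP offset in that case.

```python
def describe_indel(start, end, indel_type, status):
    """Interpret indel coordinates, Type and Status as described above.

    All names here are illustrative; they are not guaranteed field names.
    """
    if indel_type < 0:
        kind = "deletion"
        # End is the first base after the deletion, so end - start
        # should equal the deleted length.
        coords_consistent = (end - start) == abs(indel_type)
    elif indel_type > 0:
        kind = "insertion"
        # Start and end are the same position for insertions.
        coords_consistent = start == end
    else:
        raise ValueError("Type must be a non-zero integer")

    # A numeric status means the indel matches dbSNP at that offset.
    try:
        dbsnp_offset = int(status)
    except ValueError:
        dbsnp_offset = None  # ASSUMPTION: non-numeric means no dbSNP match

    return {
        "kind": kind,
        "length": abs(indel_type),
        "coords_consistent": coords_consistent,
        "dbsnp_offset": dbsnp_offset,
    }


# Hypothetical values: a 3 bp deletion at the exact dbSNP coordinate,
# and a 2 bp insertion found 2 bp upstream of a dbSNP entry.
deletion = describe_indel(100, 103, -3, "0")
insertion = describe_indel(200, 200, 2, "-2")
```

A record whose `coords_consistent` flag is false would be worth inspecting by hand, since it contradicts the convention described for columns 4 and 5.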
CNV gff3:
Col 1: chromosome ID
Col 2: source of the result
Col 3: type of item
Col 4: start position of CNV
Col 5: end position of CNV
Col 6: quality; for CNV, this is marked as “*”
Col 7: strand (always “+”)
Col 8: phase (not available in CNV)
Col 9: tags
ID: the ID of CNV
Type: type of CNV. “DupCNV” for duplications (extra copy number) and “DelCNV”
for deletions (reduced copy number).
DGV-variation: if the CNV exists in the DGV variation database, the overlapping
DGV entry is reported with its DGV info; otherwise, the CNV is reported as novel.
DGV-indel: if the CNV exists in the DGV indel database, the overlapping
DGV entry is reported with its DGV info; otherwise, the CNV is reported as novel.
mRNA: genes that fall entirely or partially within the CNV region. All genes
overlapping the CNV are enclosed in a pair of quotation marks. Each gene is reported as
“contained”, which means the whole gene is in the CNV element, or “broken”, which means
only part of the gene is in the region. The genes are generally reported like
“[Contained/Broken]: [refGene accession ID]: [full name of the gene];”
ncRNA: non-coding RNA. The quoted part is itself a gff annotation of the
ncRNA. It will look like this: “[Contained/Broken]: [Annotation of ncRNA in gff format]”.
Transposons: a sub-annotation in gff3 format will be quoted. The fields Julie
mentioned in a previous mail are defined by RepeatMasker: “div” is the % divergence
from the repeat consensus used in RepeatMasker, “ins” is the % inserted and “del” is
the % deleted.
Tandem: a sub-annotation in gff3 format will be quoted. The annotations are similar to
those for transposons.
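To make the mRNA tag easier to work with, the sketch below splits a quoted value of the form “[Contained/Broken]: [refGene accession ID]: [full name of the gene];” into one tuple per gene. This is only an illustration under assumptions: it assumes each entry ends with ";" and that the first two ":" characters separate the three parts, which would break if an accession ID contained ":" or a gene name contained ";". The function name and the example string (with placeholder accessions ACC1 and ACC2) are hypothetical.

```python
def parse_mrna_value(value):
    """Parse a quoted mRNA tag value into (state, accession, name) tuples.

    ASSUMPTION: entries are terminated by ';' and each entry has the
    form 'Contained|Broken: accession: full gene name'.
    """
    entries = []
    # Drop surrounding straight or curly quotation marks, then split entries.
    for chunk in value.strip().strip('"“”').split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = [p.strip() for p in chunk.split(":", 2)]
        if len(parts) != 3:
            raise ValueError(f"unexpected mRNA entry: {chunk!r}")
        state, accession, name = parts
        entries.append((state.lower(), accession, name))
    return entries


# Placeholder accessions and gene names, invented for illustration.
example = '"Contained: ACC1: example gene one; Broken: ACC2: example gene two;"'
genes = parse_mrna_value(example)
broken = [acc for state, acc, _ in genes if state == "broken"]
```

Filtering on the first tuple element, as in the last line, gives the genes that are only partially covered by the CNV.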